
DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.

Contents

1 Introduction
2 Architecture
  2.1 Basic Architecture
    2.1.1 Multi-Head Latent Attention
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  2.2 Multi-Token Prediction
3 Infrastructures
  3.1 Compute Clusters
  3.2 Training Framework
    3.2.1 DualPipe and Computation-Communication Overlap
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
    3.2.3 Extremely Memory Saving with Minimal Overhead
  3.3 FP8 Training
    3.3.1 Mixed Precision Framework
    3.3.2 Improved Precision from Quantization and Multiplication
    3.3.3 Low-Precision Storage and Communication
  3.4 Inference and Deployment
    3.4.1 Prefilling
    3.4.2 Decoding
  3.5 Suggestions on Hardware Design
    3.5.1 Communication Hardware
    3.5.2 Compute Hardware
4 Pre-Training
  4.1 Data Construction
  4.2 Hyper-Parameters
  4.3 Long Context Extension
  4.4 Evaluations
    4.4.1 Evaluation Benchmarks
    4.4.2 Evaluation Results
  4.5 Discussion
    4.5.1 Ablation Studies for Multi-Token Prediction
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
    4.5.3 Batch-Wise Load Balance vs. Sequence-Wise Load Balance
5 Post-Training
  5.1 Supervised Fine-Tuning
  5.2 Reinforcement Learning
    5.2.1 Reward Model
    5.2.2 Group Relative Policy Optimization
  5.3 Evaluations
    5.3.1 Evaluation Settings
    5.3.2 Standard Evaluation
    5.3.3 Open-Ended Evaluation
    5.3.4 DeepSeek-V3 as a Generative Reward Model
  5.4 Discussion
    5.4.1 Distillation from DeepSeek-R1
    5.4.2 Self-Rewarding
    5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
  B.1 FP8 vs. BF16 Training
  B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.
In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
| Training Costs | Pre-Training | Context Extension | Post-Training | Total |
| :--- | :---: | :---: | :---: | :---: |
| in H800 GPU Hours | 2664K | 119K | 5K | 2788K |
| in USD | $5.328M | $0.238M | $0.01M | $5.576M |

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of the H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
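For readers who want to verify these figures, the arithmetic behind Table 1 is straightforward. The snippet below is a minimal sketch that reproduces the totals from the per-trillion-token rate quoted above; the variable names are ours, and the $2 per GPU hour rental price is the assumption stated in the table caption.

```python
# Reproduce the cost figures in Table 1 from the per-trillion-token rate quoted
# above. The $2/GPU-hour rental price is the assumption stated in the caption.
gpu_hours_per_trillion_tokens = 180_000   # H800 GPU hours per 1T pre-training tokens
pretraining_tokens_trillions = 14.8
cluster_gpus = 2048
rental_price_per_gpu_hour = 2.0           # USD

pretrain_gpu_hours = gpu_hours_per_trillion_tokens * pretraining_tokens_trillions
total_gpu_hours = pretrain_gpu_hours + 119_000 + 5_000    # + context extension + post-training
days_per_trillion_tokens = gpu_hours_per_trillion_tokens / cluster_gpus / 24
total_cost_usd = total_gpu_hours * rental_price_per_gpu_hour

print(f"Pre-training: {pretrain_gpu_hours / 1e3:.0f}K GPU hours")            # ~2664K
print(f"Days per 1T tokens on 2048 GPUs: {days_per_trillion_tokens:.1f}")    # ~3.7
print(f"Total: {total_gpu_hours / 1e6:.3f}M GPU hours, ${total_cost_usd / 1e6:.3f}M")  # 2.788M, $5.576M
```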
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

  • On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
  • We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency

  • We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
  • Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
  • At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1

  • We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Summary of Core Evaluation Results

  • Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
  • Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

2. Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.


2.1.1. Multi-Head Latent Attention

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:
$$
\begin{aligned}
\boxed{\mathbf{c}_{t}^{KV}} &= W^{DKV} \mathbf{h}_{t}, \\
[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_{t}^{C} &= W^{UK} \mathbf{c}_{t}^{KV}, \\
\boxed{\mathbf{k}_{t}^{R}} &= \operatorname{RoPE}\!\left(W^{KR} \mathbf{h}_{t}\right), \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C}; \mathbf{k}_{t}^{R}], \\
[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_{t}^{C} &= W^{UV} \mathbf{c}_{t}^{KV},
\end{aligned}
$$
where $\mathbf{c}_{t}^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c\,(\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot\,;\cdot]$ denotes concatenation. Note that for MLA, only the boxed vectors (i.e., $\mathbf{c}_{t}^{KV}$ and $\mathbf{k}_{t}^{R}$) need to be cached during generation, which results in a significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).
For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:
$$
\begin{aligned}
\mathbf{c}_{t}^{Q} &= W^{DQ} \mathbf{h}_{t}, \\
[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \ldots; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_{t}^{C} &= W^{UQ} \mathbf{c}_{t}^{Q}, \\
[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_{t}^{R} &= \operatorname{RoPE}\!\left(W^{QR} \mathbf{c}_{t}^{Q}\right), \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],
\end{aligned}
$$
where $\mathbf{c}_{t}^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'\,(\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.
Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_t$:
$$
\begin{aligned}
\mathbf{o}_{t,i} &= \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right)\mathbf{v}_{j,i}^{C}, \\
\mathbf{u}_{t} &= W^{O}[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],
\end{aligned}
$$
where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
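To make the caching argument concrete, the following is a minimal NumPy sketch of the key/value path of MLA for a single token, using toy dimensions and random weights rather than DeepSeek-V3's real shapes, and omitting the RoPE rotation. It only illustrates which tensors need to be cached ($\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$) and how per-head keys and values are reconstructed from the cached latent.

```python
import numpy as np

# Minimal sketch of MLA's key/value path for a single token, using toy dimensions
# and random weights (not DeepSeek-V3's real shapes) and omitting the RoPE rotation.
d, n_h, d_h, d_c, d_hR = 64, 4, 16, 8, 4   # embedding dim, heads, head dim, KV latent dim, RoPE dim

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) * 0.02         # down-projection shared by keys and values
W_UK = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # up-projection for keys
W_UV = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # up-projection for values
W_KR = rng.standard_normal((d_hR, d)) * 0.02         # projection for the decoupled RoPE key

def compress_kv(h_t):
    """Per token, only c_kv (d_c dims) and k_rope (d_hR dims) go into the KV cache."""
    c_kv = W_DKV @ h_t
    k_rope = W_KR @ h_t      # in the real model this vector is rotated by RoPE
    return c_kv, k_rope

def expand_kv(c_kv):
    """At attention time, per-head keys and values are reconstructed from the cached latent."""
    k_c = (W_UK @ c_kv).reshape(n_h, d_h)
    v_c = (W_UV @ c_kv).reshape(n_h, d_h)
    return k_c, v_c

h_t = rng.standard_normal(d)
c_kv, k_rope = compress_kv(h_t)
k_c, v_c = expand_kv(c_kv)                       # shapes (n_h, d_h) each

mla_cache = c_kv.size + k_rope.size              # d_c + d_h^R floats per token
mha_cache = 2 * n_h * d_h                        # standard MHA caches full keys and values
print(k_c.shape, v_c.shape, mla_cache, "vs", mha_cache)   # (4, 16) (4, 16) 12 vs 128
```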

2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$$
\begin{aligned}
\mathbf{h}_{t}^{\prime} &= \mathbf{u}_{t} + \sum_{i=1}^{N_{s}} \operatorname{FFN}_{i}^{(s)}\!\left(\mathbf{u}_{t}\right) + \sum_{i=1}^{N_{r}} g_{i,t}\, \operatorname{FFN}_{i}^{(r)}\!\left(\mathbf{u}_{t}\right), \\
g_{i,t} &= \frac{g_{i,t}^{\prime}}{\sum_{j=1}^{N_{r}} g_{j,t}^{\prime}}, \\
g_{i,t}^{\prime} &= \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}\!\left(\{s_{j,t} \mid 1 \leqslant j \leqslant N_{r}\}, K_{r}\right), \\ 0, & \text{otherwise}, \end{cases} \\
s_{i,t} &= \operatorname{Sigmoid}\!\left(\mathbf{u}_{t}^{T} \mathbf{e}_{i}\right),
\end{aligned}
$$
where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_i^{(s)}(\cdot)$ and $\operatorname{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
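The gating above reduces to a few lines of array code. The sketch below is an illustrative NumPy version with toy expert counts, toy FFNs, and random weights (not DeepSeek-V3's actual configuration); the shared experts are applied unconditionally, while only the top-$K_r$ routed experts receive normalized gates.

```python
import numpy as np

# Illustrative NumPy version of the DeepSeekMoE gating above, with toy expert
# counts, toy FFNs, and random weights (not DeepSeek-V3's actual configuration).
d, N_s, N_r, K_r = 16, 1, 8, 2

rng = np.random.default_rng(1)
centroids = rng.standard_normal((N_r, d))        # e_i, one centroid per routed expert
shared_ffn = [lambda x, W=rng.standard_normal((d, d)) * 0.02: W @ x for _ in range(N_s)]
routed_ffn = [lambda x, W=rng.standard_normal((d, d)) * 0.02: W @ x for _ in range(N_r)]

def moe_ffn(u_t):
    s = 1.0 / (1.0 + np.exp(-centroids @ u_t))   # s_{i,t} = Sigmoid(u_t^T e_i)
    topk = np.argsort(s)[-K_r:]                  # indices of the K_r highest affinities
    g_prime = np.zeros(N_r)
    g_prime[topk] = s[topk]
    g = g_prime / g_prime.sum()                  # normalize among the selected experts only
    out = u_t.copy()                             # residual connection
    out += sum(ffn(u_t) for ffn in shared_ffn)   # shared experts are always active
    out += sum(g[i] * routed_ffn[i](u_t) for i in topk)   # gated routed experts
    return out, g

u_t = rng.standard_normal(d)
h_t_prime, gates = moe_ffn(u_t)
print(np.nonzero(gates)[0], gates[gates > 0])    # K_r selected experts and their gates (sum to 1)
```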
Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i,t}$ to determine the top-K routing:
$$
g_{i,t}^{\prime}= \begin{cases} s_{i,t}, & s_{i,t}+b_{i} \in \operatorname{Topk}\!\left(\{s_{j,t}+b_{j} \mid 1 \leqslant j \leqslant N_{r}\}, K_{r}\right), \\ 0, & \text{otherwise}. \end{cases}
$$
Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through the dynamic adjustment, DeepSeek-V3 keeps a balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
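A minimal sketch of this mechanism is shown below. The bias only influences which experts are selected, while the gate is still computed from the raw affinity, and the end-of-step update nudges each bias down or up by $\gamma$. The value of $\gamma$, the batch size, and the definition of "overloaded" as being above the mean per-expert load are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# Sketch of the auxiliary-loss-free balancing described above: the bias b_i only
# affects which experts are selected, while the gate still uses the raw s_{i,t}.
# gamma, the batch size, and the over/underload test against the mean load are
# illustrative assumptions, not the paper's exact settings.
N_r, K_r, gamma = 8, 2, 0.001
bias = np.zeros(N_r)                             # b_i, one bias per routed expert

def route(affinities):
    """affinities: (num_tokens, N_r) sigmoid scores s_{i,t} for one training step."""
    biased = affinities + bias                                  # s_{i,t} + b_i, used only for selection
    topk = np.argsort(biased, axis=1)[:, -K_r:]                 # top-K_r experts per token
    gates = np.zeros_like(affinities)
    rows = np.arange(affinities.shape[0])[:, None]
    gates[rows, topk] = affinities[rows, topk]                  # gate value from the original score
    gates /= gates.sum(axis=1, keepdims=True)
    return topk, gates

def update_bias(topk):
    """End-of-step adjustment: push b_i down for overloaded experts, up for underloaded ones."""
    global bias
    load = np.bincount(topk.ravel(), minlength=N_r)             # tokens routed to each expert
    bias -= gamma * (load > load.mean())                        # overloaded  -> decrease b_i
    bias += gamma * (load < load.mean())                        # underloaded -> increase b_i

rng = np.random.default_rng(2)
for _ in range(100):                                            # simulate a few training steps
    s = rng.random((256, N_r))
    topk, _ = route(s)
    update_bias(topk)
print(bias)                                                     # biases drift to even out the load
```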
Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:
$$
\begin{gathered}
\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_{r}} f_{i} P_{i}, \\
f_{i} = \frac{N_{r}}{K_{r} T} \sum_{t=1}^{T} \mathbb{1}\!\left(s_{i,t} \in \operatorname{Topk}\!\left(\{s_{j,t} \mid 1 \leqslant j \leqslant N_{r}\}, K_{r}\right)\right), \\
s_{i,t}^{\prime} = \frac{s_{i,t}}{\sum_{j=1}^{N_{r}} s_{j,t}}, \\
P_{i} = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}^{\prime},
\end{gathered}
$$
where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
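For reference, the loss can be computed directly from the per-token affinities of one sequence, as in the NumPy sketch below; $\alpha$ and all sizes are illustrative placeholders.

```python
import numpy as np

# Sketch of the sequence-wise balance loss L_Bal for a single sequence of T tokens,
# following the formulas above. alpha and all sizes are illustrative placeholders.
N_r, K_r, T, alpha = 8, 2, 32, 1e-4

rng = np.random.default_rng(3)
s = rng.random((T, N_r))                             # s_{i,t}: affinity of token t for expert i

topk = np.argsort(s, axis=1)[:, -K_r:]               # experts selected for each token
selected = np.zeros_like(s, dtype=bool)
selected[np.arange(T)[:, None], topk] = True

f = (N_r / (K_r * T)) * selected.sum(axis=0)         # f_i: how often expert i is selected
s_norm = s / s.sum(axis=1, keepdims=True)            # s'_{i,t}: affinities normalized per token
P = s_norm.mean(axis=0)                              # P_i: average normalized affinity
L_bal = alpha * np.sum(f * P)
print(L_bal)
```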

Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.
Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.
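The node selection rule can be sketched as follows for a single token; the node and expert counts are illustrative, and only the selection logic is shown.

```python
import numpy as np

# Sketch of node-limited routing for one token: score each node by the sum of the
# highest K_r/M affinities among its experts, keep the top-M nodes, then pick the
# top-K_r experts only from those nodes. All sizes here are illustrative.
num_nodes, experts_per_node, K_r, M = 8, 32, 8, 4
N_r = num_nodes * experts_per_node
per_node_top = K_r // M                              # number of scores summed per node

rng = np.random.default_rng(4)
s = rng.random(N_r)                                  # affinities for all routed experts
s_by_node = s.reshape(num_nodes, experts_per_node)

node_scores = np.sort(s_by_node, axis=1)[:, -per_node_top:].sum(axis=1)
chosen_nodes = np.argsort(node_scores)[-M:]          # the token is sent to at most M nodes

node_of_expert = np.arange(N_r) // experts_per_node
mask = np.where(np.isin(node_of_expert, chosen_nodes), 0.0, -np.inf)
topk_experts = np.argsort(s + mask)[-K_r:]           # top-K_r restricted to the chosen nodes
print(sorted(chosen_nodes), sorted(set(node_of_expert[topk_experts])))
```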
No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

2.2. Multi-Token Prediction

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts $D$ additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.
MTP Modules. To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth $\mathbf{h}_i^{k-1} \in \mathbb{R}^{d}$ and the embedding of the $(i+k)$-th token $\operatorname{Emb}(t_{i+k}) \in \mathbb{R}^{d}$ with the linear projection:
$$
\mathbf{h}_{i}^{\prime k} = M_{k}\left[\operatorname{RMSNorm}\!\left(\mathbf{h}_{i}^{k-1}\right); \operatorname{RMSNorm}\!\left(\operatorname{Emb}\left(t_{i+k}\right)\right)\right],
$$
where $[\cdot\,;\cdot]$ denotes concatenation. Especially, when $k=1$, $\mathbf{h}_i^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_i^{\prime k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth $\mathbf{h}_i^{k}$:
$$
\mathbf{h}_{1:T-k}^{k} = \operatorname{TRM}_{k}\!\left(\mathbf{h}_{1:T-k}^{\prime k}\right),
$$
where $T$ represents the input sequence length and $_{i:j}$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_i^{k}$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token $P_{i+1+k}^{k} \in \mathbb{R}^{V}$, where $V$ is the vocabulary size:
$$
P_{i+k+1}^{k} = \operatorname{OutHead}\!\left(\mathbf{h}_{i}^{k}\right).
$$
The output head $\operatorname{OutHead}(\cdot)$ linearly maps the representation to logits and subsequently applies the $\operatorname{Softmax}(\cdot)$ function to compute the prediction probabilities of the $k$-th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Leviathan et al., 2023; Xia et al., 2023), whereas we utilize MTP to improve training.
MTP Training Objective. For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\mathrm{MTP}}^{k}$:
$$
\mathcal{L}_{\mathrm{MTP}}^{k} = \operatorname{CrossEntropy}\!\left(P_{2+k:T+1}^{k},\, t_{2+k:T+1}\right) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_{i}^{k}\left[t_{i}\right],
$$
where $T$ denotes the input sequence length, $t_i$ denotes the ground-truth token at the $i$-th position, and $P_i^{k}[t_i]$ denotes the corresponding prediction probability of $t_i$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\mathrm{MTP}}$, which serves as an additional training objective for DeepSeek-V3:
$$
\mathcal{L}_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\mathrm{MTP}}^{k}.
$$
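The index bookkeeping of this objective is the only subtle part, so the sketch below spells it out with random stand-in probabilities in place of the $k$-th module's outputs; $T$, $V$, $D$, and $\lambda$ are illustrative values.

```python
import numpy as np

# Sketch of the MTP objective: one cross-entropy term per prediction depth k,
# averaged over D depths and scaled by lambda. The probabilities are random
# stand-ins for the k-th MTP module's softmax outputs; sizes are illustrative.
T, V, D, lam = 16, 100, 2, 0.3

rng = np.random.default_rng(5)
tokens = rng.integers(0, V, size=T + 1)              # t_1 .. t_{T+1} (1-indexed in the text)

def depth_k_loss(k):
    targets = tokens[1 + k:]                         # ground-truth tokens t_{2+k} .. t_{T+1}
    logits = rng.standard_normal((targets.size, V))  # stand-in for the depth-k predictions
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(targets.size), targets]).sum() / T

L_mtp = (lam / D) * sum(depth_k_loss(k) for k in range(1, D + 1))
print(L_mtp)
```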
MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further reduce the generation latency.

3. Infrastructures

3.1. Compute Clusters

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.

Figure 4 | Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes "backward for input", blue denotes "backward for weights", purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.

3.2. Training Framework

The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).
In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).

3.2.1. DualPipe and Computation-Communication Overlap

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.
The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specifically, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4 for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously, and a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.

Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.
| Method | Bubble | Parameter | Activation |
| :---: | :---: | :---: | :---: |
| 1F1B | $(PP-1)(F+B)$ | $1\times$ | $PP$ |
| ZB1P | $(PP-1)(F+B-2W)$ | $1\times$ | $PP$ |
| DualPipe (Ours) | $\left(\frac{PP}{2}-1\right)(F\&B+B-3W)$ | $2\times$ | $PP+1$ |

Table 2 | Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. $F$ denotes the execution time of a forward chunk, $B$ denotes the execution time of a full backward chunk, $W$ denotes the execution time of a "backward for weights" chunk, and $F\&B$ denotes the execution time of two mutually overlapped forward and backward chunks.
In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by $\frac{1}{PP}$ times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
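To get a feel for the formulas in Table 2, the short script below plugs in illustrative timings for $F$, $B$, $W$, and $PP$; the numbers are arbitrary, and the overlapped $F\&B$ time is approximated by $F+B$ as an upper bound.

```python
# Plug illustrative timings into the bubble formulas from Table 2. F, B, and W are
# in arbitrary time units; the overlapped F&B time is approximated by F + B here,
# which is an upper bound (in practice the overlap makes it smaller).
PP = 8
F, B, W = 1.0, 2.0, 1.0          # forward, full backward, backward-for-weights
FB = F + B                       # assumed stand-in for the overlapped F&B chunk

bubble_1f1b = (PP - 1) * (F + B)
bubble_zb1p = (PP - 1) * (F + B - 2 * W)
bubble_dualpipe = (PP / 2 - 1) * (FB + B - 3 * W)

print(f"1F1B:     {bubble_1f1b:.1f}")
print(f"ZB1P:     {bubble_zb1p:.1f}")
print(f"DualPipe: {bubble_dualpipe:.1f}")
```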

3.2.2. Efficient Implementation of Cross-Node All-to-All Communication

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to the specific GPUs that host its target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes $\times$ 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
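For illustration, the following Python sketch traces the two-hop path a single token takes under this scheme: one IB transfer per target node (to the GPU with the same in-node index), followed by NVLink forwarding to the GPUs that host the selected experts. The cluster shape and the expert placement below are illustrative assumptions; the production kernels are custom GPU code rather than host-side logic like this.

```python
# A minimal sketch (not the production kernel) of the IB -> NVLink two-hop dispatch
# described above. Cluster shape and expert placement are illustrative assumptions.
from collections import defaultdict

GPUS_PER_NODE = 8
MAX_TARGET_NODES = 4  # node-limited routing: each token may touch at most 4 nodes


def plan_dispatch(src_node: int, src_gpu: int, expert_gpus: list[tuple[int, int]]):
    """expert_gpus: (node, gpu-within-node) of each routed expert for one token."""
    by_node = defaultdict(list)
    for node, gpu in expert_gpus:
        by_node[node].append(gpu)
    assert len(by_node) <= MAX_TARGET_NODES, "routing must respect the node limit"

    plan = []
    for node, gpus in by_node.items():
        if node == src_node:
            # experts on the source node: NVLink only (no hop for the local GPU itself)
            plan += [("NVLink", (src_node, src_gpu), (node, g)) for g in gpus if g != src_gpu]
            continue
        # hop 1: one IB transfer to the GPU with the same in-node index on the target node
        relay = (node, src_gpu)
        plan.append(("IB", (src_node, src_gpu), relay))
        # hop 2: forward over NVLink to the expert-hosting GPUs on that node
        plan += [("NVLink", relay, (node, g)) for g in gpus if g != src_gpu]
    return plan


if __name__ == "__main__":
    # a token on (node 0, gpu 3) routed to one local expert and experts on two remote nodes
    for hop in plan_dispatch(0, 3, [(0, 5), (2, 1), (2, 6), (3, 3)]):
        print(hop)
```

Because IB is crossed at most once per target node, the per-token IB traffic is bounded by the node limit of 4, regardless of how many experts are selected on each node.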
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

3.2.3. Extremely Memory Saving with Minimal Overhead

In order to reduce the memory footprint during training, we employ the following techniques.
Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.
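As a concrete illustration of this recomputation, the sketch below wraps a generic RMSNorm in PyTorch's activation checkpointing so that its output is not stored for the backward pass; the RMSNorm here is a textbook implementation, not the model's actual module, and the MLA up-projection would be handled analogously.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Sketch: recompute RMSNorm in the backward pass instead of storing its output.
def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

x = torch.randn(4, 7168, requires_grad=True)
w = torch.ones(7168, requires_grad=True)

# use_reentrant=False: activations inside the checkpointed region are freed after the
# forward pass and recomputed during back-propagation
y = checkpoint(rms_norm, x, w, use_reentrant=False)
y.sum().backward()
```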
Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.
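A minimal sketch of this bookkeeping is given below: the shadow parameters live in CPU memory and are updated after each optimizer step. The decay constant and the class interface are illustrative assumptions, and the asynchronous device-to-host transfer machinery is omitted.

```python
import torch

# Sketch of a CPU-resident EMA of the model parameters (decay value is illustrative).
class CpuEMA:
    def __init__(self, params, decay: float = 0.999):
        self.decay = decay
        # shadow copies live in CPU memory (pinned memory in practice), not on the GPU
        self.shadow = [p.detach().to("cpu", copy=True) for p in params]

    @torch.no_grad()
    def update(self, params):
        # called once per training step, after the optimizer update; in practice the
        # device-to-host copy overlaps with the next training step
        for s, p in zip(self.shadow, params):
            s.mul_(self.decay).add_(p.detach().to("cpu", non_blocking=True),
                                    alpha=1 - self.decay)
```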
Shared Embedding and Output Head for Multi-Token Prediction. With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and the deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of the parameters and gradients of the shared embedding and output head between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

3.3. FP8 Training

Inspired by recent advances in low-precision training (Dettmers et al., 2022; Noune et al., 2022; Peng et al., 2023b), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Fishman et al., 2024; He et al.; Sun et al., 2024). Although significant progress has been made in inference quantization (Frantar et al., 2022; Xiao et al., 2023), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model

Figure 6 | The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.

pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with $1 \times N_C$ elements or block-wise grouping with $N_C \times N_C$ elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below $0.25\%$, a level well within the acceptable range of training randomness.

3.3.1. Mixed Precision Framework

Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6 .
Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.
Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While

Figure 7 | (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_C = 128$ elements MMA for the high-precision accumulation.

these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.

3.3.2. Improved Precision from Quantization and Multiplication

Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.
Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a $1 \times 128$ tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a $128 \times 128$ block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.
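The grouping can be sketched in a few lines of NumPy, as below. FP8 casting is approximated here by clipping to the E4M3 range (largest finite value 448); the function names and the small epsilon guard are ours, and real kernels would of course perform the actual cast and keep the scales in the layout the GEMM expects.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_activation_tilewise(x: np.ndarray, tile: int = 128):
    """Per-token, per-128-channel (1 x 128 tile) scaling. x has shape [tokens, channels]."""
    t, c = x.shape
    xt = x.reshape(t, c // tile, tile)
    amax = np.abs(xt).max(axis=-1, keepdims=True)        # one amax per 1 x 128 tile
    scale = np.maximum(amax, 1e-12) / E4M3_MAX
    q = np.clip(xt / scale, -E4M3_MAX, E4M3_MAX)          # stand-in for the FP8 cast
    return q.reshape(t, c), scale.squeeze(-1)


def quantize_weight_blockwise(w: np.ndarray, block: int = 128):
    """Per 128 x 128 block scaling. w has shape [in_channels, out_channels]."""
    i, o = w.shape
    wb = w.reshape(i // block, block, o // block, block)   # wb[g1, r, g2, c] = w[g1*128+r, g2*128+c]
    amax = np.abs(wb).max(axis=(1, 3), keepdims=True)      # one amax per 128 x 128 block
    scale = np.maximum(amax, 1e-12) / E4M3_MAX
    q = np.clip(wb / scale, -E4M3_MAX, E4M3_MAX)
    return q.reshape(i, o), scale.squeeze((1, 3))
```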
One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.

Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while NVIDIA's next-generation GPUs (Blackwell series) have announced support for microscaling formats with smaller quantization granularity in their Tensor Cores (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.
Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in an FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with $K = 4096$ for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly $2\%$. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.
In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7(b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of $N_C$ is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.
It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting $N_C = 128$ elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
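The following sketch mimics this accumulation pattern numerically: partial products are summed in a deliberately truncated precision for one interval of $N_C = 128$ elements, then promoted to a full-precision accumulator where the per-group scaling factors are applied as the dequantization step. The truncation helper and the simplification that each operand carries one scale per 128-wide slice of K are illustrative assumptions, not a model of the actual Tensor Core datapath.

```python
import numpy as np

N_C = 128  # promotion interval along the inner (K) dimension


def _round_to_bits(x: np.ndarray, mantissa_bits: int = 14):
    """Crude stand-in for the Tensor Core's limited-precision accumulator."""
    exp = np.floor(np.log2(np.maximum(np.abs(x), 1e-30)))
    step = 2.0 ** (exp - mantissa_bits)
    return np.round(x / step) * step


def gemm_with_promotion(a_q, b_q, a_scale, b_scale):
    """a_q: [M, K] FP8-like values, b_q: [K, N]; a_scale: [M, K // N_C], b_scale: [K // N_C]
    (one scale per 128-wide K-group for each operand, a simplification of the paper's tiles)."""
    M, K = a_q.shape
    _, N = b_q.shape
    out = np.zeros((M, N))                                 # full-precision accumulator on CUDA cores
    for g, k0 in enumerate(range(0, K, N_C)):
        partial = np.zeros((M, N))
        for k in range(k0, k0 + N_C):                      # limited-precision MMA accumulation
            partial = _round_to_bits(partial + np.outer(a_q[:, k], b_q[k, :]))
        # promotion: dequantize the group's partial sum and add it in full precision
        out += partial * a_scale[:, g:g + 1] * b_scale[g]
    return out
```

Folding the scales in at promotion time is exactly why the scheme needs per-group scaling along the inner dimension, which the standard FP8 GEMM does not expose.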
Mantissa over Exponents. In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.
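For reference, the usual OCP FP8 encodings trade range for resolution roughly as below (these numbers are standard properties of the formats, not results from our experiments): with per-tile scales absorbing the magnitude, the narrower range of E4M3 is rarely binding, while its extra mantissa bit halves the rounding step.

```python
# Back-of-the-envelope comparison of the two FP8 encodings discussed above.
FP8_FORMATS = {
    # name: (largest finite value, relative step size near 1.0 = 2^-mantissa_bits)
    "E4M3": (448.0, 2.0 ** -3),
    "E5M2": (57344.0, 2.0 ** -2),
}

for name, (max_val, rel_step) in FP8_FORMATS.items():
    print(f"{name}: max |x| ~ {max_val:>7.0f}, relative step near 1.0 ~ {rel_step}")
```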
Online Quantization. Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value. In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each $1 \times 128$ activation tile or $128 \times 128$ weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

3.3.3. Low-Precision Storage and Communication

In conjunction with our FP8 training framework, we further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.
Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.
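A minimal sketch of this precision split, using PyTorch tensors, is shown below. The class is hypothetical and omits distributed sharding; it only illustrates which states are held in BF16 (the moments) and which remain in FP32 (master weights and the gradients fed into the step).

```python
import torch

# Sketch only: AdamW with BF16 first/second moments, FP32 master weights and gradients.
class MixedPrecisionAdamW:
    def __init__(self, params_fp32, lr=2.2e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1):
        self.params = params_fp32                                        # FP32 master weights
        self.m = [torch.zeros_like(p, dtype=torch.bfloat16) for p in params_fp32]
        self.v = [torch.zeros_like(p, dtype=torch.bfloat16) for p in params_fp32]
        self.lr, self.betas, self.eps, self.wd = lr, betas, eps, weight_decay
        self.t = 0

    @torch.no_grad()
    def step(self, grads_fp32):
        self.t += 1
        b1, b2 = self.betas
        for p, g, m, v in zip(self.params, grads_fp32, self.m, self.v):
            # moments are computed in FP32 but stored back in BF16
            m.copy_((b1 * m.float() + (1 - b1) * g).to(torch.bfloat16))
            v.copy_((b2 * v.float() + (1 - b2) * g * g).to(torch.bfloat16))
            m_hat = m.float() / (1 - b1 ** self.t)
            v_hat = v.float() / (1 - b2 ** self.t)
            p.mul_(1 - self.lr * self.wd)                                # decoupled weight decay
            p.add_(-self.lr * m_hat / (v_hat.sqrt() + self.eps))
```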
Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from a $1 \times 128$ quantization tile to a $128 \times 1$ tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are round scaled, i.e., integral powers of 2.

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.
Low-Precision Communication. Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.
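One plausible realization of the integral-power-of-2 scaling factors mentioned above is sketched here: the scale is rounded up to the next power of two that still keeps the tile within the representable range, so that re-scaling between the $1 \times 128$ and $128 \times 1$ layouts only changes exponents and never rounds mantissas. The format maximum used in the example is E4M3's 448 and is purely illustrative.

```python
import math

def power_of_two_scale(amax: float, fmt_max: float) -> float:
    """Smallest power-of-two scale such that amax / scale still fits in the format
    (one plausible realization of the 'integral power of 2' scaling; not the actual kernel)."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / fmt_max))


# example: a tile with amax 3.7 quantized into a format whose largest value is 448
s = power_of_two_scale(3.7, 448.0)
print(s, 3.7 / s)   # the scaled amax is guaranteed to stay within the format's range
```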

3.4. Inference and Deployment

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

3.4.1. Prefilling

The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.
To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.
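The sketch below shows one toy heuristic in the spirit of this strategy: duplicate the most-loaded experts up to the redundancy budget and hand each GPU a single duplicate, heaviest first. It is not the deployed algorithm, which additionally rearranges the original experts within a node and respects the cross-node communication constraint.

```python
import heapq
from collections import Counter

NUM_GPUS = 32           # prefilling unit: 4 nodes x 8 GPUs
REDUNDANT_EXPERTS = 32  # one extra expert per GPU, as in the deployment above


def choose_redundant_experts(expert_load: Counter) -> list[int]:
    """Duplicate the most-loaded experts (toy heuristic, not the deployed algorithm)."""
    return [e for e, _ in expert_load.most_common(REDUNDANT_EXPERTS)]


def assign_duplicates(redundant: list[int], expert_load: Counter) -> dict[int, int]:
    """Greedily give each GPU one duplicate, heaviest duplicates to the lightest GPUs."""
    gpu_heap = [(0.0, g) for g in range(NUM_GPUS)]   # (assigned extra load, gpu id)
    heapq.heapify(gpu_heap)
    placement = {}
    for e in sorted(redundant, key=lambda e: -expert_load[e]):
        load, g = heapq.heappop(gpu_heap)
        placement[g] = e
        # assume the duplicate absorbs roughly half of that expert's traffic
        heapq.heappush(gpu_heap, (load + expert_load[e] / 2, g))
    return placement
```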
Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.
Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

3.4.2. Decoding

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.
Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.
Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine.

3.5. Suggestions on Hardware Design

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1. Communication Hardware

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely under-utilized.
Currently, the SMs primarily perform the following tasks for all-to-all communication:
  • Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.
  • Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
  • Executing reduce operations for all-to-all combine.
  • Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.
We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP (Graham et al., 2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.

3.5.2. Compute Hardware

Higher FP8 GEMM Accumulation Precision in Tensor Cores. In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8 $\times$ FP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.
Support for Tile- and Block-Wise Quantization. Current GPUs only support per-tensor quantization, lacking the native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the $N_C$ interval is reached, the partial results will be copied from Tensor Cores to CUDA cores, multiplied by the scaling factors, and added to FP32 registers on CUDA cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.
Support for Online Quantization. The current implementations struggle to effectively support online quantization, despite its effectiveness demonstrated in our research. In the existing process, we need to read 128 BF16 activation values (the output of the previous computation) from HBM (High Bandwidth Memory) for quantization, and the quantized FP8 values are then written back to HBM, only to be read again for MMA. To address this inefficiency, we recommend that future chips integrate FP8 cast and TMA (Tensor Memory Accelerator) access into a single fused operation, so quantization can be completed during the transfer of activations from global memory to shared memory, avoiding frequent memory reads and writes. We also recommend supporting a warp-level cast instruction for speedup, which further facilitates the better fusion of layer normalization and FP8 cast. Alternatively, a near-memory computing approach can be adopted, where compute logic is placed near the HBM. In this case, BF16 elements can be cast to FP8 directly as they are read from HBM into the GPU, reducing off-chip memory access by roughly $50\%$.
Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into $1 \times 128$ FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into $128 \times 1$ tiles, and stored in HBM. To reduce memory operations, we recommend future chips to enable direct transposed reads of matrices from shared memory before MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.

4. Pre-Training

4.1. Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.
In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:
$<|\text{fim\_begin}|>\, f_{\text{pre}} \,<|\text{fim\_hole}|>\, f_{\text{suf}} \,<|\text{fim\_end}|>\, f_{\text{middle}} \,<|\text{eos\_token}|>.$
This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.
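A minimal sketch of assembling one such sample is given below. The sentinel strings follow the structure above; the uniform choice of split points and the helper names are illustrative assumptions.

```python
import random

FIM_RATE = 0.1  # fraction of documents restructured with FIM, as stated above


def fim_psm(doc: str, i: int, j: int) -> str:
    """Prefix-Suffix-Middle ordering with the sentinel tokens shown above."""
    pre, middle, suf = doc[:i], doc[i:j], doc[j:]
    return ("<|fim_begin|>" + pre + "<|fim_hole|>" + suf +
            "<|fim_end|>" + middle + "<|eos_token|>")


def pack_document(doc: str, rng: random.Random) -> str:
    # apply FIM to roughly 10% of documents; split points chosen uniformly for illustration
    if rng.random() < FIM_RATE and len(doc) >= 3:
        i, j = sorted(rng.sample(range(1, len(doc)), 2))
        return fim_psm(doc, i, j)
    return doc + "<|eos_token|>"


# deterministic illustration of the PSM layout itself
print(fim_psm("def add(a, b):\n    return a + b\n", 8, 20))
```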
The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.

4.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$ to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension $d_c'$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_h^R$ to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth $D$ is set to 1, i.e., besides the exact next token, each token will predict one additional token. As DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.
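For convenience, the architecture hyper-parameters above are collected into a single plain-Python configuration below; the field names are ours and do not correspond to any actual configuration file.

```python
# Architecture hyper-parameters as stated above (field names are illustrative).
DEEPSEEK_V3_CONFIG = dict(
    num_layers=61,
    hidden_dim=7168,
    init_std=0.006,
    # MLA
    num_heads=128,
    head_dim=128,
    kv_compression_dim=512,          # d_c
    query_compression_dim=1536,      # d_c'
    rope_head_dim=64,                # d_h^R for the decoupled queries and key
    # MoE (all FFNs except the first three layers)
    num_dense_layers=3,
    num_shared_experts=1,
    num_routed_experts=256,
    expert_hidden_dim=2048,
    experts_per_token=8,
    max_nodes_per_token=4,
    # multi-token prediction
    mtp_depth=1,
    # totals
    total_params="671B",
    activated_params_per_token="37B",
)
```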
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight_decay $= 0.1$. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to $2.2 \times 10^{-4}$ during the first 2K steps. Then, we keep a constant learning rate of $2.2 \times 10^{-4}$ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to $2.2 \times 10^{-5}$ in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of $2.2 \times 10^{-5}$ in the first 333B tokens, and switch to another constant learning rate of $7.3 \times 10^{-6}$ in the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts will be uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., $M = 4$). For auxiliary-loss-free load balancing, we set the bias update speed $\gamma$ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set $\alpha$ to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight $\lambda$ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.
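Read as a function of consumed tokens, the schedule above can be sketched as follows. The linear shape of the batch-size ramp and the treatment of the warmup in steps while the remaining phases are tracked in tokens are simplifying assumptions for illustration.

```python
import math

def learning_rate(tokens_seen: float, step: int) -> float:
    """Piecewise schedule described above (tokens_seen in trillions of tokens)."""
    peak, low, final = 2.2e-4, 2.2e-5, 7.3e-6
    if step <= 2_000:                        # linear warmup over the first 2K steps
        return peak * step / 2_000
    if tokens_seen <= 10.0:                  # constant until 10T tokens
        return peak
    if tokens_seen <= 14.3:                  # cosine decay to 2.2e-5 over 4.3T tokens
        progress = (tokens_seen - 10.0) / 4.3
        return low + 0.5 * (peak - low) * (1 + math.cos(math.pi * progress))
    if tokens_seen <= 14.3 + 0.333:          # first 333B of the final 500B tokens
        return low
    return final                             # remaining 167B tokens

def batch_size(tokens_seen: float) -> int:
    """Batch-size ramp described above: 3072 -> 15360 over the first 469B tokens
    (a linear ramp is assumed here)."""
    if tokens_seen >= 0.469:
        return 15360
    return int(3072 + (15360 - 3072) * tokens_seen / 0.469)
```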

Figure 8 | Evaluation results on the “Needle In A Haystack” (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

4.3. Long Context Extension

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key $\mathbf{k}_t^R$. The hyper-parameters remain identical across both phases, with the scale $s = 40$, $\alpha = 1$, $\beta = 32$, and the scaling factor $\sqrt{t} = 0.1 \ln s + 1$. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to $7.3 \times 10^{-6}$, matching the final learning rate from the pre-training stage.
Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the “Needle In A Haystack” (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

4.4. Evaluations

4.4.1. Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:
Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019).
Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).
Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).
Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.
Following our previous work (DeepSeek-AI, 2024b,c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
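Bits-Per-Byte normalizes the total negative log-likelihood by the UTF-8 byte length of the evaluated text rather than by the token count, which is what makes models with different tokenizers directly comparable. A minimal sketch under this standard definition (assumed to match the usage here):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """BPB = total negative log-likelihood (converted to bits) / number of UTF-8 bytes."""
    total_bits = total_nll_nats / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# example: a model assigns an average loss of 1.2 nats/token to a 1000-token,
# 4200-byte passage (numbers are illustrative only)
print(bits_per_byte(1.2 * 1000, "x" * 4200))
```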
| | Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| | Architecture | - | MoE | Dense | Dense | MoE |
| | # Activated Params | - | 21B | 72B | 405B | 37B |
| | # Total Params | - | 236B | 72B | 405B | 671B |
| English | Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| | MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| | MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
| | MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
| | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
| | ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
| | ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
| | HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
| | PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7 |
| | WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
| | RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1 |
| | RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
| | TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
| | NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
| | AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
| | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4 |
| | CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
| | CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3 |
| | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
| | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
| | CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| | C-Eval (EM) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1 |
| | CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
| | CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
| | C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| | CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| Multilingual | MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |
Table 3 | Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.

4.4.2. Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base in the majority of benchmarks, essentially becoming the strongest open-source model.
From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM.
从更详细的角度来看,我们将 DeepSeek-V3-Base 与其他开源基础模型逐一进行比较。(1) 与 DeepSeek-V2-Base 相比,由于我们模型架构的改进、模型规模和训练标记的扩大以及数据质量的提升,DeepSeek-V3-Base 的性能显著优于预期。(2) 与 Qwen 2.5 72B Base 相比,这一最先进的中文开源模型,DeepSeek-V3-Base 在激活参数仅为其一半的情况下,仍展现出显著优势,特别是在英语、多语言、代码和数学基准测试上。至于中文基准测试,除了 CMMLU 这一中文多学科多项选择任务外,DeepSeek-V3-Base 的表现也优于 Qwen2.5 72B。(3) 与 LLaMA-3.1 405B Base 相比,这一最大的开源模型激活参数是其 11 倍,DeepSeek-V3-Base 在多语言、代码和数学基准测试上也表现得更好。至于英语和中文语言基准测试,DeepSeek-V3-Base 表现出竞争力或更好的性能,尤其在 BBH、MMLU 系列、DROP、C-Eval、CMMLU 和 CCPM 上表现突出。
Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.
由于我们高效的架构和全面的工程优化,DeepSeek-V3 实现了极高的训练效率。在我们的训练框架和基础设施下,训练 DeepSeek-V3 每万亿个标记仅需 180K H800 GPU 小时,这比训练 72B 或 405B 稠密模型便宜得多。
| Benchmark (Metric) | # Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP |
|:---|:---:|:---:|:---:|:---:|:---:|
| # Activated Params (Inference) | - | 2.4B | 2.4B | 20.9B | 20.9B |
| # Total Params (Inference) | - | 15.7B | 15.7B | 228.7B | 228.7B |
| # Training Tokens | - | 1.33T | 1.33T | 540B | 540B |
| Pile-test (BPB) | - | 0.729 | 0.729 | 0.658 | 0.657 |
| BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7 |
| MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6 |
| DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6 |
| TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3 |
| NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5 |
| HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7 |
| MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2 |
| GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0 |
| MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8 |
Table 4 | Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
表 4 | MTP 策略的消融结果。MTP 策略在大多数评估基准上持续提升模型性能。

4.5. Discussion  4.5. 讨论

4.5.1. Ablation Studies for Multi-Token Prediction
4.5.1. 多标记预测的消融研究

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.
在表 4 中,我们展示了 MTP 策略的消融结果。具体来说,我们在两个基线模型上验证 MTP 策略,涵盖不同的规模。在小规模下,我们训练了一个基线 MoE 模型,总参数为 15.7B,使用 1.33T 的 tokens。在大规模下,我们训练了一个基线 MoE 模型,总参数为 228.7B,使用 540B 的 tokens。在此基础上,保持训练数据和其他架构不变,我们在它们上面附加了一个 1 深度的 MTP 模块,并训练了两个使用 MTP 策略的模型进行比较。请注意,在推理过程中,我们直接丢弃 MTP 模块,因此被比较模型的推理成本完全相同。从表中可以观察到,MTP 策略在大多数评估基准上始终增强了模型性能。
| Benchmark (Metric) | # Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free |
|:---|:---:|:---:|:---:|:---:|:---:|
| # Activated Params | - | 2.4B | 2.4B | 20.9B | 20.9B |
| # Total Params | - | 15.7B | 15.7B | 228.7B | 228.7B |
| # Training Tokens | - | 1.33T | 1.33T | 578B | 578B |
| Pile-test (BPB) | - | 0.727 | 0.724 | 0.656 | 0.652 |
| BBH (EM) | 3-shot | 37.3 | 39.3 | 66.7 | 67.9 |
| MMLU (EM) | 5-shot | 38.0 | 51.8 | 68.3 | |
| DROP (F1) | 1-shot | 38.3 | 59.0 | 67.2 | |
| TriviaQA (EM) | 5-shot | 23.2 | 23.4 | 67.1 | 67.1 |
| NaturalQuestions (EM) | 5-shot | 22.0 | 22.6 | 27.1 | 67.7 |
| HumanEval (Pass@1) | 0-shot | 36.6 | 35.8 | 28.2 | 46.1 |
| MBPP (Pass@1) | 3-shot | 27.1 | 29.6 | 59.2 | 40.7 |
| GSM8K (EM) | 8-shot | 10.9 | 11.1 | 37.2 | 61.2 |
| MATH (EM) | 4-shot | | | 74.5 | |
Table 5 | Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
表 5 | 无辅助损失平衡策略的消融结果。与纯粹基于辅助损失的方法相比,无辅助损失策略在大多数评估基准上始终实现了更好的模型性能。

4.5.2. Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
4.5.2. 辅助损失自由平衡策略的消融研究

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens. Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
在表 5 中,我们展示了无辅助损失平衡策略的消融结果。我们在两个基线模型上验证了这一策略,涵盖不同的规模。在小规模上,我们训练了一个基线 MoE 模型,包含 15.7B 总参数,使用 1.33T 标记。在大规模上,我们训练了一个基线 MoE 模型,包含 228.7B 总参数,使用 578B 标记。这两个基线模型纯粹使用辅助损失来鼓励负载平衡,并使用带有 top-K 亲和度归一化的 sigmoid 门控函数。它们控制辅助损失强度的超参数与 DeepSeek-V2-Lite 和 DeepSeek-V2 相同。在这两个基线模型的基础上,保持训练数据和其他架构不变,我们去除了所有辅助损失,并引入了无辅助损失平衡策略进行比较。从表中可以观察到,无辅助损失策略在大多数评估基准上始终实现了更好的模型性能。
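For illustration, a minimal sketch of the sigmoid gating with top-K affinity normalization mentioned above is given below; the tensor shapes, function name, and the omission of the bias-based balancing term are simplifying assumptions rather than the exact DeepSeek implementation.

```python
import torch

def route_tokens(token_hidden, expert_centroids, top_k):
    """Sigmoid gating with top-K selection and affinity normalization (sketch).

    token_hidden:     [num_tokens, hidden_dim]
    expert_centroids: [num_experts, hidden_dim]
    Returns (topk_idx, gating_weights), both of shape [num_tokens, top_k].
    """
    # Token-to-expert affinity scores via a sigmoid of the dot product.
    scores = torch.sigmoid(token_hidden @ expert_centroids.T)      # [T, E]
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)             # [T, K]
    # Normalize the selected affinities so the gating values sum to 1 per token.
    gating = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, gating
```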

4.5.3. Batch-Wise Load Balance VS. Sequence-Wise Load Balance
4.5.3. 批量负载均衡 VS. 序列负载均衡

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
辅助损失自由平衡与序列级辅助损失之间的关键区别在于它们的平衡范围:批次级与序列级。与序列级辅助损失相比,批次级平衡施加了更灵活的约束,因为它不对每个序列强制执行领域内平衡。这种灵活性使得专家能够更好地专注于不同的领域。为了验证这一点,我们记录并分析了在 Pile 测试集上,基于 16B 辅助损失的基线模型和 16B 辅助损失自由模型在不同领域的专家负载。如图 9 所示,我们观察到辅助损失自由模型表现出更强的专家专业化模式,正如预期的那样。
To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss).
为了进一步研究这种灵活性与模型性能优势之间的相关性,我们额外设计并验证了一种批次辅助损失,鼓励每个训练批次的负载平衡,而不是每个序列的负载平衡。实验结果表明,当实现类似水平的批次负载平衡时,批次辅助损失也可以实现与无辅助损失方法相似的模型性能。具体来说,在我们对 1B MoE 模型的实验中,验证损失为:2.258(使用序列辅助损失)、2.253(使用无辅助损失方法)和 2.253(使用批次辅助损失)。

Figure 9 | Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.
图 9 | 在 Pile 测试集的三个领域中,辅助无损和基于辅助损失模型的专家负载。辅助无损模型显示出比基于辅助损失模型更大的专家专业化模式。相对专家负载表示实际专家负载与理论平衡专家负载之间的比率。由于空间限制,我们仅以两个层的结果作为示例,所有层的结果见附录 C。

We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.
我们还观察到在 3B MoE 模型上有类似的结果:使用序列辅助损失的模型实现了 2.085 的验证损失,而使用无辅助损失方法或批量辅助损失的模型实现了相同的验证损失 2.080。
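To make the difference in balancing scope concrete, the sketch below contrasts a sequence-wise and a batch-wise variant of a generic Switch-style auxiliary balance loss; the function signature, the coefficient `alpha`, and the normalization are illustrative assumptions, not the exact losses used in DeepSeek-V2 or DeepSeek-V3.

```python
import torch

def balance_loss(router_probs, expert_mask, alpha=0.003, scope="sequence"):
    """Auxiliary load-balance loss with a sequence-wise or batch-wise scope (sketch).

    router_probs: [batch, seq, experts] routing probabilities per token
    expert_mask:  [batch, seq, experts] 1.0 where an expert is selected for a token
    """
    num_experts = router_probs.size(-1)
    if scope == "sequence":
        # Per-sequence statistics, averaged over the batch: every sequence is
        # pushed towards in-domain balance, which is the stricter constraint.
        frac_selected = expert_mask.mean(dim=1)        # [batch, experts]
        mean_probs = router_probs.mean(dim=1)          # [batch, experts]
        per_seq = num_experts * (frac_selected * mean_probs).sum(dim=-1)
        return alpha * per_seq.mean()
    # Batch-wise scope: pool every token in the batch, so individual sequences
    # may stay imbalanced as long as the batch as a whole is balanced.
    frac_selected = expert_mask.mean(dim=(0, 1))       # [experts]
    mean_probs = router_probs.mean(dim=(0, 1))         # [experts]
    return alpha * num_experts * (frac_selected * mean_probs).sum()
```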
In addition, although the batch-wise load balancing methods show consistent performance advantages, they also face two potential challenges in efficiency: (1) load imbalance within certain sequences or small batches, and (2) domain-shift-induced load imbalance during inference. The first challenge is naturally addressed by our training framework that uses large-scale expert parallelism and data parallelism, which guarantees a large size of each micro-batch. For the second challenge, we also design and implement an efficient inference framework with redundant expert deployment, as described in Section 3.4, to overcome it.
此外,尽管按批次负载均衡方法显示出一致的性能优势,但它们在效率上也面临两个潜在挑战:(1)某些序列或小批次内的负载不平衡,以及(2)推理过程中由领域转移引起的负载不平衡。第一个挑战自然通过我们的训练框架得到解决,该框架使用大规模专家并行和数据并行,确保每个微批次的大小较大。对于第二个挑战,我们还设计并实现了一个高效的推理框架,采用冗余专家部署,如第 3.4 节所述,以克服这一挑战。

5. Post-Training  5. 后训练

5.1. Supervised Fine-Tuning
5.1. 监督微调

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.
我们精心策划我们的指令调优数据集,包括 150 万实例,涵盖多个领域,每个领域采用不同的数据创建方法,以满足其特定需求。
Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.
推理数据。对于与推理相关的数据集,包括那些专注于数学、代码竞赛问题和逻辑难题的数据集,我们通过利用内部的 DeepSeek-R1 模型生成数据。具体而言,虽然 R1 生成的数据表现出较强的准确性,但也存在过度思考、格式不佳和长度过长等问题。我们的目标是平衡 R1 生成的推理数据的高准确性与常规格式化推理数据的清晰性和简洁性。
To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>.
为了建立我们的方法论,我们首先开发一个针对特定领域(如代码、数学或一般推理)的专家模型,使用结合监督微调(SFT)和强化学习(RL)的训练流程。这个专家模型作为最终模型的数据生成器。训练过程涉及为每个实例生成两种不同类型的 SFT 样本:第一种将问题与其原始响应配对,格式为 <problem, original response>;而第二种则在问题和 R1 响应的基础上加入系统提示,格式为 <system prompt, problem, R1 response>。
The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.
系统提示经过精心设计,包含指导模型生成富有反思和验证机制的响应的指令。在强化学习阶段,模型利用高温采样生成响应,这些响应整合了来自 R1 生成的数据和原始数据的模式,即使在没有明确系统提示的情况下也是如此。经过数百个强化学习步骤,中间强化学习模型学会了整合 R1 模式,从而战略性地提升整体性能。
Upon completing the RL training phase, we implement rejection sampling to curate highquality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.
在完成强化学习训练阶段后,我们实施拒绝采样以策划高质量的 SFT 数据用于最终模型,其中专家模型被用作数据生成源。这种方法确保最终训练数据保留了 DeepSeek-R1 的优势,同时生成简洁有效的响应。
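A schematic view of the two SFT sample formats and the rejection-sampling step described above is sketched below; the message schema, field names, and filter criteria are hypothetical and merely stand in for the internal data pipeline.

```python
def build_sft_samples(problem, original_response, r1_response, system_prompt):
    """Return the two SFT sample formats: <problem, original response> and
    <system prompt, problem, R1 response>."""
    return [
        {"messages": [{"role": "user", "content": problem},
                      {"role": "assistant", "content": original_response}]},
        {"messages": [{"role": "system", "content": system_prompt},
                      {"role": "user", "content": problem},
                      {"role": "assistant", "content": r1_response}]},
    ]

def rejection_sample(candidates, is_correct, max_chars):
    """Keep only expert-model generations that pass a correctness check and a
    simple length filter, mimicking the curation of the final SFT data."""
    return [c for c in candidates if is_correct(c) and len(c) <= max_chars]
```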
Non-Reasoning Data. For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.
非推理数据。对于非推理数据,如创意写作、角色扮演和简单问答,我们利用 DeepSeek-V2.5 生成响应,并招募人工标注员验证数据的准确性和正确性。
SFT Settings. We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, with a cosine decay learning rate schedule that starts at $5 \times 10^{-6}$ and gradually decreases to $1 \times 10^{-6}$. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.
SFT 设置。我们使用 SFT 数据集对 DeepSeek-V3-Base 进行两轮微调,采用从 $5 \times 10^{-6}$ 逐渐降低到 $1 \times 10^{-6}$ 的余弦衰减学习率调度。在训练过程中,每个单独的序列由多个样本打包而成。然而,我们采用了一种样本掩蔽策略,以确保这些示例保持隔离且相互不可见。
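As a rough illustration of these settings, the sketch below implements a cosine decay from $5 \times 10^{-6}$ to $1 \times 10^{-6}$ and a block-diagonal attention mask that keeps packed samples mutually invisible; warmup, step counts, and the actual mask layout used in the training framework are not specified in this report and are assumptions here.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-6, lr_min=1e-6):
    """Cosine decay from lr_max to lr_min over the SFT run."""
    progress = step / max(total_steps, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def packed_attention_mask(sample_ids):
    """Causal, block-diagonal mask: position i may attend to position j only if
    j <= i and both positions come from the same packed sample.

    sample_ids: e.g. [0, 0, 0, 1, 1, 2] for three samples packed into one sequence.
    """
    n = len(sample_ids)
    return [[(sample_ids[i] == sample_ids[j]) and (j <= i) for j in range(n)]
            for i in range(n)]
```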

5.2. Reinforcement Learning
5.2. 强化学习

5.2.1. Reward Model  5.2.1. 奖励模型

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.
我们在我们的强化学习过程中采用了基于规则的奖励模型(RM)和基于模型的奖励模型(RM)。
Rule-Based RM. For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.
基于规则的 RM。对于可以使用特定规则进行验证的问题,我们采用基于规则的奖励系统来确定反馈。例如,某些数学问题具有确定的结果,我们要求模型在指定格式内(例如,在一个框中)提供最终答案,从而允许我们应用规则来验证正确性。同样,对于 LeetCode 问题,我们可以利用编译器根据测试用例生成反馈。通过尽可能利用基于规则的验证,我们确保更高的可靠性,因为这种方法抵抗操控或利用。
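A deliberately simplified sketch of such a rule-based check for math questions is shown below; it assumes the final answer is wrapped in \boxed{...} and compares it to the reference by exact string match, which is much cruder than a production verifier.

```python
import re

def boxed_answer(response: str):
    """Extract the last \\boxed{...} answer from a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_math_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 when the boxed answer matches the reference exactly, else 0.0."""
    answer = boxed_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0
```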
Model-Based RM. For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs.
基于模型的奖励模型。对于具有自由形式真实答案的问题,我们依赖奖励模型来判断响应是否与预期的真实答案匹配。相反,对于没有明确真实答案的问题,例如涉及创意写作的问题,奖励模型的任务是根据问题和相应的答案提供反馈。

The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.
奖励模型是从 DeepSeek-V3 SFT 检查点训练而来的。为了增强其可靠性,我们构建了偏好数据,不仅提供最终奖励,还包括导致奖励的思维链。这种方法有助于降低特定任务中奖励黑客攻击的风险。

5.2.2. Group Relative Policy Optimization
5.2.2. 群体相对策略优化

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
类似于 DeepSeek-V2 (DeepSeek-AI, 2024c),我们采用群体相对策略优化 (GRPO) (Shao et al., 2024),该方法放弃了通常与策略模型大小相同的评论模型,而是从群体得分中估计基线。具体来说,对于每个问题 $q$,GRPO 从旧的策略模型 $\pi_{\theta_{old}}$ 中抽取一组输出 $\{o_1, o_2, \cdots, o_G\}$,然后通过最大化以下目标来优化策略模型 $\pi_{\theta}$:
$$
\begin{gathered}
\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\left\{o_{i}\right\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \\
\frac{1}{G} \sum_{i=1}^{G}\left(\min \left(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{old}}\left(o_{i} \mid q\right)} A_{i}, \operatorname{clip}\left(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{old}}\left(o_{i} \mid q\right)}, 1-\varepsilon, 1+\varepsilon\right) A_{i}\right)-\beta \mathbb{D}_{KL}\left(\pi_{\theta} \| \pi_{ref}\right)\right), \\
\mathbb{D}_{KL}\left(\pi_{\theta} \| \pi_{ref}\right)=\frac{\pi_{ref}\left(o_{i} \mid q\right)}{\pi_{\theta}\left(o_{i} \mid q\right)}-\log \frac{\pi_{ref}\left(o_{i} \mid q\right)}{\pi_{\theta}\left(o_{i} \mid q\right)}-1,
\end{gathered}
$$
where $\varepsilon$ and $\beta$ are hyper-parameters; $\pi_{ref}$ is the reference model; and $A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:
其中 $\varepsilon$ 和 $\beta$ 是超参数;$\pi_{ref}$ 是参考模型;$A_i$ 是优势,来源于与每组输出对应的奖励 $\{r_1, r_2, \ldots, r_G\}$:
$$
A_{i}=\frac{r_{i}-\operatorname{mean}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)}{\operatorname{std}\left(\left\{r_{1}, r_{2}, \cdots, r_{G}\right\}\right)} .
$$
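For concreteness, a minimal sketch of this objective for a single group of $G$ outputs is given below; it treats each output's log-probability at the sequence level and uses illustrative values for $\varepsilon$ and $\beta$, so it is a simplified rendering of the formula rather than the training implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO objective for one group of G sampled outputs (returns a loss to minimize).

    logp_new / logp_old / logp_ref: [G] sequence log-probabilities under the
    current, old, and reference policies; rewards: [G] scalar rewards.
    """
    # Group-relative advantages: standardize the rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Clipped importance-sampling ratio, as in J_GRPO above.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased estimator of D_KL(pi_theta || pi_ref) used as the regularizer.
    log_ref_over_new = logp_ref - logp_new
    kl = torch.exp(log_ref_over_new) - log_ref_over_new - 1
    return -(surrogate - beta * kl).mean()
```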
We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.
我们在强化学习过程中融入来自不同领域的提示,例如编码、数学、写作、角色扮演和问答。这种方法不仅使模型与人类偏好更紧密地对齐,还提高了基准测试的性能,特别是在可用的 SFT 数据有限的情况下。

5.3. Evaluations  5.3. 评估

5.3.1. Evaluation Settings
5.3.1. 评估设置

Evaluation Benchmarks. Apart from the benchmarks we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider$^{1}$, LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces$^{2}$, Chinese National High School Mathematics Olympiad (CNMO 2024)$^{3}$, and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).
评估基准。除了我们用于基础模型测试的基准外,我们还在 IFEval(Zhou et al., 2023)、FRAMES(Krishna et al., 2024)、LongBench v2(Bai et al., 2024)、GPQA(Rein et al., 2023)、SimpleQA(OpenAI, 2024c)、C-SimpleQA(He et al., 2024)、SWE-Bench Verified(OpenAI, 2024d)、Aider$^{1}$、LiveCodeBench(Jain et al., 2024)(2024 年 8 月至 2024 年 11 月的问题)、Codeforces$^{2}$、中国国家高中数学奥林匹克(CNMO 2024)$^{3}$,以及 2024 年美国邀请数学考试(AIME 2024)(MAA, 2024)上进一步评估指令模型。
Compared Baselines. We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.
比较基准。我们对我们的聊天模型进行了全面评估,针对多个强基准,包括 DeepSeek-V2-0506、DeepSeek-V2.5-0905、Qwen2.5 72B Instruct、LLaMA-3.1 405B Instruct、Claude-Sonnet-3.5-1022 和 GPT-4o-0513。对于 DeepSeek-V2 模型系列,我们选择了最具代表性的变体进行比较。对于闭源模型,评估通过各自的 API 进行。
Detailed Evaluation Configurations. For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework$^{4}$. We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, Cpp, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench Verified is evaluated using the agentless framework (Xia et al., 2024). We use the "diff" format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.
详细评估配置。对于包括 MMLU、DROP、GPQA 和 SimpleQA 在内的标准基准,我们采用来自 simple-evals 框架的评估提示$^{4}$。我们在零样本设置中对 MMLU-Redux 使用 Zero-Eval 提示格式(Lin, 2024)。对于其他数据集,我们遵循数据集创建者提供的默认提示的原始评估协议。对于代码和数学基准,HumanEval-Mul 数据集总共包括 8 种主流编程语言(Python、Java、Cpp、C#、JavaScript、TypeScript、PHP 和 Bash)。我们使用 CoT 和非 CoT 方法评估模型在 LiveCodeBench 上的表现,数据收集时间为 2024 年 8 月至 2024 年 11 月。Codeforces 数据集的测量使用竞争者的百分比。SWE-Bench Verified 使用无代理框架(Xia et al., 2024)进行评估。我们使用"diff"格式评估与 Aider 相关的基准。对于数学评估,AIME 和 CNMO 2024 的温度设置为 0.7,结果在 16 次运行中取平均,而 MATH-500 则采用贪婪解码。我们允许所有模型在每个基准上输出最多 8192 个标记。
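As a small illustration of the sampling-based protocol for AIME and CNMO 2024, the helper below averages Pass@1 over several runs at temperature 0.7; `evaluate_once` is a hypothetical callback standing in for a full benchmark run.

```python
def averaged_pass_at_1(evaluate_once, num_runs=16, temperature=0.7):
    """Average Pass@1 over several sampled runs, as done for AIME and CNMO 2024."""
    scores = [evaluate_once(temperature=temperature) for _ in range(num_runs)]
    return sum(scores) / len(scores)
```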
| | Benchmark (Metric) | DeepSeek-V2-0506 | DeepSeek-V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | Architecture | MoE | MoE | Dense | Dense | - | - | MoE |
| | # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B |
| | # Total Params | 236B | 236B | 72B | 405B | - | - | 671B |
| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
| | MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1 |
| | MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9 |
| | DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6 |
| | IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1 |
| | GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1 |
| | SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9 |
| | FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3 |
| | LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7 |
| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6 |
| | LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5 |
| | LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6 |
| | Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 |
| | SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 |
| | Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7 |
| | Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6 |
| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| | MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
| | CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2 |
| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9 |
| | C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5 |
| | C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8 |
Table 6 | Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
表 6 | DeepSeek-V3 与其他代表性聊天模型的比较。所有模型在限制输出长度为 8K 的配置下进行评估。包含少于 1000 个样本的基准测试使用不同的温度设置多次测试,以得出稳健的最终结果。DeepSeek-V3 作为表现最佳的开源模型,同时在与前沿闭源模型的竞争中也展现出竞争力。

5.3.2. Standard Evaluation
5.3.2. 标准评估

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the bestperforming open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.
表 6 展示了评估结果,表明 DeepSeek-V3 是表现最好的开源模型。此外,它在与前沿的闭源模型如 GPT-4o 和 Claude-3.5-Sonnet 的竞争中也表现出色。
English Benchmarks. MMLU is a widely recognized benchmark designed to assess the performance of large language models across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.
英语基准。MMLU 是一个广泛认可的基准,旨在评估大型语言模型在不同知识领域和任务中的表现。DeepSeek-V3 展现出竞争力的表现,与顶级模型如 LLaMA-3.1-405B、GPT-4o 和 Claude-Sonnet 3.5 不相上下,同时显著超越 Qwen2.5 72B。此外,DeepSeek-V3 在 MMLU-Pro 这一更具挑战性的教育知识基准中表现出色,紧随 Claude-Sonnet 3.5。在 MMLU-Redux 中,DeepSeek-V3 超越了其同行,MMLU-Redux 是 MMLU 的一个经过修正标签的精炼版本。此外,在 GPQA-Diamond 这一博士级评估测试平台上,DeepSeek-V3 取得了显著的成绩,仅次于 Claude 3.5 Sonnet,并以相当大的优势超越了所有其他竞争对手。
In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek-V3. On the factual knowledge benchmark, SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, DeepSeek-V2-series, highlighting its improved ability to understand and adhere to user-defined format constraints.
在长文本理解基准测试中,如 DROP、LongBench v2 和 FRAMES,DeepSeek-V3 继续展示其作为顶级模型的地位。在 DROP 的 3-shot 设置中,它取得了令人印象深刻的 91.6 F1 分数,超越了该类别的所有其他模型。在 FRAMES 中,这是一个需要在超过 10 万标记上下文中进行问答的基准,DeepSeek-V3 紧随 GPT-4o 之后,同时显著超越了所有其他模型。这证明了 DeepSeek-V3 在处理极长文本任务方面的强大能力。DeepSeek-V3 的长文本能力在 LongBench v2 上的最佳表现进一步得到了验证,该数据集是在 DeepSeek V3 发布前几周发布的。在事实知识基准测试 SimpleQA 中,DeepSeek-V3 落后于 GPT-4o 和 Claude-Sonnet,主要是由于其设计重点和资源分配。DeepSeek-V3 分配了更多的训练标记来学习中文知识,从而在 C-SimpleQA 上表现出色。 在遵循指令的基准测试中,DeepSeek-V3 显著优于其前身 DeepSeek-V2 系列,突显了其理解和遵循用户定义格式约束的能力的提升。
Code and Math Benchmarks. Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.
代码和数学基准。编码是一个具有挑战性和实用性的任务,涉及工程相关的任务,如 SWE-Bench-Verified 和 Aider,以及算法任务,如 HumanEval 和 LiveCodeBench。在工程任务中,DeepSeek-V3 落后于 Claude-Sonnet-3.5-1022,但显著优于开源模型。开源的 DeepSeek-V3 预计将促进编码相关工程任务的进步。通过提供其强大的能力,DeepSeek-V3 可以推动软件工程和算法开发等领域的创新和改进,使开发者和研究人员能够突破开源模型在编码任务中所能达到的界限。在算法任务中,DeepSeek-V3 表现出卓越的性能,在 HumanEval-Mul 和 LiveCodeBench 等基准测试中超越所有基线。这一成功可归因于其先进的知识蒸馏技术,有效增强了其在算法相关任务中的代码生成和问题解决能力。
On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.
在数学基准测试中,DeepSeek-V3 展现了卓越的性能,显著超越了基准,并为非 o1 类模型设定了新的最先进水平。具体而言,在 AIME、MATH-500 和 CNMO 2024 上,DeepSeek-V3 在绝对分数上超越了第二名模型 Qwen2.5 72B 约 10%,这是在如此具有挑战性的基准测试中一个相当大的差距。这一显著能力突显了来自 DeepSeek-R1 的蒸馏技术的有效性,该技术已被证明对非 o1 类模型极为有利。
| Model | Arena-Hard | AlpacaEval 2.0 |
|:---|:---:|:---:|
| DeepSeek-V2.5-0905 | 76.2 | 50.5 |
| Qwen2.5-72B-Instruct | 81.2 | 49.1 |
| LLaMA-3.1 405B | 69.3 | 40.5 |
| GPT-4o-0513 | 80.4 | 51.1 |
| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 |
| DeepSeek-V3 | **85.5** | **70.0** |
Table 7 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.
表 7 | 英语开放式对话评估。对于 AlpacaEval 2.0,我们使用长度控制的胜率作为指标。
Chinese Benchmarks. Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on.
中文基准。Qwen 和 DeepSeek 是两个在中文和英文方面都有强大支持的代表性模型系列。在事实基准 Chinese SimpleQA 上,DeepSeek-V3 超过了 Qwen2.5-72B 16.4 分,尽管 Qwen2.5 是在一个更大的语料库上训练的,包含 18T 令牌,比 DeepSeek-V3 预训练的 14.8T 令牌多 20%。
On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.
在 C-Eval,一个代表性的中文教育知识评估基准,以及 CLUEWSC(中文 Winograd 模式挑战),DeepSeek-V3 和 Qwen2.5-72B 表现出相似的性能水平,表明这两个模型在具有挑战性的中文推理和教育任务上都经过了良好的优化。

5.3.3. Open-Ended Evaluation
5.3.3. 开放式评估

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.
除了标准基准测试,我们还在开放式生成任务上评估我们的模型,使用 LLMs 作为评审,结果如表 7 所示。具体而言,我们遵循 AlpacaEval 2.0(Dubois 等,2024)和 Arena-Hard(Li 等,2024a)的原始配置,这些配置利用 GPT-4-Turbo-1106 作为成对比较的评审。在 Arena-Hard 上,DeepSeek-V3 以超过 86% 的胜率击败基线 GPT-4-0314,表现与顶级模型如 Claude-Sonnet-3.5-1022 相当。这突显了 DeepSeek-V3 的强大能力,特别是在处理复杂提示时,包括编码和调试任务。此外,DeepSeek-V3 作为第一个在 Arena-Hard 基准上超过 85% 的开源模型,达成了突破性的里程碑。这一成就显著缩小了开源模型与闭源模型之间的性能差距,为开源模型在挑战性领域所能实现的标准设定了新的标杆。
Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.
同样,DeepSeek-V3 在 AlpacaEval 2.0 上表现出色,超越了闭源和开源模型。这证明了它在写作任务和处理简单问答场景方面的卓越能力。值得注意的是,它比 DeepSeek-V2.5-0905 超出了 20% 的显著差距,突显了在处理简单任务方面的重大改进,并展示了其进步的有效性。

5.3.4. DeepSeek-V3 as a Generative Reward Model
5.3.4. DeepSeek-V3 作为生成奖励模型

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models in RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.
我们将 DeepSeek-V3 的判断能力与最先进的模型进行比较,即 GPT-4o 和 Claude-3.5。表 8 展示了这些模型在 RewardBench(Lambert 等,2024)中的表现。DeepSeek-V3 的表现与 GPT-4o-0806 和 Claude-3.5-Sonnet-1022 的最佳版本相当,同时超越了其他版本。此外,DeepSeek-V3 的判断能力还可以通过投票技术得到增强。因此,我们采用 DeepSeek-V3 结合投票来对开放式问题提供自我反馈,从而提高对齐过程的有效性和稳健性。
| Model | Chat | Chat-Hard | Safety | Reasoning | Average |
|:---|:---:|:---:|:---:|:---:|:---:|
| GPT-4o-0513 | 96.6 | 70.4 | 86.7 | 84.9 | 84.7 |
| GPT-4o-0806 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7 |
| GPT-4o-1120 | 95.8 | 71.3 | 86.2 | 85.2 | 84.6 |
| Claude-3.5-sonnet-0620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
| Claude-3.5-sonnet-1022 | 96.4 | 79.7 | 91.1 | 87.6 | 88.7 |
| DeepSeek-V3 | 96.9 | 79.8 | 87.0 | 84.3 | 87.0 |
| DeepSeek-V3 (maj@6) | 96.9 | 82.6 | 89.5 | 89.2 | 89.6 |
Table 8 | Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.
表 8 | GPT-4o、Claude-3.5-sonnet 和 DeepSeek-V3 在 RewardBench 上的表现。
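The maj@6 row aggregates several independent judgments by majority voting; a minimal sketch of that aggregation is shown below (the verdict labels are placeholders).

```python
from collections import Counter

def majority_vote(judgments):
    """Return the most common verdict among sampled judgments (e.g., 6 for maj@6);
    ties resolve to the verdict encountered first."""
    return Counter(judgments).most_common(1)[0][0]

# Example with six sampled verdicts from the model acting as a judge.
print(majority_vote(["A", "B", "A", "A", "B", "A"]))  # -> "A"
```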
| Model | LiveCodeBench-CoT Pass@1 | LiveCodeBench-CoT Length | MATH-500 Pass@1 | MATH-500 Length |
|:---|:---:|:---:|:---:|:---:|
| DeepSeek-V2.5 Baseline | 31.1 | 718 | 74.6 | 769 |
| DeepSeek-V2.5 + R1 Distill | 37.4 | 783 | 83.2 | 1510 |
Table 9 | The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.
表 9 | DeepSeek-R1 的蒸馏贡献。LiveCodeBench 和 MATH-500 的评估设置与表 6 相同。


5.4. Discussion  5.4. 讨论

5.4.1. Distillation from DeepSeek-R1
5.4.1. 从 DeepSeek-R1 进行蒸馏

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.
我们基于 DeepSeek-V2.5 消融了来自 DeepSeek-R1 的蒸馏贡献。基线是在短的 CoT 数据上训练的,而其竞争者使用的是上述专家检查点生成的数据。
Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements in both LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.
表 9 展示了蒸馏数据的有效性,在 LiveCodeBench 和 MATH-500 基准测试中均显示出显著的改进。我们的实验揭示了一个有趣的权衡:蒸馏导致了更好的性能,但也显著增加了平均响应长度。为了在模型准确性和计算效率之间保持平衡,我们为 DeepSeek-V3 在蒸馏中仔细选择了最佳设置。
Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.
我们的研究表明,从推理模型中进行知识蒸馏为后训练优化提供了一个有前景的方向。虽然我们目前的工作集中在从数学和编码领域提取数据,但这种方法在各种任务领域中显示出更广泛应用的潜力。在这些特定领域中展示的有效性表明,长链蒸馏可能对提高其他需要复杂推理的认知任务中的模型性能具有价值。进一步探索这种方法在不同领域的应用仍然是未来研究的重要方向。

5.4.2. Self-Rewarding  5.4.2. 自我奖励

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical.
奖励在强化学习中起着关键作用,指导优化过程。在通过外部工具进行验证相对简单的领域,例如某些编码或数学场景中,强化学习表现出卓越的效果。然而,在更一般的场景中,通过硬编码构建反馈机制是不切实际的。

During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.
在 DeepSeek-V3 的开发过程中,对于这些更广泛的上下文,我们采用了宪法 AI 方法(Bai et al., 2022),利用 DeepSeek-V3 自身的投票评估结果作为反馈来源。这种方法产生了显著的对齐效果,显著提升了 DeepSeek-V3 在主观评估中的表现。通过整合额外的宪法输入,DeepSeek-V3 可以朝着宪法方向进行优化。我们相信,这种将补充信息与 LLMs 作为反馈来源相结合的范式至关重要。LLM 作为一个多功能处理器,能够将来自不同场景的非结构化信息转化为奖励,最终促进 LLMs 的自我改进。除了自我奖励,我们还致力于发现其他通用和可扩展的奖励方法,以持续提升模型在一般场景中的能力。

5.4.3. Multi-Token Prediction Evaluation
5.4.3. 多标记预测评估

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).
DeepSeek-V3 不是仅仅预测下一个单一的 token,而是通过 MTP 技术预测下两个 token。结合推测解码的框架(Leviathan et al., 2023; Xia et al., 2023),它可以显著加速模型的解码速度。一个自然的问题是关于额外预测 token 的接受率。根据我们的评估,第二个 token 预测的接受率在不同生成主题中范围在 85% 到 90% 之间,显示出一致的可靠性。这一高接受率使得 DeepSeek-V3 能够实现显著提高的解码速度,达到 1.8 倍的 TPS(每秒 token 数)。
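A back-of-the-envelope estimate of this speedup, ignoring the cost of verifying the speculative token, is sketched below; with one extra MTP-predicted token per step, throughput scales by roughly 1 + acceptance rate, which is consistent with the reported 1.8x TPS.

```python
def mtp_decoding_speedup(acceptance_rate: float) -> float:
    """Expected tokens emitted per decoding step with one extra speculative token."""
    return 1.0 + acceptance_rate

for rate in (0.85, 0.90):
    print(f"acceptance {rate:.0%} -> ~{mtp_decoding_speedup(rate):.2f}x TPS")
```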

6. Conclusion, Limitations, and Future Directions
6. 结论、局限性和未来方向

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training stage also succeeds in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.
在本文中,我们介绍了 DeepSeek-V3,这是一个大型 MoE 语言模型,具有 671B 的总参数和 37B 的激活参数,训练于 14.8 T 的 token。除了 MLA 和 DeepSeekMoE 架构外,它还开创了一种无辅助损失的负载平衡策略,并设定了多 token 预测训练目标以实现更强的性能。由于支持 FP8 训练和细致的工程优化,DeepSeek-V3 的训练具有成本效益。后训练阶段也成功地从 DeepSeek-R1 系列模型中提炼了推理能力。全面评估表明,DeepSeek-V3 已成为目前可用的最强开源模型,其性能可与领先的闭源模型如 GPT-4o 和 Claude-3.5-Sonnet 相媲美。尽管性能强劲,但它仍保持经济的训练成本。其完整训练仅需 2.788 M H800 GPU 小时,包括预训练、上下文长度扩展和后训练。
While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeekV3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.
虽然承认其强大的性能和成本效益,但我们也认识到 DeepSeek-V3 在部署方面存在一些局限性。首先,为了确保高效推理,DeepSeek-V3 的推荐部署单元相对较大,这可能对小型团队造成负担。其次,尽管我们针对 DeepSeek-V3 的部署策略已实现了超过 DeepSeek-V2 两倍的端到端生成速度,但仍然存在进一步提升的潜力。幸运的是,随着更先进硬件的发展,这些局限性预计会自然得到解决。
DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.
DeepSeek 始终坚持开放源代码模型的长期主义路线,旨在稳步接近 AGI(人工通用智能)的最终目标。未来,我们计划在以下方向上进行战略性研究投资。
  • We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of Transformer, thereby pushing the boundaries of its modeling capabilities.
    我们将持续研究和完善我们的模型架构,旨在进一步提高训练和推理效率,努力接近对无限上下文长度的高效支持。此外,我们将尝试突破 Transformer 的架构限制,从而推动其建模能力的边界。
  • We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.
    我们将不断迭代我们的训练数据的数量和质量,并探索额外训练信号源的整合,旨在推动数据在更广泛的维度上进行扩展。
  • We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.
    我们将持续探索和迭代我们模型的深度思维能力,旨在通过扩展它们的推理长度和深度来增强它们的智能和解决问题的能力。
  • We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.

References  参考文献

AI@Meta. Llama 3 model card, 2024a. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
AI@Meta. Llama 3 模型卡,2024a。网址 https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
AI@Meta. Llama 3.1 model card, 2024b. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md.
AI@Meta. Llama 3.1 模型卡,2024b。网址 https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md。
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, 等人。使用大型语言模型进行程序合成。arXiv 预印本 arXiv:2108.07732, 2021。

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, 等。宪法人工智能:来自人工智能反馈的无害性。arXiv 预印本 arXiv:2212.08073, 2022。

Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, 和 J. Li. LongBench v2: 朝着对现实长上下文多任务的更深入理解和推理. arXiv 预印本 arXiv:2412.15204, 2024.

M. Bauer, S. Treichler, and A. Aiken. Singe: leveraging warp specialization for high performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, page 119-130, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326568. doi: 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258
M. Bauer, S. Treichler, 和 A. Aiken. Singe: 利用波特化实现高性能的 GPU. 在第 19 届 ACM SIGPLAN 并行编程原理与实践研讨会论文集,PPoPP '14,第 119-130 页,纽约,NY,美国,2014. 计算机协会. ISBN 9781450326568. doi: 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432-7439. AAAI Press, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, 和 Y. Choi. PIQA:在自然语言中推理物理常识。在第三十四届人工智能 AAAI 会议,AAAI 2020,第三十二届人工智能创新应用会议,IAAI 2020,第十届人工智能教育进展 AAAI 研讨会,EAAI 2020,美国纽约,2020 年 2 月 7-12 日,页 7432-7439。AAAI 出版社,2020。doi: 10.1609/aaai.v34i05.6239。网址 https://doi.org/10.1609/aaai.v34i05.6239

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick 和 O. Tafjord. 认为你已经解决了问答问题?试试 arc,AI2 推理挑战。CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, 等。训练验证器以解决数学文字问题。arXiv 预印本 arXiv:2110.14168, 2021。

Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu. A span-extraction dataset for Chinese machine reading comprehension. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5883-5889, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1600. URL https://aclanthology.org/D19-1 600
Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, 和 G. Hu. 一个用于中文机器阅读理解的跨度提取数据集。在 K. Inui, J. Jiang, V. Ng, 和 X. Wan 编辑的《2019 年自然语言处理实证方法会议暨第九届国际联合自然语言处理会议(EMNLP-IJCNLP)论文集》,第 5883-5889 页,香港,中国,2019 年 11 月。计算语言学协会。doi: 10.18653/v1/D19-1600。网址 https://aclanthology.org/D19-1600

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066.
D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, 和 W. Liang. Deepseekmoe: 朝着混合专家语言模型中的终极专家专业化迈进。CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066.
DeepSeek-AI. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. CoRR, abs/2406.11931, 2024a. URL https://doi.org/10.48550/arXiv.2406.11931.
DeepSeek-AI。Deepseek-coder-v2:打破闭源模型在代码智能中的壁垒。CoRR,abs/2406.11931,2024a。网址 https://doi.org/10.48550/arXiv.2406.11 931。
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024b. URL https://doi.org/10.48550/arXiv.2401.02954.
DeepSeek-AI。Deepseek LLM:通过长期主义扩展开源语言模型。CoRR,abs/2401.02954,2024b。URLhttps://doi.org/10.48550/arXiv.2401.02954。
DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024c. URL https://doi.org/10.48550/arXiv.2405.04434.
DeepSeek-AI。Deepseek-v2:一个强大、经济且高效的混合专家语言模型。CoRR,abs/2405.04434,2024c。URLhttps://doi.org/10.48550/arXiv. 2405 04434。

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318-30332, 2022.
T. Dettmers, M. Lewis, Y. Belkada, 和 L. Zettlemoyer. Gpt3. int8 (): 大规模变换器的 8 位矩阵乘法. 神经信息处理系统进展, 35:3031830332, 2022.

H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations improve language modeling. arXiv preprint arXiv:2404.10830, 2024.
H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, 和 S. Soatto. 更少的截断改善语言建模. arXiv 预印本 arXiv:2404.10830, 2024.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 23682378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL https://doi.org/10.18653/v1/n19-1246.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, 和 M. Gardner. DROP:一个需要对段落进行离散推理的阅读理解基准。在 J. Burstein, C. Doran, 和 T. Solorio 编辑的《2019 年北美计算语言学协会会议论文集:人类语言技术,NAACL-HLT 2019》,明尼阿波利斯,MN,美国,2019 年 6 月 2-7 日,第 1 卷(长篇和短篇论文),第 2368-2378 页。计算语言学协会,2019。doi: 10.18653/V1/N19-1246。网址 https://doi.org/10.18653/v1/n19-1246。

Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
Y. Dubois, B. Galambosi, P. Liang, 和 T. B. Hashimoto. 长度控制的 alpacaeval:一种消除自动评估者偏差的简单方法。arXiv 预印本 arXiv:2404.04475, 2024.

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/ abs/2101.03961.
W. Fedus, B. Zoph, 和 N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/ abs/2101.03961.

M. Fishman, B. Chmiel, R. Banner, and D. Soudry. Scaling FP8 training to trillion-token llms. arXiv preprint arXiv:2409.12517, 2024.
M. Fishman, B. Chmiel, R. Banner, 和 D. Soudry. 将 FP8 训练扩展到万亿标记 llms. arXiv 预印本 arXiv:2409.12517, 2024.

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
E. Frantar, S. Ashkboos, T. Hoefler, 和 D. Alistarh. Gptq: 生成预训练变换器的准确后训练量化. arXiv 预印本 arXiv:2210.17323, 2022.

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, 等人。《The Pile:一个用于语言建模的 800GB 多样文本数据集》。arXiv 预印本 arXiv:2101.00027,2020 年。

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are we done with mmlu? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, 和 P. Minervini. 我们完成 mmlu 了吗?CoRR, abs/2406.04127, 2024. URL https://doi. or g/10.48550/arXiv.2406.04127.

F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=pEWAcejiU2
F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, 和 G. Synnaeve. 通过多标记预测提高更好更快的大型语言模型. 在第四十一届国际机器学习会议,ICML 2024,奥地利维也纳,2024 年 7 月 21-27 日. OpenReview.net, 2024. URL https://openreview.net/forum?id=pEWAcejiU2
Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.
谷歌。我们的下一代模型:Gemini 1.5,2024。网址 https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024。

R. L. Graham, D. Bureddy, P. Lui, H. Rosenstock, G. Shainer, G. Bloch, D. Goldenerg, M. Dubman, S. Kotchubievsky, V. Koushnir, et al. Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction. In 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pages 1-10. IEEE, 2016.
R. L. Graham, D. Bureddy, P. Lui, H. Rosenstock, G. Shainer, G. Bloch, D. Goldenerg, M. Dubman, S. Kotchubievsky, V. Koushnir, 等. 可扩展的分层聚合协议 (SHArP):一种高效数据压缩的硬件架构. 在 2016 年第一次国际高性能计算通信优化研讨会 (COMHPC) 上, 页码 1-10. IEEE, 2016.

A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution, 2024.
A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, 和 S. I. Wang. Cruxeval: 一项用于代码推理、理解和执行的基准,2024。

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196, 2024. URL/https://doi.org/10.485 50/arXiv. 2401.14196
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196, 2024. URL https://doi.org/10.48550/arXiv.2401.14196.

A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training, 2018. URL https://arxiv.org/abs/1806.03377.
A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, 和 P. Gibbons. Pipedream: 快速高效的管道并行 DNN 训练, 2018. URL https://arxiv.org/abs/1806.03377.

B. He, L. Noci, D. Paliotta, I. Schlag, and T. Hofmann. Understanding and minimising outlier features in transformer training. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
B. He, L. Noci, D. Paliotta, I. Schlag, 和 T. Hofmann. 理解和最小化变压器训练中的异常特征. 在第三十八届神经信息处理系统年会上。

Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140, 2024.
Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, 等. Chinese simpleqa: 一种针对大型语言模型的中文事实性评估. arXiv 预印本 arXiv:2411.07140, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, 和 J. Steinhardt. 测量大规模多任务语言理解. arXiv 预印本 arXiv:2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, 和 J. Steinhardt. 使用数学数据集测量数学问题解决能力。arXiv 预印本 arXiv:2103.03874, 2021.

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, 等人。C-Eval:一个多层次多学科的中文评估套件,用于基础模型。arXiv 预印本 arXiv:2305.08322,2023。

N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv. 2403.07974.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, 和 I. Stoica. Livecodebench: 大型语言模型代码的整体和无污染评估. CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv. 2403.07974.

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. 1. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. 1. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, 等. Mistral 7b. arXiv 预印本 arXiv:2310.06825, 2023.

M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
M. Joshi, E. Choi, D. Weld, 和 L. Zettlemoyer. TriviaQA: 一个大规模远程监督的阅读理解挑战数据集。在 R. Barzilay 和 M.-Y. Kan 编辑的《计算语言学协会第 55 届年会论文集(第一卷:长篇论文)》中,页面 1601-1611,加拿大温哥华,2017 年 7 月。计算语言学协会。doi: 10.18653/v1/P17-1147。网址 https://aclanthology.org/P17-1147。

D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.
D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen, 等人。关于 bfloat16 在深度学习训练中的研究。arXiv 预印本 arXiv:1905.12322, 2019。

S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR, abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.48550/arXiv.2409.12941.
S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, 和 M. Faruqui. 事实、获取和推理:检索增强生成的统一评估。CoRR, abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.48550/arXiv.2409.12941.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452-466, 2019. doi: 10.1162/tacl_a_00276. URL https://doi.org/10.1162/tacl_a_00276.
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, 和 S. Petrov. 自然问题:一个用于问答研究的基准。计算语言学协会会刊,7:452-466, 2019。doi: 10.1162/tacl \\\backslash _a \\\backslash _00276. URL https://doi.org/10.1162/tacl_a_00276.

G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 785-794. Association for Computational Linguistics, 2017. doi: 10.18653/V1/D17-1082. URL https://doi.org/10.18653/v1/d17-1082.
G. Lai, Q. Xie, H. Liu, Y. Yang, 和 E. H. Hovy. RACE: 来自考试的大规模阅读理解数据集. 在 M. Palmer, R. Hwa, 和 S. Riedel, 编辑, 2017 年自然语言处理实证方法会议论文集, EMNLP 2017, 丹麦哥本哈根, 2017 年 9 月 9-11 日, 页码 785-794. 计算语言学协会, 2017. doi: 10.18653/V1/D17-1082. URL/ https://doi.org/10.18653/v1/d17-1082.

N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, 等人。Rewardbench:评估语言建模的奖励模型。arXiv 预印本 arXiv:2403.13787,2024。

D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, 和 Z. Chen. Gshard: 使用条件计算和自动分片来扩展巨型模型. 在第九届国际学习表征会议,ICLR 2021. OpenReview.net, 2 2021 . 2 2021 . ¯ 2 bar(2021.)2 \overline{2021 .} URL https://openreview.net/forum?id=qrwe7XHTmYb.

Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 19274-19286. PMLR, 2023. URL https://proceedings.mlr.press/v202/leviathan23a.html.
Y. Leviathan, M. Kalman, 和 Y. Matias. 通过推测解码实现变换器的快速推理。在国际机器学习会议,ICML 2023,2023 年 7 月 23 日至 29 日,夏威夷檀香山,美国,机器学习研究会议论文集第 202 卷,页码 19274-19286。PMLR,2023。URL https://proceedings.mlr.press/v202/leviathan23 a.html。

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, 和 T. Baldwin. CMMLU: 测量中文的大规模多任务语言理解. arXiv 预印本 arXiv:2306.09212, 2023.

S. Li and T. Hoefler. Chimera: efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21, pages 1-14. ACM, Nov. 2021. doi: 10.1145/3458817.3476145. URL http://dx.doi.org/10.1145/3458817.3476145.
S. Li 和 T. Hoefler. Chimera: 高效训练大规模神经网络的双向管道。在国际高性能计算、网络、存储与分析会议论文集,SC '21,第 1-14 页。ACM,2021 年 11 月。doi: 10.1145/3458817.3476145。网址 http://dx.doi.org/10.1145/3458817.3476145。

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024a.
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, 和 I. Stoica. 从众包数据到高质量基准:Arena-hard 和 benchbuilder 流程. arXiv 预印本 arXiv:2406.11939, 2024a.

W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021.
W. Li, F. Qi, M. Sun, X. Yi, 和 J. Zhang. Ccpm: 一个中文古典诗词匹配数据集, 2021.

Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE: speculative sampling requires rethinking feature uncertainty. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024b. URL https://openreview.net/forum?id=1NdN7eXyb4.
Y. Li, F. Wei, C. Zhang, 和 H. Zhang. EAGLE: 推测采样需要重新思考特征不确定性. 在第四十一届国际机器学习会议,ICML 2024,奥地利维也纳,2024 年 7 月 21-27 日. OpenReview.net, 2024b. URL https://openreview.net/forum?id=1NdN7eXyb4

B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL https://github.com/WildEval/ZeroEval
B. Y. Lin. ZeroEval:评估语言模型的统一框架,2024 年 7 月。网址 https://github.com/WildEval/ZeroEval

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
I. Loshchilov 和 F. Hutter. 解耦权重衰减正则化. arXiv 预印本 arXiv:1711.05101, 2017.

S. Lundberg. The art of prompt design: Prompt boundaries and token healing, 2023. URL https://towardsdatascience.com/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0be38.
S. Lundberg. 提示设计的艺术:提示边界和令牌修复,2023。网址 https://towardsdatascience.com/the-art-of-prompt-design-prompt-bound aries-and-token-healing-3b2448b0be38。

Y. Luo, Z. Zhang, R. Wu, H. Liu, Y. Jin, K. Zheng, M. Wang, Z. He, G. Hu, L. Chen, et al. Ascend HiFloat8 format for deep learning. arXiv preprint arXiv:2409.16626, 2024.
Y. Luo, Z. Zhang, R. Wu, H. Liu, Y. Jin, K. Zheng, M. Wang, Z. He, G. Hu, L. Chen, 等. Ascend HiFloat8 format for deep learning. arXiv preprint arXiv:2409.16626, 2024.
MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.
MAA. 美国邀请数学考试 - AIME。在 2024 年 2 月的美国邀请数学考试 - AIME 2024 中。网址 https://maa.org/math-competitions/american-invitational-mathematics-examination-aime

P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, et al. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, 等人。FP8 格式用于深度学习。arXiv 预印本 arXiv:2209.05433, 2022。
Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.
Mistral。更便宜、更好、更快、更强:继续推动人工智能的前沿,使其对所有人可及,2024 年。网址 https://mistral.ai/news/mixtral-8x22b。

S. Narang, G. Diamos, E. Elsen, P. Micikevicius, J. Alben, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. Mixed precision training. In Int. Conf. on Learning Representation, 2017.
S. Narang, G. Diamos, E. Elsen, P. Micikevicius, J. Alben, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, 等人。混合精度训练。在国际学习表征会议,2017 年。

B. Noune, P. Jones, D. Justus, D. Masters, and C. Luschi. 8-bit numerical formats for deep neural networks. arXiv preprint arXiv:2206.02915, 2022.
B. Noune, P. Jones, D. Justus, D. Masters, 和 C. Luschi. 深度神经网络的 8 位数值格式. arXiv 预印本 arXiv:2206.02915, 2022.
NVIDIA. Improving network performance of HPC systems using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async, 2022.
NVIDIA。使用 NVIDIA Magnum IO NVSHMEM 和 GPUDirect Async 提高 HPC 系统的网络性能。https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async,2022。
NVIDIA. Blackwell architecture. https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/, 2024a.
NVIDIA. Blackwell 架构。 https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/. 2024a.
NVIDIA. TransformerEngine, 2024b. URL https://github.com/NVIDIA/TransformerEngine. Accessed: 2024-11-19.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
OpenAI。你好 GPT-4o,2024a。网址 https://openai.com/index/hello-gpt-4o/。

OpenAI. Multilingual massive multitask language understanding (mmmlu), 2024b. URL https://huggingface.co/datasets/openai/MMMLU.
OpenAI. 多语言大规模多任务语言理解 (mmmlu), 2024b. URL https://huggingface.co/datasets/openai/MMMLU.
OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing-simpleqa/.
OpenAI。介绍 SimpleQA,2024c。URLhttps://openai.com/index/introducing-simpleqa/。
OpenAI. Introducing SWE-bench Verified, 2024d. URL https://openai.com/index/introducing-swe-bench-verified/.
OpenAI。介绍 SWE-bench 验证,我们正在发布一个经过人工验证的 swebench 子集,更多信息,请访问 2024d。URL/ https://openai.com/index/introducing-swe-bench-verified/。

B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023a.
B. Peng, J. Quesnelle, H. Fan, 和 E. Shippole. Yarn: 大型语言模型的高效上下文窗口扩展. arXiv 预印本 arXiv:2309.00071, 2023a.

H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu, et al. FP8-LM: Training FP8 large language models. arXiv preprint arXiv:2310.18313, 2023b.
H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu, 等. FP8-LM: 训练 FP8 大型语言模型. arXiv 预印本 arXiv:2310.18313, 2023b.

P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023a.
P. Qi, X. Wan, G. Huang, 和 M. Lin. 零气泡管道并行性. arXiv 预印本 arXiv:2401.10241, 2023a.

P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism, 2023b. URL https://arxiv.org/abs/2401.10241.
P. Qi, X. Wan, G. Huang, 和 M. Lin. 零气泡管道并行性, 2023b. URL https: //arxiv.org/abs/2401.10241.
Qwen. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Qwen. Qwen 技术报告。arXiv 预印本 arXiv:2309.16609, 2023。

Qwen. Introducing Qwen1.5, 2024a. URL https://qwenlm.github.io/blog/qwen1.5.
Qwen。介绍 Qwen1.5,2024a。网址/ https://qwenlm.github.io/blog/qwen1.5。

Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwen2.5.
Qwen. Qwen2.5:一组基础模型,2024b。URLhttps://qwenlm.github.io/b log/qwen2.5。

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE, 2020.
S. Rajbhandari, J. Rasley, O. Ruwase, 和 Y. He. Zero: 针对训练万亿参数模型的内存优化. 在 SC20: 高性能计算、网络、存储和分析国际会议, 页码 1-16. IEEE, 2020.

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, 和 S. R. Bowman. GPQA: 一项研究生水平的谷歌防范问答基准。arXiv 预印本 arXiv:2311.12022, 2023.

B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023a.
B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, 等。深度学习的微缩数据格式。arXiv 预印本 arXiv:2310.10537, 2023a。

B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023b.
B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, 等。深度学习的微缩数据格式。arXiv 预印本 arXiv:2310.10537, 2023b。

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.
K. Sakaguchi, R. L. Bras, C. Bhagavatula, 和 Y. Choi. Winogrande: 一项大规模的对抗性 Winograd 语法挑战, 2019.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, 和 D. Guo. Deepseekmath: 在开放语言模型中推动数学推理的极限. arXiv 预印本 arXiv:2402.03300, 2024.

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017. URL https: //openreview.net/forum?id=B1ckMDqlg.
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, 和 J. Dean. 极其庞大的神经网络:稀疏门控专家混合层. 在第五届国际学习表征会议,ICLR 2017. OpenReview.net, 2017. URL https: //openreview.net/forum?id=B1ckMDqlg.

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URLhttps://openreview.net/forum?i d=fR3wGCk-IXp.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, 和 J. Wei. 语言模型是多语言的链式思维推理者。 在第十一届国际学习表征会议,ICLR 2023,卢旺达基加利,2023 年 5 月 1 日至 5 日。 OpenReview.net,2023。 URLhttps://openreview.net/forum?id=fR3wGCk-IXp。

Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa. Byte pair encoding: A text compression scheme that accelerates pattern matching. 1999.
Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, 和 S. Arikawa. 字节对编码:一种加速模式匹配的文本压缩方案。1999。

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, 和 Y. Liu. Roformer: 带有旋转位置嵌入的增强型变换器. Neurocomputing, 568:127063, 2024.

K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019a.
K. Sun, D. Yu, D. Yu, 和 C. Cardie. 研究先前知识在挑战性中文机器阅读理解中的应用, 2019a.

M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
M. Sun, X. Chen, J. Z. Kolter, 和 Z. Liu. 大型语言模型中的大量激活. arXiv 预印本 arXiv:2402.17762, 2024.

X. Sun, J. Choi, C.-Y. Chen, N. Wang, S. Venkataramani, V. V. Srinivasan, X. Cui, W. Zhang, and K. Gopalakrishnan. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Advances in neural information processing systems, 32, 2019b.
X. Sun, J. Choi, C.-Y. Chen, N. Wang, S. Venkataramani, V. V. Srinivasan, X. Cui, W. Zhang, 和 K. Gopalakrishnan. 深度神经网络的混合 8 位浮点(HFP8)训练和推理. 神经信息处理系统进展, 32, 2019b.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, 等. 挑战大基准任务以及链式思维是否能解决它们。arXiv 预印本 arXiv:2210.09261, 2022.

V. Thakkar, P. Ramani, C. Cecka, A. Shivam, H. Lu, E. Yan, J. Kosaian, M. Hoemmen, H. Wu, A. Kerr, M. Nicely, D. Merrill, D. Blasig, F. Qiao, P. Majcher, P. Springer, M. Hohnerbach, J. Wang, and M. Gupta. CUTLASS, Jan. 2023. URL https://github.com/NVIDIA/cutlass.
V. Thakkar, P. Ramani, C. Cecka, A. Shivam, H. Lu, E. Yan, J. Kosaian, M. Hoemmen, H. Wu, A. Kerr, M. Nicely, D. Merrill, D. Blasig, F. Qiao, P. Majcher, P. Springer, M. Hohnerbach, J. Wang, 和 M. Gupta. CUTLASS, 2023 年 1 月. URLhttps://github.com/NVIDIA/cutlas S.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, 等. LLaMA: 开放和高效的基础语言模型. arXiv 预印本 arXiv:2302.13971, 2023a.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini,
R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv. 2307. 09288
R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, 和 T. Scialom. Llama 2: 开放基础和微调聊天模型. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv. 2307. 09288

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, 和 I. Polosukhin. 注意力是你所需要的一切. 神经信息处理系统进展, 30, 2017.

L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. CoRR, abs/2408.15664, 2024a. URL https://doi.org/10.48550/arXiv.2408.15664.
L. Wang, H. Gao, C. Zhao, X. Sun, 和 D. Dai. 无辅助损失的混合专家负载均衡策略. CoRR, abs/2408.15664, 2024a. URL/ https://doi.org/10.48550/arX iv. 2408.15664

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024b. URL https://doi.org/10.48550/arXiv.2406.01574.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, 和 W. Chen. Mmlu-pro: 一个更强大和具有挑战性的多任务语言理解基准. CoRR, abs/2406.01574, 2024b. URL https://doi.org/10.48550/arXiv.2406.01574.

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023.
T. Wei, J. Luan, W. Liu, S. Dong, 和 B. Wang. Cmath: 你的语言模型能通过中国小学数学测试吗?,2023。

M. Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos, A. Farhadi, and L. Schmidt. Stable and low-precision training for large-scale vision-language models. Advances in Neural Information Processing Systems, 36:10271-10298, 2023.
M. Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos, A. Farhadi, 和 L. Schmidt. 大规模视觉-语言模型的稳定和低精度训练. 神经信息处理系统进展, 36:10271-10298, 2023.

H. Xi, C. Li, J. Chen, and J. Zhu. Training transformers with 4-bit integers. Advances in Neural Information Processing Systems, 36:49146-49168, 2023.
H. Xi, C. Li, J. Chen, 和 J. Zhu. 用 4 位整数训练变换器. 神经信息处理系统进展, 36:49146-49168, 2023.

C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024.
C. S. Xia, Y. Deng, S. Dunn, 和 L. Zhang. 无代理: 揭示基于 llm 的软件工程代理. arXiv 预印本, 2024.

H. Xia, T. Ge, P. Wang, S. Chen, F. Wei, and Z. Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 3909-3925. Association for Computational Linguistics, 2023. URL https://doi.org/10.18653/v1/2023.findings-emnlp.257.
H. Xia, T. Ge, P. Wang, S. Chen, F. Wei, 和 Z. Sui. 投机解码:利用投机执行加速 seq2seq 生成. 在计算语言学协会的发现:EMNLP 2023, 新加坡, 2023 年 12 月 6-10 日, 第 3909-3925 页. 计算语言学协会, 2023. URLhttps://doi.org/10.18653/v1/ 2023.findings-emnlp.257.

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087-38099. PMLR, 2023.
G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, 和 S. Han. Smoothquant: 准确且高效的大型语言模型后训练量化. 在国际机器学习会议, 页码 38087-38099. PMLR, 2023.

L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A chinese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762-4772. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.419. URL https://doi.org/10.18653/v1/2020.coling-main.419
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, 和 Z. Lan. CLUE:一个中文语言理解评估基准。在 D. Scott, N. Bel, 和 C. Zong 编辑的《第 28 届国际计算语言学会议论文集》,COLING 2020,西班牙巴塞罗那(在线),2020 年 12 月 8-13 日,页 4762-4772。国际计算语言学委员会,2020。doi: 10.18653/V1/2020.COLING-MAIN.419。网址 https://doi.org/10.18653/v1/2020.coling-main.419

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4791-4800. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, 和 Y. Choi. HellaSwag: 机器真的能完成你的句子吗?在 A. Korhonen, D. R. Traum, 和 L. Màrquez 编辑的《第 57 届计算语言学协会会议论文集》,ACL 2019,意大利佛罗伦萨,2019 年 7 月 28 日至 8 月 2 日,第 1 卷:长篇论文,4791-4800 页。计算语言学协会,2019。doi: 10.18653 / v 1 / p 19 1472 10.18653 / v 1 / p 19 1472 10.18653//v1//p19-147210.18653 / \mathrm{v} 1 / \mathrm{p} 19-1472 。网址 https://doi.org/10.18653/v1/p1 9-1472

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, 和 N. Duan. AGIEval: 一个以人为中心的基准,用于评估基础模型。CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, 和 L. Hou. 大型语言模型的指令跟随评估. arXiv 预印本 arXiv:2311.07911, 2023.

Appendix  附录

A. Contributions and Acknowledgments
A. 贡献与致谢

Research & Engineering  研究与工程
Aixin Liu  刘爱心
Bing Xue  冰雪
Bingxuan Wang
Bochao Wu  博超 吴
Chengda Lu  程达陆
Chenggang Zhao  赵承刚
Chengqi Deng  邓承启
Chenyu Zhang*
Chong Ruan
Damai Dai  达麦·戴
Daya Guo
Dejian Yang
Deli Chen
Erhang Li
Fangyun Lin  方云林
Fucong Dai
Fuli Luo*
Guangbo Hao  光博浩
Guanting Chen  关廷晨
Guowei Li
H. Zhang
Han Bao*
Hanwei Xu
Haocheng Wang*
Haowei Zhang
Honghui Ding  丁洪辉
Huajian Xin*  华建新*
Huazuo Gao  华作高
Hui Qu  惠曲
Jianzhong Guo
Jiashi Li
Jiawei Wang*
Jingchang Chen
Jingyang Yuan  袁京阳
Junjie Qiu  邱俊杰
Junlong Li
Junxiao Song
Kai Dong  凯东
Kai Hu*  凯·胡*
Kaige Gao  高凯歌
Kang Guan
Kexin Huang  黄克欣
Kuai Yu  快鱼
Lean Wang
Lecong Zhang
Liang Zhao  梁赵
Litong Wang
Liyue Zhang
Mingchuan Zhang
Minghua Zhang
Minghui Tang
Panpan Huang  潘潘黄
Peiyi Wang
Qiancheng Wang
Qihao Zhu
Qinyu Chen
Qiushi Du
Ruiqi Ge
Ruisong Zhang
Ruizhe Pan
Runji Wang
Runxin Xu
Ruoyu Zhang
Shanghao Lu
Shangyan Zhou
Shanhuang Chen  单黄陈
Shengfeng Ye  盛丰叶
Shirong Ma
Shiyu Wang
Shuiping Yu
Shunfeng Zhou  周顺丰
Shuting Pan
Tao Yun  陶云
Tian Pei  田佩
Wangding Zeng  王丁增
Wanjia Zhao*
Wen Liu  文刘
Wenfeng Liang
Wenjun Gao  高文君
Wenqin Yu
Wentao Zhang
Xiao Bi  小比
Xiaodong Liu  刘晓东
Xiaohan Wang  小寒 王
Xiaokang Chen  小康陈
Xiaokang Zhang  张小康
Xiaotao Nie
Xin Cheng  辛城
Xin Liu
Xin Xie  辛谢
Xingchao Liu
Xingkai Yu
Xinyu Yang
Xinyuan Li
Xuecheng Su  薛成苏
Xuheng Lin
Y.K. Li
Y.Q. Wang
Y.X. Wei
Yang Zhang  杨张
Yanhong Xu
Yao Li  姚李
Yao Zhao  姚赵
Yaofeng Sun
Yaohui Wang  姚辉 王
Yi Yu  易宇
Yichao Zhang
Yifan Shi
Yiliang Xiong
Ying He  英赫
Yishi Piao
Yisong Wang
Yixuan Tan
Yiyang Ma*
Yiyuan Liu
Yongqiang Guo  郭永强
Yu Wu  余吴
Yuan Ou  袁欧
Yuduan Wang  余端王
Yue Gong  岳公
Yuheng Zou
Yujia He
Yunfan Xiong  熊云帆
Yuxiang Luo  罗宇翔
Yuxiang You
Yuxuan Liu
Yuyang Zhou
Z.F. Wu
Z.Z. Ren
Zehui Ren  任泽辉
Zhangli Sha  张丽沙
Zhe Fu
Zhean Xu
Zhenda Xie
Zhengyan Zhang  郑彦张
Zhewen Hao
Zhibin Gou
Zhicheng Ma
Zhigang Yan
Zhihong Shao
Zhiyu Wu
Zhuoshu Li
Zihui Gu
Zijia Zhu
Zijun Liu*  刘子俊*
Zilin Li
Ziwei Xie  谢子维
Ziyang Song
Ziyi Gao
Zizheng Pan
Data Annotation  数据标注
Bei Feng  北风
Hui Li
J.L. Cai
Jiaqi Ni  贾琦·倪
Lei Xu  雷旭
Meng Li  孟李
Ning Tian  宁天
R.J. Chen
R.L. Jin
Ruyi Chen
S.S. Li
Shuang Zhou
Tianyu Sun  天宇孙
X.Q. Li
Xiangyue Jin
Xiaojin Shen
Xiaosha Chen  小沙陈
Xiaowen Sun  小文孙
Xiaoxiang Wang  王小湘
Xinnan Song
Xinyi Zhou  周欣怡
Y.X. Zhu
Yanhong Xu  徐彦宏
Yanping Huang  黄燕平
Yaohui Li  姚辉 李
Yi Zheng  易正
Yuchen Zhu
Yunxian Ma  云贤马
Zhen Huang
Zhipeng Xu
Zhongyu Zhang
Business & Compliance  商业与合规
Dongjie Ji
Jian Liang
Jin Chen
Leyi Xia
Miaojun Wang
Mingming Li
Peng Zhang
Shaoqing Wu
Shengfeng Ye
T. Wang
W.L. Xiao
Wei An
Xianzu Wang
Xinxia Shan
Ying Tang
Yukun Zha
Yuting Yan
Zhen Zhang
Within each role, authors are listed alphabetically by the first name. Names marked with * denote individuals who have departed from our team.
在每个角色中,作者按名字的字母顺序列出。带有 * 标记的名字表示已离开我们团队的个人。

B. Ablation Studies for Low-Precision Training

Figure 10 | Loss curves comparison between BF16 and FP8 training. Results are smoothed by Exponential Moving Average (EMA) with a coefficient of 0.9.

B.1. FP8 v.s. BF16 Training

We validate our FP8 mixed precision framework with a comparison to BF16 training on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising approximately 16B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising approximately 230B total parameters on around 0.9T tokens. We show the training curves in Figure 10 and demonstrate that the relative error remains below 0.25% with our high-precision accumulation and fine-grained quantization strategies.
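The relative-error comparison can be sketched as follows. This is our own minimal illustration rather than the training code: it assumes the per-step losses of the BF16 and FP8 runs are available as arrays, and it applies the EMA smoothing (coefficient 0.9) mentioned in the figure caption before taking the relative gap.

```python
# Minimal sketch (ours, not the training code): comparing an FP8 loss curve against
# a BF16 baseline as described above, with EMA smoothing at coefficient 0.9.
# The synthetic curves below are illustrative only.
import numpy as np

def ema(values: np.ndarray, coeff: float = 0.9) -> np.ndarray:
    """Exponential moving average: s_t = coeff * s_{t-1} + (1 - coeff) * x_t."""
    smoothed = np.empty_like(values, dtype=np.float64)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        smoothed[t] = coeff * smoothed[t - 1] + (1 - coeff) * values[t]
    return smoothed

def relative_error(fp8_loss: np.ndarray, bf16_loss: np.ndarray) -> np.ndarray:
    """Per-step relative gap between the smoothed FP8 and BF16 loss curves."""
    fp8_s, bf16_s = ema(fp8_loss), ema(bf16_loss)
    return np.abs(fp8_s - bf16_s) / bf16_s

# Synthetic example; in practice the two arrays come from the actual training runs.
steps = np.arange(1, 1001)
bf16 = 2.0 + 1.0 / np.sqrt(steps)
fp8 = bf16 * (1 + 0.001 * np.random.default_rng(0).standard_normal(len(steps)))
print(f"max relative error: {relative_error(fp8, bf16).max():.4%}")
```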

B.2. Discussion About Block-Wise Quantization

Although our tile-wise fine-grained quantization effectively mitigates the error introduced by feature outliers, it requires different groupings for activation quantization, i.e., 1×128 in the forward pass and 128×1 in the backward pass. A similar process is also required for the activation gradient. A straightforward strategy is to apply block-wise quantization per 128×128 elements, like the way we quantize the model weights. In this way, only transposition is required for the backward pass. Therefore, we conduct an experiment where all tensors associated with Dgrad are quantized on a block-wise basis. The results reveal that the Dgrad operation, which computes the activation gradients and back-propagates to shallow layers in a chain-like manner, is highly sensitive to precision. Specifically, block-wise quantization of activation gradients leads to model divergence on an MoE model comprising approximately 16B total parameters, trained for around 300B tokens. We hypothesize that this sensitivity arises because activation gradients are highly imbalanced among tokens, resulting in token-correlated outliers (Xi et al., 2023). These outliers cannot be effectively managed by a block-wise quantization approach.
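To make the grouping difference concrete, the sketch below (ours, not the actual FP8 kernels) contrasts how a single activation outlier affects tile-wise 1×128 scales versus block-wise 128×128 scales. The 128-element granularity follows the description above; the e4m3 maximum of 448 is a property of the FP8 format, and all tensor contents are synthetic.

```python
# Illustrative sketch of the two scaling-factor groupings discussed above.
# FP8_E4M3_MAX = 448 is the largest representable magnitude in the e4m3 format.
import numpy as np

FP8_E4M3_MAX = 448.0

def tile_wise_scales(x: np.ndarray, tile: int = 128) -> np.ndarray:
    """One scale per 1x128 tile along the inner dimension: shape (M, K // tile)."""
    m, k = x.shape
    tiles = x.reshape(m, k // tile, tile)
    return np.abs(tiles).max(axis=-1) / FP8_E4M3_MAX

def block_wise_scales(x: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per 128x128 block, as used for weights: shape (M // block, K // block)."""
    m, k = x.shape
    blocks = x.reshape(m // block, block, k // block, block)
    return np.abs(blocks).max(axis=(1, 3)) / FP8_E4M3_MAX

x = np.random.default_rng(0).standard_normal((256, 512)).astype(np.float32)
x[3, 42] = 1e3  # a single token-correlated outlier
print(tile_wise_scales(x).shape, block_wise_scales(x).shape)
# The outlier inflates only one 1x128 tile's scale, but the scale of an entire
# 128x128 block, which illustrates why block-wise grouping handles
# token-correlated outliers poorly.
```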

C. Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models

We record the expert load of the 16B auxiliary-loss-based baseline and the auxiliary-loss-free model on the Pile test set. The auxiliary-loss-free model tends to have greater expert specialization across all layers, as demonstrated in Figure 10.
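As a small illustration of the metric plotted below, the following sketch (ours; the routing data is synthetic and the helper name is hypothetical) computes the relative expert load, i.e., the actual per-expert load divided by the theoretically balanced load.

```python
# Small sketch (ours, with synthetic routing data): "relative expert load" as defined
# in the figure caption, i.e. actual per-expert load divided by the load a perfectly
# balanced router would assign. `relative_expert_load` is an illustrative helper name.
import numpy as np

def relative_expert_load(expert_ids: np.ndarray, num_experts: int) -> np.ndarray:
    counts = np.bincount(expert_ids.ravel(), minlength=num_experts)
    balanced = expert_ids.size / num_experts  # theoretically balanced load per expert
    return counts / balanced

rng = np.random.default_rng(0)
# 10,000 tokens with 8 routed experts each over 64 experts; a skewed sampling
# distribution stands in for a router whose experts have specialized.
ids = rng.choice(64, size=(10_000, 8), p=rng.dirichlet(np.ones(64) * 0.3))
rel = relative_expert_load(ids, 64)
print(f"relative load range: {rel.min():.2f} to {rel.max():.2f}")
```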

[Figure panels: relative expert load for the Aux-Loss-Based and Aux-Loss-Free models across layers, evaluated on Wikipedia (en), Github, and DM Mathematics; panel groups (a) Layers 1-7, (c) Layers 13-19, (d) Layers 19-25, and (e) Layers 25-27.]
Figure 10 | Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load.