We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.
In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), with its evolution closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
| Training Costs | Pre-Training | Context Extension | Post-Training | Total |
| :--- | :---: | :---: | :---: | :---: |
| in H800 GPU Hours | 2664K | 119K | 5K | 2788K |
| in USD | \$5.328M | \$0.238M | \$0.01M | \$5.576M |
Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is \$2 per GPU hour.
We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is \$2 per GPU hour, our total training costs amount to only \$5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
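For reference, these figures are mutually consistent:

$$
\frac{180\text{K GPU hours}}{2048\ \text{GPUs}} \approx 88\ \text{hours} \approx 3.7\ \text{days}, \qquad 14.8 \times 180\text{K} = 2664\text{K},
$$
$$
(2664 + 119 + 5)\text{K} = 2788\text{K GPU hours}, \qquad 2788\text{K} \times \$2/\text{GPU hour} = \$5.576\text{M}.
$$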
Our main contributions include:
Architecture: Innovative Load Balancing Strategy and Training Objective
On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.
Pre-Training: Towards Ultimate Training Efficiency
We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.
Post-Training: Knowledge Distillation from DeepSeek-R1
We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek-R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.
Summary of Core Evaluation Results
Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2. Architecture
We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).
2.1. Basic Architecture
The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.
Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.
2.1.1. Multi-Head Latent Attention
For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^d$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:
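In formulas (restated here following the MLA formulation of DeepSeek-V2; $\mathbf{k}_{t,i}^{C}$ denotes the slice of $\mathbf{k}_t^{C}$ belonging to the $i$-th head):

$$
\begin{aligned}
\mathbf{c}_t^{KV} &= W^{DKV} \mathbf{h}_t, \\
\mathbf{k}_t^{C} &= W^{UK} \mathbf{c}_t^{KV}, \\
\mathbf{k}_t^{R} &= \operatorname{RoPE}(W^{KR} \mathbf{h}_t), \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C}\,; \mathbf{k}_t^{R}], \\
\mathbf{v}_t^{C} &= W^{UV} \mathbf{c}_t^{KV},
\end{aligned}
$$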
where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c\ (\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot\,;\cdot]$ denotes concatenation. Note that for MLA, only the blue-boxed vectors (i.e., $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$) need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).
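To make the cache saving concrete, a rough back-of-the-envelope sketch in Python (the dimensions below are illustrative assumptions, not values quoted in this section):

```python
# Per-token, per-layer KV-cache size: standard MHA vs. MLA (rough sketch).
# The dimensions below are illustrative placeholders, not values quoted in this section.
n_h = 128      # number of attention heads (assumed)
d_h = 128      # dimension per head (assumed)
d_c = 512      # KV compression dimension d_c (assumed)
d_h_rope = 64  # dimension of the decoupled RoPE key d_h^R (assumed)

mha_cache_per_token = 2 * n_h * d_h   # cache full per-head keys and values
mla_cache_per_token = d_c + d_h_rope  # cache only c_t^{KV} and k_t^R

print(f"MHA: {mha_cache_per_token} elements per token per layer")
print(f"MLA: {mla_cache_per_token} elements per token per layer")
print(f"reduction: {mha_cache_per_token / mla_cache_per_token:.1f}x")
```

With these illustrative numbers, MLA stores roughly 57 times fewer cached elements per token and layer than standard MHA.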
For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:
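A sketch of the corresponding query-side formulas, mirroring the key-value case (here $\mathbf{c}_t^{Q}$ is the compressed query latent and $W^{DQ}$, $W^{UQ}$, $W^{QR}$ are the analogous down-, up-, and RoPE-projection matrices; the symbols follow DeepSeek-V2's convention by analogy):

$$
\begin{aligned}
\mathbf{c}_t^{Q} &= W^{DQ} \mathbf{h}_t, \\
\mathbf{q}_t^{C} &= W^{UQ} \mathbf{c}_t^{Q}, \\
\mathbf{q}_t^{R} &= \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}), \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C}\,; \mathbf{q}_{t,i}^{R}].
\end{aligned}
$$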
2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:
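In formulas (a reconstruction consistent with the symbol definitions that follow):

$$
\begin{aligned}
\mathbf{h}_t' &= \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\, \operatorname{FFN}_i^{(r)}(\mathbf{u}_t), \\
g_{i,t} &= \frac{g'_{i,t}}{\sum_{j=1}^{N_r} g'_{j,t}}, \\
g'_{i,t} &=
\begin{cases}
s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \mid 1 \leqslant j \leqslant N_r \}, K_r), \\
0, & \text{otherwise},
\end{cases} \\
s_{i,t} &= \operatorname{Sigmoid}(\mathbf{u}_t^{\top} \mathbf{e}_i),
\end{aligned}
$$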
where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_i^{(s)}(\cdot)$ and $\operatorname{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i,t}$ to determine the top-K routing:
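That is, the selection rule becomes (with $g'_{i,t}$ still taking the value of the original score $s_{i,t}$ rather than the biased one):

$$
g'_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} + b_i \in \operatorname{Topk}(\{ s_{j,t} + b_j \mid 1 \leqslant j \leqslant N_r \}, K_r), \\
0, & \text{otherwise}.
\end{cases}
$$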
Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$.
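As a rough illustration (our own sketch under the definitions above, not the reference implementation), the routing can be written as:

```python
import numpy as np

def route_token(u, expert_centroids, bias, k_r):
    """Sketch of sigmoid routing with auxiliary-loss-free balancing.

    u:                (d,)    FFN input of one token
    expert_centroids: (N_r, d) centroid vectors e_i of the routed experts
    bias:             (N_r,)  per-expert bias b_i (how it is updated is not covered here)
    k_r:              number of routed experts to activate
    """
    s = 1.0 / (1.0 + np.exp(-expert_centroids @ u))  # sigmoid affinities s_{i,t}
    selected = np.argsort(s + bias)[-k_r:]           # bias is added only for top-K selection
    g_raw = np.zeros_like(s)
    g_raw[selected] = s[selected]                    # gating uses the ORIGINAL affinities
    g = g_raw / g_raw.sum()                          # normalize over the selected experts
    return selected, g
```

In this sketch, the bias vector is treated as a given input; adjusting it to keep experts balanced happens outside the forward pass and is beyond the scope of the excerpt above.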