DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Abstract
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as in reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K.
In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
1 Introduction
The open-source community has made significant strides in advancing code intelligence through the development of open-source code models such as StarCoder (Li et al., 2023b; Lozhkov et al., 2024), CodeLlama (Roziere et al., 2023), DeepSeek-Coder (Guo et al., 2024), and Codestral (MistralAI, 2024). These models have steadily approached the performance levels of closed-source counterparts, contributing to the progress of code intelligence. However, there remains a discernible gap when comparing them to state-of-the-art closed-source models like GPT4-Turbo (OpenAI, 2023), Claude 3 Opus (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024). To bridge this gap and further propel the development of open-source code models, we introduce the DeepSeek-Coder-V2 series. These models are built upon the foundation of DeepSeek-V2 (DeepSeek-AI, 2024) and are further pre-trained with an additional corpus with 6 trillion tokens.
In the pre-training phase, the dataset of DeepSeek-Coder-V2 is composed of 60% source code, 10% math corpus, and 30% natural language corpus. The source code consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl, collected using the same pipeline as DeepSeekMath (Shao et al., 2024). Compared to the code corpus used to train DeepSeek-Coder, this corpus expands from 86 to 338 programming languages. To demonstrate the effectiveness of the new code corpus, we conduct ablation studies with a 1B parameter model and observe accuracy improvements of 6.7 and 9.4 percentage points on the HumanEval (from 30.5% to 37.2%) and MBPP (from 44.6% to 54.0%) benchmarks (Chen et al., 2021; Austin et al., 2021a), respectively. For the math corpus, we collect 221B math-related tokens from CommonCrawl using the same pipeline, which approximately doubles the size of the 120B DeepSeekMath corpus (Shao et al., 2024), while for the natural language corpus, we directly sample from the training corpus of DeepSeek-V2. In total, DeepSeek-Coder-V2 has been exposed to 10.2T training tokens, of which 4.2T originate from the DeepSeek-V2 dataset and the remaining 6T come from the DeepSeek-Coder-V2 dataset.
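The token accounting in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable and function names are illustrative, and the per-domain budgets are simply the stated mixture percentages applied to the 6T continued pre-training budget.

```python
# Token accounting for DeepSeek-Coder-V2 pre-training, from the figures quoted above.
# Names are illustrative; all quantities are in trillions of tokens.
DEEPSEEK_V2_CHECKPOINT_TOKENS = 4.2  # seen by the intermediate DeepSeek-V2 checkpoint
CONTINUED_TOKENS = 6.0               # additional tokens used for DeepSeek-Coder-V2

MIXTURE = {"source_code": 0.60, "math": 0.10, "natural_language": 0.30}


def tokens_per_domain(total, mixture):
    """Split a token budget according to the mixture fractions."""
    return {name: total * share for name, share in mixture.items()}


total_tokens = DEEPSEEK_V2_CHECKPOINT_TOKENS + CONTINUED_TOKENS
budget = tokens_per_domain(CONTINUED_TOKENS, MIXTURE)

print(f"total exposure: {total_tokens:.1f}T")
for name, tokens in budget.items():
    print(f"{name}: {tokens:.1f}T")
```

Under these percentages the source-code share of the 6T budget exceeds the 1,170B unique code-related tokens, which suggests that the code data is sampled more than once; the text does not state the sampling weights, so this is only an inference.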
To accommodate longer code inputs and enhance applicability across various programming scenarios, we extend the context length from 16K to 128K tokens, allowing our models to handle more complex and extensive coding tasks. After continually pre-training DeepSeek-V2 on this multi-source corpus, we find that DeepSeek-Coder-V2 significantly enhances the model’s capabilities in coding and mathematical reasoning while maintaining comparable general language performance.
In the alignment phase, we first construct an instruction training dataset that includes code and math data from DeepSeek-Coder (Guo et al., 2024) and DeepSeek-Math (Shao et al., 2024), as well as general instruction data from DeepSeek-V2 (DeepSeek-AI, 2024). This dataset is used to fine-tune the base model. Then, in the reinforcement learning phase, we employ the Group Relative Policy Optimization (GRPO) algorithm to align the model’s behavior with human preferences. Preference data is collected in the coding domain using compiler feedback and test cases, and a reward model is developed to guide the training of the policy model. This approach ensures that the model’s responses are optimized for correctness and human preference in coding tasks. To enable the model to support code completion after alignment, we also use the Fill-In-Middle approach (Guo et al., 2024) during the fine-tuning of the base model with 16B parameters.
1.1 Contributions
In summary, our main contributions are:
• We introduce DeepSeek-Coder-V2 with 16B and 236B parameters based on the DeepSeekMoE framework, with only 2.4B and 21B active parameters, respectively, efficiently supporting diverse computational and application needs. Additionally, DeepSeek-Coder-V2 supports 338 programming languages and a maximum context length of 128K tokens.
• We make the first attempt to develop an open-source hundred-billion-parameter code model to advance the field of code intelligence. Experimental results indicate that DeepSeek-Coder-V2 236B outperforms state-of-the-art closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro, in both coding and mathematics tasks.
• DeepSeek-Coder-V2 models are released publicly under a permissive license, allowing for both research and unrestricted commercial use.
1.2 Summary of Evaluations and Metrics
• Code: On code generation benchmarks, DeepSeek-Coder-V2 demonstrates remarkable superiority over all open-source models while performing on par with the leading closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. Notably, we achieve a 90.2% score on HumanEval (Chen et al., 2021), a 76.2% score on MBPP (Austin et al., 2021a) (establishing a new state-of-the-art result with the EvalPlus evaluation pipeline), and a 43.4% score on LiveCodeBench (Jain et al., 2024) (questions from Dec. 2023 to June 2024). Additionally, DeepSeek-Coder-V2 is the first open-source model to surpass a score of 10% on SWEBench (Jimenez et al., 2023).
• Math: DeepSeek-Coder-V2 exhibits strong mathematical reasoning abilities, rivaling top closed-source models such as GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus on both elementary benchmarks like GSM8K (Cobbe et al., 2021) and advanced competition-level benchmarks including MATH (Hendrycks et al., 2021), AIME (MAA, 2024), and Math Odyssey (Netmind.AI, 2024). Notably, DeepSeek-Coder-V2 attains an accuracy of 75.7% on the MATH benchmark, nearly matching the state-of-the-art accuracy of 76.6% achieved by GPT-4o. Furthermore, it surpasses the performance of these closed-source models in the AIME 2024 competition.
• Natural Language: DeepSeek-Coder-V2 maintains general language performance comparable to DeepSeek-V2. For example, DeepSeek-Coder-V2 achieves 79.2% on MMLU with the OpenAI simple-eval pipeline. In subjective evaluation with GPT-4 as a judge, DeepSeek-Coder-V2 achieves 65.0 on arena-hard (Li et al., 2024), 8.77 on MT-bench (Zheng et al., 2023), and 7.84 on alignbench (Liu et al., 2023c). These scores are significantly better than those of other code-specific models and are even comparable with those of general open-source models.
2 Data Collection
The pre-training data for DeepSeek-Coder-V2 primarily consists of 60% source code, 10% math corpus, and 30% natural language corpus. Since the natural language corpus is directly sampled from the training dataset of DeepSeek-V2, this section focuses on the collection, cleaning, and filtering processes of the code and math data. Meanwhile, we further validate the quality of this data through comparative analysis experiments.
We collect public repositories created before November 2023 on GitHub. We first apply the same filtering rules and near-deduplication as those used in DeepSeek-Coder (Guo et al., 2024) to filter out lower-quality and duplicated source code. To make the paper self-contained, we briefly describe the filtering rules. First, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. Additionally, we remove files with fewer than 25% alphabetic characters. Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters. For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files. By applying these filtering rules and near-deduplication, we obtain 821B tokens of code encompassing 338 programming languages and 185B tokens of code-related text, such as markdown and issues. The list of supported programming languages can be found in Appendix A. We use the same tokenizer as DeepSeek-V2, detailed in DeepSeek-AI (2024).
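The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the alphabetic fraction is assumed to be measured over all characters of the file, and the HTML visible-text extraction is a crude regex stand-in for a real HTML parser.

```python
import re

XML_MARKER = "<?xml version="


def alphabetic_fraction(text):
    """Share of alphabetic characters (assumed to be measured over all characters)."""
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0


def visible_text(html):
    """Crude stand-in for HTML text extraction: drop scripts/styles, then tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]*>", " ", html)
    return " ".join(text.split())


def keep_file(text, language):
    """Return True if a file passes the filtering rules described above."""
    lines = text.splitlines() or [""]
    average_length = sum(len(line) for line in lines) / len(lines)
    maximum_length = max(len(line) for line in lines)
    if average_length > 100 or maximum_length > 1000:
        return False
    if alphabetic_fraction(text) < 0.25:
        return False
    # Files that start with an XML declaration are dropped, except for XSLT.
    if language != "XSLT" and XML_MARKER in text[:100]:
        return False
    if language == "HTML":
        visible = visible_text(text)
        if len(visible) < 100 or len(visible) < 0.20 * len(text):
            return False
    # JSON and YAML files are kept only within a 50 to 5000 character window.
    if language in {"JSON", "YAML"} and not 50 <= len(text) <= 5000:
        return False
    return True
```

For example, `keep_file("def add(a, b):\n    return a + b\n", "Python")` passes every rule, while a file containing a single 1001-character line is rejected by the line-length rule.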
To collect code-related and math-related web texts from Common Crawl, we follow the same pipeline as DeepSeekMath (Shao et al., 2024). Specifically, we select coding forums such as StackOverflow (https://stackoverflow.com), library sites such as the PyTorch documentation (https://pytorch.org/docs), and mathematics websites such as StackExchange (https://math.stackexchange.com) as our initial seed corpus. Using this seed corpus, we train a fastText model (Joulin et al., 2016) to recall more coding-related and math-related web pages. Since tokenization for languages like Chinese cannot be done through spaces, we use the Byte Pair Encoding (BPE) tokenizer from DeepSeek-V2, which significantly improves the recall accuracy of fastText. For each domain, we calculate the percentage of web pages collected in the first iteration. Domains with over 10% of web pages collected are classified as code-related or math-related. We then annotate the URLs associated with code-related or math-related content within these identified domains. Uncollected web pages linked to these URLs are added to the seed corpus. After three iterations of data collection, we gather 70B code-related tokens and 221B math-related tokens from web pages.
To further collect high-quality source code from GitHub, we also apply the same pipeline to GitHub with two iterations of data collection and collect 94B tokens of source code. The initial seed corpus is constructed by manually collecting high-quality source code, such as files containing detailed descriptions. Finally, the new code corpus consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl.
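The domain-level step of this recall loop can be sketched as follows. This is a simplified illustration under stated assumptions: the fastText classifier itself is omitted and its output is taken as a given list of recalled URLs, a "domain" is approximated by the URL's network location, and the function name `related_domains` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def related_domains(all_urls, recalled_urls, threshold=0.10):
    """Return domains where more than `threshold` of the pages were recalled.

    `all_urls` lists every crawled page; `recalled_urls` lists the pages the
    classifier collected in the first iteration (assumed to be a subset).
    """
    pages_per_domain = Counter(urlparse(url).netloc for url in all_urls)
    recalled_per_domain = Counter(urlparse(url).netloc for url in recalled_urls)
    return {
        domain
        for domain, page_count in pages_per_domain.items()
        if recalled_per_domain[domain] / page_count > threshold
    }


# Example with made-up URLs: 2 of 10 pages recalled on one site, 1 of 20 on another.
crawl = [f"https://code.example/p{i}" for i in range(10)]
crawl += [f"https://news.example/p{i}" for i in range(20)]
recalled = ["https://code.example/p0", "https://code.example/p1", "https://news.example/p0"]
print(related_domains(crawl, recalled))
```

Pages from the selected domains would then be annotated and fed back into the seed corpus for the next iteration, as described above.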
To demonstrate the effectiveness of the new code corpus, we conducted ablation studies (see Table 1) using a 1B parameter model, comparing the new corpus with the corpus used to train DeepSeek-Coder. Pre-training the 1B model on the new code corpus using 1T tokens resulted in accuracy improvements of 5.5 and 4.4 percentage points on the HumanEval (from 30.5% to 36.0%) and MBPP (from 44.6% to 49.0%) benchmarks, respectively. Further training the 1B model with 2T tokens led to additional improvements, with HumanEval and MBPP scores rising to 37.2% and 54.0%, respectively. Therefore, the new code corpus is superior to the code corpus used to train DeepSeek-Coder.
Model | Tokens | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg | MBPP |
---|---|---|---|---|---|---|---|---|---|---|---|
DeepSeek-Coder-1B | 1T | 30.5% | 28.0% | 31.7% | 23.0% | 30.8% | 31.7% | 9.5% | 28.6% | 26.7% | 44.6% |
DeepSeek-Coder-V2-1B | 1T | 36.0% | 34.8% | 31.7% | 27.3% | 37.7% | 34.2% | 6.3% | 38.5% | 31.2% | 49.0% |
DeepSeek-Coder-V2-1B | 2T | 37.2% | 39.1% | 32.3% | 31.7% | 34.6% | 36.7% | 12.0% | 32.9% | 32.0% | 54.0% |
Table 1: Performance comparison of DeepSeek-Coder and DeepSeek-Coder-V2 on the 1B base model.
3 Training Policy
3.1 Training Strategy
We use two training objectives for DeepSeek-Coder-V2 16B: Next-Token-Prediction and Fill-In-Middle (FIM) (Li et al., 2023b; Bavarian et al., 2022; Guo et al., 2024). For DeepSeek-Coder-V2 236B, we only use the Next-Token-Prediction objective.
Here we give a brief introduction to the FIM training policy. We adopt the FIM training approach for the development of DeepSeek-Coder-V2 16B, using the PSM (Prefix, Suffix, Middle) mode. This mode arranges each training document in the order Prefix, Suffix, Middle, so that the model reconstructs the middle span after seeing both the text before it and the text after it.
This structure is applied at the document level as part of the pre-packing process. The FIM is utilized at a rate of 0.5, consistent with the PSM framework, to enhance the training efficacy and model performance.
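A minimal sketch of PSM-mode FIM at a rate of 0.5 is shown below. The sentinel strings are placeholders rather than the model's actual special tokens, and splitting each document at two uniformly random character positions is an assumption; the text only specifies the Prefix, Suffix, Middle ordering, the document-level application, and the 0.5 rate.

```python
import random

# Placeholder sentinels: the real special tokens are tokenizer-specific.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"


def to_psm(document, rng, fim_rate=0.5):
    """Rewrite a document into Prefix-Suffix-Middle order with probability `fim_rate`."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # left as a plain next-token-prediction sample
    # Two random cut points split the document into prefix, middle, and suffix.
    start, end = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    # The middle span is moved to the end, so it is predicted given both sides.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


rng = random.Random(0)
sample = "def add(a, b):\n    return a + b\n"
print(to_psm(sample, rng, fim_rate=1.0))
```

Because the transformation only reorders text, concatenating the recovered prefix, middle, and suffix always reproduces the original document.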
3.2 Model Architecture
Our architecture aligns with that of DeepSeek-V2 (DeepSeek-AI, 2024). The hyperparameter settings of the 16B and 236B models correspond to those used in DeepSeek-V2-Lite and DeepSeek-V2, respectively. Notably, we encountered instability during training and spikes in gradient values, which we attributed to the exponential normalization technique. To address this, we reverted to the conventional normalization method.
3.3 Training Hyper-Parameters
Consistent with the DeepSeek-V2 methodology (DeepSeek-AI, 2024), we use the AdamW optimizer (Loshchilov and Hutter, 2019), configured with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. Batch sizes and learning rates are adjusted according to DeepSeek-V2 specifications. For learning rate scheduling, we employ a cosine decay strategy, starting with 2000 warm-up steps and gradually reducing the learning rate to 10% of its initial value.
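The learning rate schedule described above can be sketched as a small function. This is an illustration under assumptions: the warm-up is taken to be linear, and `peak_lr` and `total_steps` are hypothetical arguments, since the text defers the actual batch sizes and learning rates to the DeepSeek-V2 specifications.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps=2000, final_ratio=0.1):
    """Cosine decay with warm-up, ending at `final_ratio` of the peak learning rate."""
    if step < warmup_steps:
        # Linear warm-up (assumed): reaches the peak at the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)


# Hypothetical run of 10,000 steps with a normalized peak learning rate of 1.0.
for step in (0, 1999, 2000, 6000, 10000):
    print(step, round(learning_rate(step, total_steps=10000, peak_lr=1.0), 4))
```

With these settings the rate climbs to the peak over the first 2000 steps and then decays along a half cosine to 10% of the peak at the final step.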
Both DeepSeek-Coder-V2 and DeepSeek-Coder-V2-Lite are trained using the same methodology. To maintain robust natural language understanding capabilities in DeepSeek-Coder-V2, we continue the pre-training process from an intermediate checkpoint of DeepSeek-V2. The intermediate checkpoint was initially trained on 4.2T tokens. Consequently, DeepSeek-Coder-V2 has been exposed to a total of 10.2T high-quality tokens during the pre-training phase.
Model | DeepSeek-Coder-V2-Lite | DeepSeek-Coder-V2 |
---|---|---|
# Total Parameters (#TP) | 16B | 236B |
# Active Parameters (#AP) | 2.4B | 21B |
Pre-training Tokens | 4.2T+6T | 4.2T+6T |
LR Scheduler | Cosine | Cosine |
FIM | Enable | Disable |
Table 2: Training settings of DeepSeek-Coder-V2.
3.4 Long Context Extension
Following DeepSeek-V2, we extend the context length of DeepSeek-Coder-V2 to 128K using YaRN (Peng et al., 2023). The hyper-parameters of YaRN are the same as in DeepSeek-V2: the scale s is set to 40, α to 1, and β to 32.
We then continue training the model in two stages to enhance its capability for handling long contexts. In the first stage, we use a sequence length of 32K and a batch size of 1152 for 1000 steps. In the second stage, we train the model for an additional 1000 steps, using a sequence length of 128K and a batch size of 288 sequences. Note that we upsample the ratio of long-context data during long context extension. As shown in Figure 2, the results on the “Needle In A Haystack” (NIAH) tests indicate that DeepSeek-Coder-V2 performs well across all context window lengths up to 128K.
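A quick calculation shows how the two stages relate. The sketch assumes that "K" denotes 1024 tokens and that both batch sizes are counted in sequences (the text says so explicitly only for the second stage); the dictionary layout is illustrative.

```python
K = 1024  # assumption: "32K" and "128K" mean 32 * 1024 and 128 * 1024 tokens

STAGES = [
    {"name": "stage 1", "sequence_length": 32 * K, "batch_size": 1152, "steps": 1000},
    {"name": "stage 2", "sequence_length": 128 * K, "batch_size": 288, "steps": 1000},
]


def tokens_per_step(stage):
    """Tokens processed in one optimizer step: sequence length times batch size."""
    return stage["sequence_length"] * stage["batch_size"]


for stage in STAGES:
    per_step = tokens_per_step(stage)
    total = per_step * stage["steps"]
    print(f'{stage["name"]}: {per_step:,} tokens/step, {total / 1e9:.1f}B tokens in total')
```

Under these assumptions the sequence length grows by a factor of four while the batch size shrinks by the same factor, so both stages process the same number of tokens per step.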
3.5 Alignment
3.5.1 Supervised Fine-Tuning
To build DeepSeek-Coder-V2 Chat, we construct an instruction training dataset that mixes code and math data. We first collect 20k code-related instruction samples and 30k math-related samples from DeepSeek-Coder and DeepSeek-Math. To maintain general ability, we also sample data from the instruction data of DeepSeek-V2. The final instruction dataset contains 300M tokens. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 5e-6. We use a batch size of 1M tokens and train on 1B tokens in total.
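The fine-tuning budget above implies a step count and a number of passes over the data. The sketch assumes that "1M" and "1B" are exact powers of ten and that the 1B figure counts tokens processed during training; the constant names are illustrative.

```python
DATASET_TOKENS = 300_000_000           # size of the instruction dataset
BATCH_TOKENS = 1_000_000               # tokens per batch
TOTAL_TRAINING_TOKENS = 1_000_000_000  # tokens processed during fine-tuning
WARMUP_STEPS = 100

optimizer_steps = TOTAL_TRAINING_TOKENS // BATCH_TOKENS
passes_over_dataset = TOTAL_TRAINING_TOKENS / DATASET_TOKENS
warmup_fraction = WARMUP_STEPS / optimizer_steps

print(f"optimizer steps: {optimizer_steps}")
print(f"passes over the dataset: {passes_over_dataset:.2f}")
print(f"warm-up share of training: {warmup_fraction:.0%}")
```

Under these assumptions, fine-tuning takes about 1000 optimizer steps, with roughly three passes over the 300M-token dataset and the first 10% of steps spent on warm-up.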
3.5.2 Reinforcement Learning
We further employ Reinforcement Learning (RL) techniques to fully stimulate the capabilities of DeepSeek-Coder-V2, which proves to be quite effective.
Prompts
Considerable effort was spent collecting prompts related to code and math from various sources, and each code prompt comes with corresponding test cases. After filtering, approximately 40k prompts remain in total.
Reward Modeling
Reward models play a crucial role in RL training. For mathematical preference data, we obtain the preferences using the ground-truth labels. For code preference data, the compiler itself can already provide 0-1 feedback (whether or not the code passes all test cases). However, some code prompts have only a limited number of test cases that do not provide full coverage, so directly using the 0-1 feedback from the compiler may be noisy and sub-optimal. Therefore, we still train a reward model on the data provided by the compiler and use the reward model to provide the signal during RL training, which is more robust and generalizes better than the raw compiler signal. As illustrated in Figure 3, on our in-house test sets (Leetcode and Leetcode-zh), using a reward model to provide the RL training signal clearly outperforms using the raw compiler signal. Hence, we use the reward model signal rather than the compiler signal in all subsequent experiments.
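The noise in raw 0-1 feedback can be illustrated with a toy example. This is a simplified sketch, not the authors' setup: candidate programs are represented as Python callables instead of being compiled and executed in a sandbox, and `compiler_reward` and the test cases are hypothetical.

```python
def compiler_reward(candidate, test_cases):
    """Return 1 if the candidate passes every test case, otherwise 0."""
    for arguments, expected in test_cases:
        try:
            if candidate(*arguments) != expected:
                return 0
        except Exception:
            return 0  # a crash counts as a failure
    return 1


def buggy_absolute_value(x):
    """Incorrect solution: wrong for negative inputs."""
    return x


sparse_tests = [((3,), 3), ((0,), 0)]         # limited coverage
fuller_tests = sparse_tests + [((-2,), 2)]    # also covers a negative input

print(compiler_reward(buggy_absolute_value, sparse_tests))  # incorrect code is rewarded
print(compiler_reward(buggy_absolute_value, fuller_tests))
print(compiler_reward(abs, fuller_tests))
```

With the sparse test set the incorrect solution receives the full reward, which is the kind of false positive that motivates training a reward model on the compiler-labeled data rather than using the raw signal directly.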
Reinforcement Learning Algorithm
We employ Group Relative Policy Optimization (GRPO; Shao et al., 2024) as our RL algorithm, the same algorithm used by DeepSeek-V2. Notably, GRPO has proven to be quite effective and is less costly than PPO, since there is no need to maintain an additional critic model.
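The reason GRPO needs no critic is that it estimates advantages from a group of responses sampled for the same prompt. The sketch below shows only this group-relative normalization, following the outcome-reward formulation of Shao et al. (2024); the use of the population standard deviation and the small `eps` term are assumptions, and the clipped policy-gradient objective and KL penalty are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Each response is scored relative to its group, replacing a learned critic baseline.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical reward-model scores for four responses to the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group mean receive positive advantages and the others negative ones; if every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.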
4 Experimental Results
In this section, we evaluate DeepSeek-Coder-V2 on three types of tasks, including coding, mathematics, and general natural language. We compare DeepSeek-Coder-V2 with the previous state-of-the-art large language models.
• CodeLlama (Roziere et al., 2023) consists of a series of code language models based on Llama2 (Touvron et al., 2023), which continue pre-training on datasets ranging from 500B to 1,000B code tokens. These models are available in four sizes: 7B, 13B, 34B, and 70B.
• StarCoder (Li et al., 2023b) is a publicly accessible model with 15 billion parameters. It is specifically trained on a meticulously curated subset of the Stack dataset (Kocetkov et al., 2022), covering 86 programming languages.
• StarCoder2 (Lozhkov et al., 2024) consists of 3B, 7B, and 15B parameter models trained on 3.3 to 4.3 trillion tokens of the Stack2 dataset (Lozhkov et al., 2024), spanning 619 programming languages.
• DeepSeek-Coder (Guo et al., 2024) comprises a series of code language models, ranging from 1 billion to 33 billion parameters. Each model is trained from scratch on 2 trillion tokens, with a composition of 87% code and 13% natural language in both English and Chinese. These models are pre-trained on a project-level code corpus using a window size of 16K and an additional fill-in-the-blank task, enabling support for project-level code completion and infilling.
• Codestral (MistralAI, 2024) is a 22B parameter model developed by Mistral. It is trained on a diverse dataset of over 80 programming languages, including popular ones such as Python, Java, and JavaScript, as well as more specialized languages like Swift and Fortran.
• General language models that we compare include Llama3 70B (Meta, 2024), GPT-4 (OpenAI, 2023), Claude 3 Opus (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024). While they are not specifically trained on large code corpora, they achieve state-of-the-art performance in coding.
4.1 Code Generation
HumanEval and MBPP Benchmarks.
The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021b) benchmarks are commonly used to assess the performance of code-generating Large Language Models (LLMs). HumanEval comprises 164 Python tasks that are verified through test cases to evaluate the performance of Code LLMs in a zero-shot scenario. To build the HumanEval instruction prompt, we use the template "Please complete the python function below. The final complete version of your function must be returned within a code block. Here is the unfinished function:\n ```python\n{problem_description}\n\n". For MBPP, we use the MBPP-Plus version (Liu et al., 2023a) to evaluate the models. To test the multilingual abilities of models, we extended the HumanEval benchmark problems into thirteen additional languages: C++, Java, PHP, TypeScript, C#, Bash, JavaScript, Swift, R, Julia, D, Rust, and Racket. For both benchmarks, we employed a greedy search strategy and recreated the baseline results using identical scripts and environments to ensure a fair comparison.
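Building the instruction prompt from the quoted template can be sketched as below. The template string is copied from the text, with the escape sequences interpreted as newlines; the function name `build_prompt` and the sample problem are hypothetical.

```python
FENCE = "`" * 3  # the three backticks that open the code block in the template

TEMPLATE = (
    "Please complete the python function below. "
    "The final complete version of your function must be returned within a code block. "
    "Here is the unfinished function:\n "
    + FENCE
    + "python\n{problem_description}\n\n"
)


def build_prompt(problem_description):
    """Fill the template with one HumanEval problem (signature plus docstring)."""
    return TEMPLATE.format(problem_description=problem_description)


sample_problem = 'def add(a, b):\n    """Return the sum of a and b."""'
print(build_prompt(sample_problem))
```

The prompt deliberately ends inside an opened code block, so the model's completion is expected to continue with the finished function.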
Table 3 provides an extensive overview of the performance of various models across multiple programming languages on the HumanEval and MBPP+ benchmarks. DeepSeek-Coder-V2-Instruct demonstrates exceptional performance, securing the second-highest average score of 75.3%. This result is notable because it breaks the dominance typically seen from closed-source models and makes DeepSeek-Coder-V2-Instruct a leading open-source contender. It is surpassed only by GPT-4o, which leads with an average score of 76.4%. DeepSeek-Coder-V2-Instruct shows top-tier results across a variety of languages, including the highest scores in Java and PHP and strong performance in Python, C++, C#, TypeScript, and JavaScript, underscoring its robustness and versatility in handling diverse coding challenges.
Furthermore, the DeepSeek-Coder-V2-Lite-Instruct also performs impressively, surpassing the larger 33B model. With a considerable margin in average performance (65.6% vs. 61.9%), it highlights the effectiveness of the 16B model in delivering competitive results despite its smaller size. This underscores the model’s efficiency and the advancements in model architecture and training methodologies that allow it to outperform larger counterparts.
Competitive Programming. 竞争性编程。
To further validate the model’s capability on real-world competitive programming problems, we use LiveCodeBench (Jain et al., 2024) and the USACO benchmark (Shi et al., 2024) to estimate the effectiveness of DeepSeek-Coder-V2. LiveCodeBench is a meticulous, contamination-free assessment of Large Language Models (LLMs) for code generation that systematically gathers novel challenges over time from three prominent competitive programming platforms: LeetCode, AtCoder, and CodeForces. Since the cut-off of the training data is before November 2023, we use the subset (1201-0601) of LiveCodeBench. The USACO benchmark contains 307 problems from the USA Computing Olympiad, along with high-quality unit tests, reference code, and official analyses for each problem.
Model | #TP | #AP | LiveCodeBench Easy (82) | LiveCodeBench Medium (87) | LiveCodeBench Hard (57) | LiveCodeBench Overall (226) | USACO
---|---|---|---|---|---|---|---
Closed-Source Models | | | | | | |
Gemini-1.5-Pro | - | - | 74.9% | 16.8% | 1.8% | 34.1% | 4.9%
Claude-3-Opus | - | - | 77.2% | 16.7% | 0.7% | 34.6% | 7.8%
GPT-4-1106 | - | - | 78.4% | 20.2% | 3.5% | 37.1% | 11.1%
GPT-4-Turbo-0409 | - | - | 84.1% | 35.4% | 6.1% | 45.7% | 12.3%
GPT-4o-0513 | - | - | 87.4% | 27.5% | 4.9% | 43.4% | 18.8%
Open-Source Models | | | | | | |
Codestral | 22B | 22B | 66.5% | 17.7% | 0.2% | 31.0% | 4.6%
DS-Coder-instruct | 33B | 33B | 51.6% | 9.7% | 0.4% | 22.5% | 4.2%
Llama3-Instruct | 70B | 70B | 62.4% | 14.4% | 2.1% | 28.7% | 3.3%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 58.5% | 8.0% | 0.0% | 24.3% | 6.5%
DS-Coder-V2-Instruct | 236B | 21B | 84.1% | 29.9% | 5.3% | 43.4% | 12.1%
Table 4: Performance on the LiveCodeBench (LCB) and USACO benchmarks.
Table 4 shows the performance of various language models on the two benchmarks. DeepSeek-Coder-V2-Instruct achieves an overall LiveCodeBench score of 43.4%, on par with GPT-4o-0513. This places it joint second overall, behind only GPT-4-Turbo-0409, which leads with 45.7%. On USACO, it scores 12.1%, the highest among the open-source models listed. These results establish DeepSeek-Coder-V2-Instruct as a top contender on complex coding challenges, closely trailing the leading GPT-4-Turbo variant.
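As a reading aid for Table 4, the Overall column appears consistent with an average of the Easy, Medium, and Hard pass rates weighted by the number of problems at each level. The sketch below checks that assumption for two rows. The weighting scheme is inferred from the table rather than stated in the text, and the function name `weighted_overall` is ours.

```python
# Problem counts per difficulty level, taken from the Table 4 header.
COUNTS = {"easy": 82, "medium": 87, "hard": 57}

def weighted_overall(rates):
    """Combine per-difficulty pass rates (in %) into one overall rate,
    weighting each level by its number of problems (an assumed scheme)."""
    total = sum(COUNTS.values())
    return sum(rates[level] * COUNTS[level] for level in COUNTS) / total

# Per-difficulty rates copied from Table 4.
gpt4_turbo = {"easy": 84.1, "medium": 35.4, "hard": 6.1}
ds_coder_v2 = {"easy": 84.1, "medium": 29.9, "hard": 5.3}

print(round(weighted_overall(gpt4_turbo), 1))   # 45.7
print(round(weighted_overall(ds_coder_v2), 1))  # 43.4
```

Because the per-level rates are themselves rounded, a recomputed value can differ from the reported one by about 0.1 for some rows.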
4.2 Code Completion
4.2.1 Repository-Level Code Completion Evaluation
We use RepoBench (Liu et al., 2023b) to evaluate the capabilities of currently available open-source code models with sizes below 35B in repository-level code completion tasks. This dataset is constructed from a diverse set of real-world, open-sourced, permissively licensed repositories in two popular programming languages: Python and Java. Notably, the latest version (v1.1) of RepoBench sources its data from GitHub repositories created between October 6th and December 31st, 2023, while our pre-training data includes code created before November 2023. To ensure this dataset was not present in our pre-training data and avoid data leakage, we only use data from December 2023.
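The leakage safeguard described above amounts to a date filter. A minimal sketch, assuming each sample carries the creation date of its repository; the record layout and the field name `created_at` are hypothetical and are not taken from RepoBench.

```python
from datetime import date

# Hypothetical records; the field names are illustrative only.
samples = [
    {"repo": "a/b", "created_at": date(2023, 10, 20)},
    {"repo": "c/d", "created_at": date(2023, 11, 15)},
    {"repo": "e/f", "created_at": date(2023, 12, 3)},
]

def december_2023_only(records):
    """Keep samples from repositories created in December 2023,
    i.e. after the pre-training data cut-off."""
    start, end = date(2023, 12, 1), date(2023, 12, 31)
    return [r for r in records if start <= r["created_at"] <= end]

kept = december_2023_only(samples)
print([r["repo"] for r in kept])  # ['e/f']
```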
Our evaluation includes five context length levels—2k, 4k, 8k, 12k, and 16k tokens—across three settings: cross-file-first, cross-file-random, and in-file. We use greedy search for all models under evaluation. The models were constrained to generate a maximum of 64 new tokens per prompt, and the first non-empty and non-comment line of the output was selected as the prediction. The maximum token length for prompts was set to 15,800 by truncating excess cross-file context. We report the average exact match for the different context length levels.
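The post-processing and metric described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation harness itself: comments are detected only by the line prefixes `#` (Python) and `//` (Java), block comments are ignored, and the function names are ours.

```python
def first_code_line(generation, comment_prefixes=("#", "//")):
    """Return the first non-empty, non-comment line of a model output."""
    for line in generation.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefixes):
            return stripped
    return ""

def exact_match_rate(generations, references):
    """Percentage of outputs whose first code line equals the reference line."""
    hits = sum(
        first_code_line(gen) == ref.strip()
        for gen, ref in zip(generations, references)
    )
    return 100.0 * hits / len(references)

outputs = ["\n# load config\ncfg = load(path)\nreturn cfg", "x = 1"]
targets = ["cfg = load(path)", "x = 2"]
print(exact_match_rate(outputs, targets))  # 50.0
```

The reported number per language would then be the mean of this rate over the five context length levels.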
Model | #TP | #AP | Python 2k | Python 4k | Python 8k | Python 12k | Python 16k | Python Avg | Java 2k | Java 4k | Java 8k | Java 12k | Java 16k | Java Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
StarCoder2-Base | 15B | 15B | 35.7% | 36.7% | 34.6% | 27.4% | 25.1% | 32.1% | 46.2% | 45.0% | 39.8% | 30.5% | 30.7% | 38.7%
CodeLlama-Base | 7B | 7B | 32.0% | 34.4% | 35.3% | 33.3% | 32.2% | 33.5% | 43.1% | 42.1% | 40.4% | 37.0% | 40.3% | 40.6%
CodeLlama-Base | 13B | 13B | 33.0% | 36.5% | 37.0% | 34.6% | 35.0% | 35.2% | 43.5% | 44.8% | 40.7% | 38.6% | 41.1% | 41.8%
CodeLlama-Base | 34B | 34B | 35.3% | 37.5% | 39.5% | 34.9% | 35.6% | 36.6% | 45.9% | 45.4% | 42.5% | 41.0% | 41.2% | 43.3%
DS-Coder-Base | 6.7B | 6.7B | 36.1% | 37.5% | 38.2% | 34.0% | 35.0% | 36.2% | 46.8% | 46.4% | 42.9% | 38.8% | 40.8% | 43.3%
DS-Coder-Base | 33B | 33B | 39.7% | 40.1% | 40.0% | 36.9% | 38.5% | 39.1% | 47.9% | 47.7% | 43.3% | 40.9% | 43.6% | 44.8%
Codestral | 22B | 22B | 42.1% | 44.3% | 46.6% | 46.6% | 51.5% | 46.1% | 48.3% | 47.8% | 46.0% | 42.2% | 43.9% | 45.7%
DS-Coder-V2-Lite-Base | 16B | 2.4B | 38.3% | 38.6% | 40.6% | 38.3% | 38.7% | 38.9% | 48.8% | 45.7% | 42.4% | 38.1% | 41.1% | 43.3%
Table 5: Performance of different models on the December subset of RepoBench v1.1.
As shown in Table 5, the DeepSeek-Coder-V2-Lite-Base model, despite having only 2.4 billion active parameters, achieves code completion capabilities in Python comparable to the DeepSeek-Coder-Base 33B model (38.9% vs. 39.1% on average) and in Java comparable to the DeepSeek-Coder-Base 6.7B model (43.3% for both). DeepSeek-Coder-V2-Lite-Base has only about one-tenth of the active parameters of Codestral (2.4B vs. 22B), which results in lower performance in code completion tasks. However, we believe that the smaller number of active parameters in DeepSeek-Coder-V2 makes it faster in code completion scenarios.
4.2.2 Fill-in-the-Middle Code Completion
DeepSeek-Coder-V2-Lite is trained with a Fill-In-the-Middle (FIM) rate of 0.5 during its pre-training phase. This method lets the model complete code by filling in a blank using the surrounding context, that is, both the preceding and the following code segments. This ability is particularly advantageous for code completion tools. Several open-source models, such as SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023b), and CodeLlama (Roziere et al., 2023), also leverage similar capabilities and have established high standards in the domain of code generation and completion.
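To make the FIM rate concrete: a rate of 0.5 means that roughly half of the training documents are rearranged so that the middle segment is predicted from both sides. The sketch below assumes a prefix-suffix-middle ordering, character-level cut points, and placeholder sentinel strings `<PRE>`, `<SUF>`, and `<MID>`. None of these details are specified in this section, so treat them as illustrative.

```python
import random

# Placeholder sentinels; the real special tokens are model-specific.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def maybe_fim(document, fim_rate=0.5, rng=random):
    """With probability fim_rate, rewrite a document as prefix/suffix/middle
    so the model learns to generate the middle from both sides."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # ordinary left-to-right example, unchanged
    # Pick two cut points that split the text into prefix, middle, suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
code = "def add(a, b):\n    return a + b\n"
examples = [maybe_fim(code, rng=rng) for _ in range(1000)]
fim_share = sum(e.startswith(PRE) for e in examples) / len(examples)
print(f"share of FIM examples: {fim_share:.2f}")
```

At inference time a completion tool would supply the text before and after the cursor as prefix and suffix and let the model generate the middle.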
To evaluate the performance of DeepSeek-Coder-V2 models, we conducted a comparative analysis against leading models. The assessment was based on the Single-Line Infilling benchmarks, covering three different programming languages as described by Allal et al. (2023). The main metric for this evaluation was line exact match accuracy. (Footnote 5: We use the first generated line rather than the whole generated chunk, so the results differ slightly from those reported for DeepSeek-Coder.)
Model | #TP | #AP | Python | Java | JavaScript | Mean
---|---|---|---|---|---|---
StarCoder | 16B | 16B | 71.5% | 82.3% | 83.0% | 80.2%
CodeLlama-Base | 7B | 7B | 58.6% | 70.6% | 70.7% | 68.0%
CodeLlama-Base | 13B | 13B | 60.7% | 74.3% | 78.5% | 73.1%
DS-Coder-Base | 1B | 1B | 74.1% | 85.1% | 82.9% | 81.8%
DS-Coder-Base | 7B | 7B | 79.8% | 89.6% | 86.3% | 86.1%
DS-Coder-Base | 33B | 33B | 80.5% | 88.4% | 86.6% | 86.4%
Codestral | 22B | 22B | 77.2% | 83.2% | 85.9% | 83.0%
DS-Coder-V2-Lite-Base | 16B | 2.4B | 80.0% | 89.1% | 87.2% | 86.4%
Table 6: Performance of different approaches on the FIM tasks. (Footnote 6, on the StarCoder row: StarCoder2 has some problems with FIM, so we still use StarCoder.)
Table 6 presents the performance of various code models on FIM (Fill-in-the-Middle) tasks across three programming languages, Python, Java, and JavaScript, with the Mean column indicating overall effectiveness. DeepSeek-Coder-V2-Lite-Base, with only 2.4B active parameters, scores 80.0% in Python, 89.1% in Java, and 87.2% in JavaScript, for a Mean score of 86.4% that ties the best result in the table (DeepSeek-Coder-Base 33B). This demonstrates the effectiveness of DeepSeek-Coder-V2-Lite-Base on FIM tasks across different programming languages, where it performs comparably to the larger models in the evaluation.
4.3 Code Fixing
To evaluate the bug-fixing capabilities of the model, we used the Defects4J (https://github.com/rjust/defects4j), SWE-bench (Jimenez et al., 2023), and Aider (https://github.com/paul-gauthier/aider) datasets for testing. Defects4J is a widely used dataset in the field of software engineering, specifically designed for evaluating and testing program repair techniques. It consists of a collection of real-world software bugs from various open-source projects, including but not limited to Apache Commons, JFreeChart, and Closure Compiler. Each bug in the dataset is accompanied by test suites that can be used to validate the effectiveness of program repair tools. Since fixing the original bugs in Defects4J may require modifying several files in the repository, which results in a long context, we collect from this benchmark the 238 bugs that require modifying only one method.
SWE-bench is a comprehensive benchmark designed to evaluate the performance of large language models in addressing real-world software issues sourced from GitHub. The benchmark presents a codebase alongside a specific issue, challenging the language model to generate a patch that effectively resolves the described problem. This rigorous evaluation framework ensures that the language model’s ability to understand and fix real-world software issues is thoroughly tested, providing a clear measure of its practical utility and effectiveness in software development tasks.
Aider’s code editing benchmark evaluates an LLM’s ability to modify Python source files across 133 distinct coding tasks. This benchmark not only tests the LLM’s coding skills but also checks its consistency in producing code edits according to the specifications in the prompt. For DeepSeek-Coder-V2 models, we use the whole format for evaluation.
Model | #TP | #AP | Defects4J | SWE-Bench | Aider
---|---|---|---|---|---
Closed-Source Models | | | | |
Gemini-1.5-Pro | - | - | 18.6% | 19.3% | 57.1%
Claude-3-Opus | - | - | 25.5% | 11.7% | 68.4%
GPT-4-1106 | - | - | 22.8% | 22.7% | 65.4%
GPT-4-Turbo-0409 | - | - | 24.3% | 18.3% | 63.9%
GPT-4o-0513 | - | - | 26.1% | 26.7% | 72.9%
Open-Source Models | | | | |
Codestral | 22B | 22B | 17.8% | 2.7% | 51.1%
DS-Coder-Instruct | 33B | 33B | 11.3% | 0.0% | 54.5%
Llama3-Instruct | 70B | 70B | 16.2% | - | 49.2%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 9.2% | 0.0% | 44.4%
DS-Coder-V2-Instruct | 236B | 21B | 21.0% | 12.7% | 73.7%
Table 7: Performance of different models on code repair benchmarks. We do not evaluate Llama3-Instruct on SWE-Bench because it supports only an 8K context length.
Table 7 outlines the performance of different language models on software repair benchmarks, including Defects4J, SWE-Bench, and Aider. DeepSeek-Coder-V2-Instruct achieves the best performance among the open-source models. It scores 21.0% on Defects4J and 12.7% on SWE-Bench, approaching the results of leading closed-source models and demonstrating significant capability in handling longer code sequences. Notably, DeepSeek-Coder-V2-Instruct achieves the highest score of 73.7% on Aider, surpassing all other models listed, including closed-source counterparts. This performance highlights its efficiency and robustness in automated code repair tasks, positioning DeepSeek-Coder-V2-Instruct as the top open-source model and a formidable competitor to closed-source alternatives in the field.
4.4 Code Understanding and Reasoning
To assess the code reasoning capabilities of our models, we utilize the CRUXEval benchmark. This benchmark comprises 800 Python functions paired with corresponding input-output examples. It is divided into two distinct tasks: CRUXEval-I (input prediction), where the large language model (LLM) must predict an input that produces the known output, and CRUXEval-O (output prediction), which requires the model to predict the output for a given input. This structure challenges the model’s ability to understand and reason through Python code in both forward and reverse directions.
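Both directions of the benchmark can be scored by execution. The sketch below uses a toy function of our own, not an actual benchmark item: output prediction is checked against the result of running the function on the given input, while input prediction accepts any input that makes the function return the given output.

```python
def f(s):
    """Toy function in the spirit of the benchmark (not a real benchmark item)."""
    return s.upper() + str(len(s))

given_input, given_output = "ab", "AB2"

def check_output_prediction(predicted_output):
    """Forward direction: the prediction must equal f(given_input)."""
    return f(given_input) == predicted_output

def check_input_prediction(predicted_input):
    """Reverse direction: any input that reproduces given_output is accepted."""
    return f(predicted_input) == given_output

print(check_output_prediction("AB2"))  # True
print(check_input_prediction("Ab"))    # True, "Ab" also maps to "AB2"
print(check_input_prediction("abc"))   # False
```

The reverse direction is not a string comparison against the original input, since several inputs can map to the same output.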
Model | #TP | #AP | CruxEval-I-COT | CruxEval-O-COT
---|---|---|---|---
Closed-Source Models | | | |
Gemini-1.5-Pro | - | - | 67.0% | 77.5%
Claude-3-Opus | - | - | 73.4% | 82.0%
GPT-4-1106 | - | - | 75.5% | 77.1%
GPT-4-Turbo-0409 | - | - | 75.7% | 82.0%
GPT-4o-0513 | - | - | 77.4% | 88.7%
Open-Source Models | | | |
Codestral | 22B | 22B | 48.0% | 60.6%
DS-Coder-Instruct | 33B | 33B | 47.3% | 50.6%
Llama3-Instruct | 70B | 70B | 61.1% | 64.3%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 53.0% | 52.9%
DS-Coder-V2-Instruct | 236B | 21B | 70.0% | 75.1%
Table 8: Performance of different models on the CruxEval benchmark.
Table 8 presents the performance of various language models on the CruxEval benchmark under two metrics, CruxEval-I-COT and CruxEval-O-COT. Among the open-source models, DeepSeek-Coder-V2-Instruct stands out, scoring 70.0% on CruxEval-I-COT and 75.1% on CruxEval-O-COT. However, a performance gap remains relative to the larger closed-source models. This gap may largely be attributed to the fact that DeepSeek-Coder-V2-Instruct operates with only 21 billion activated parameters, which is considerably fewer than those in larger, more advanced closed-source models like GPT-4o. This limit on model capacity could restrict its learning and problem-solving abilities.
4.5 Mathematical Reasoning
To assess the mathematical reasoning capabilities of DeepSeek-Coder-V2, we utilized the popular grade-school benchmark GSM8K (Cobbe et al., 2021), along with advanced competition-level benchmarks including MATH (Hendrycks et al., 2021), the American Invitational Mathematics Examination (AIME) 2024 (MAA, 2024), and Math Odyssey (Netmind.AI, 2024). (Footnote 9: The performance of DeepSeek-Coder-V2 on the four mathematical benchmarks was obtained with zero-shot chain-of-thought prompting; each test question was concatenated with the instruction: "\nPlease reason step by step, and put your final answer within \boxed{}.")
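The zero-shot prompting setup in the footnote can be sketched as below. The instruction string assumes that the garbled characters in the extracted text stand for a leading newline and the LaTeX `\boxed{}` command. The answer-extraction helper is our own illustration of how a final answer might be read back from a response; it is not described in the paper.

```python
INSTRUCTION = "\nPlease reason step by step, and put your final answer within \\boxed{}."

def build_prompt(question):
    """Zero-shot chain-of-thought prompt: the question followed by the instruction."""
    return question + INSTRUCTION

def extract_boxed(response):
    """Return the contents of the last \\boxed{...} in a response, or None.
    Braces are matched by counting so nested groups such as \\frac{1}{2} survive."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth, chars = 1, []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces

print(build_prompt("What is 1/4 + 1/4?"))
print(extract_boxed("So the sum is \\boxed{\\frac{1}{2}}."))  # \frac{1}{2}
```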
Model | #TP | #AP | GSM8K | MATH | AIME 2024 | Math Odyssey
---|---|---|---|---|---|---
Closed-Source Models | | | | | |
Gemini-1.5-Pro | - | - | 90.8% | 67.7% | 2/30 | 45.0%
Claude-3-Opus | - | - | 95.0% | 60.1% | 2/30 | 40.6%
GPT-4-1106 | - | - | 91.4% | 64.3% | 1/30 | 49.1%
GPT-4-Turbo-0409 | - | - | 93.7% | 73.4% | 3/30 | 46.8%
GPT-4o-0513 | - | - | 95.8% | 76.6% | 2/30 | 53.2%
Open-Source Models | | | | | |
Llama3-Instruct | 70B | 70B | 93.0% | 50.4% | 1/30 | 27.9%
DS-Coder-V2-Lite-Instruct | 16B | 2.4B | 86.4% | 61.8% | 0/30 | 44.4%
DS-Coder-V2-Instruct | 236B | 21B | 94.9% | 75.7% | 4/30 | 53.7%
Table 9: Performance of different models on mathematical reasoning. DeepSeek-Coder-V2-Instruct achieves 5/30 on AIME 2024 with maj@64.
The results, presented in Table 9, were obtained using greedy decoding without the aid of tools or voting techniques, unless otherwise specified.
DeepSeek-Coder-V2 achieved an accuracy of 75.7% on the MATH benchmark and 53.7% on Math Odyssey, comparable to the state-of-the-art GPT-4o.
Additionally, DeepSeek-Coder-V2 solves more problems from AIME 2024 than the other models, demonstrating its strong mathematical reasoning capabilities.
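For contrast with the greedy-decoding results, the voting setting mentioned for AIME 2024 (maj@64) can be read as sampling many responses per problem and keeping the most frequent final answer. The sketch below assumes that reading; the sampled answers are made up for illustration, and the tie-breaking rule is our choice.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled responses.
    Responses without a parsed answer (None) are ignored; ties go to
    the answer that was seen first."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

# Hypothetical final answers extracted from several sampled solutions.
sampled = ["204", "204", "113", None, "204", "113"]
print(majority_vote(sampled))  # 204
```

With maj@64, the list would hold 64 sampled answers per problem instead of six.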
4.6 General Natural Language
As DeepSeek-Coder-V2 is built upon DeepSeek-V2, it inherits the strong natural language capability, even surpassing DeepSeek-V2 on reasoning-related benchmarks. We compare DeepSeek-Coder-V2 Instruct with DeepSeek-V2 Chat on standard English and Chinese benchmarks, including BigBench Hard (BBH) (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), TriviaQA (Joshi et al., 2017), NaturalQuestions (Kwiatkowski et al., 2019), AGIEval (Zhong et al., 2023), CLUEWSC (Xu et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023a).
We also evaluate the open-ended generation ability of the models on Arena-Hard (Li et al., 2024), AlpacaEval 2.0 (Dubois et al., 2024), MT-Bench (Zheng et al., 2023), and AlignBench (Liu et al., 2023c).
The evaluation pipeline and metrics are the same as in DeepSeek-V2, where MMLU is evaluated using the OpenAI simple-evals package (https://github.com/openai/simple-evals).
Benchmark (Metric) | # Shots | DeepSeek-V2-Lite Chat | DeepSeek-Coder-V2-Lite Instruct | DeepSeek-V2 Chat | DeepSeek-Coder-V2 Instruct
---|---|---|---|---|---
# Active Params | - | 2.4B | 2.4B | 21B | 21B
# Total Params | - | 16B | 16B | 236B | 236B
# Training Tokens | - | 5.7T | 10.2T | 8.1T | 10.2T
English | | | | |
BBH (EM) | 3-shot | 48.1 | 61.2 | 79.7 | 83.9
MMLU (Acc.) | 5-shot | 55.7 | 60.1 | 78.1 | 79.2
ARC-Easy (Acc.) | 25-shot | 86.1 | 88.9 | 98.1 | 97.4
ARC-Challenge (Acc.) | 25-shot | 73.4 | 77.4 | 92.3 | 92.8
TriviaQA (EM) | 5-shot | 65.2 | 59.5 | 86.7 | 82.3
NaturalQuestions (EM) | 5-shot | 35.5 | 30.8 | 53.4 | 47.5
AGIEval (Acc.) | 0-shot | 42.8 | 28.7 | 61.4 | 60.0
Chinese | | | | |
CLUEWSC (EM) | 5-shot | 80.0 | 76.5 | 89.9 | 85.9
C-Eval (Acc.) | 5-shot | 60.1 | 61.6 | 78.0 | 79.4
CMMLU (Acc.) | 5-shot | 62.5 | 62.7 | 81.6 | 80.9
Open-ended | | | | |
Arena-Hard | - | 11.40 | 38.10 | 41.60 | 65.00
AlpacaEval 2.0 | - | 16.85 | 17.74 | 38.90 | 36.92
MT-Bench | - | 7.37 | 7.81 | 8.97 | 8.77
Alignbench | - | 6.02 | 6.83 | 7.91 | 7.84
Table 10: Comparison of DeepSeek-Coder-V2 Instruct with DeepSeek-V2 Chat.
When comparing the 16B models, DeepSeek-Coder-V2-Lite-Instruct clearly outperforms DeepSeek-V2-Lite-Chat on benchmarks like BBH (61.2 vs. 48.1) and Arena-Hard (38.10 vs. 11.40). These benchmarks place a high demand on the model’s reasoning ability, at which DeepSeek-Coder-V2-Lite-Instruct excels. However, DeepSeek-Coder-V2-Lite-Instruct falls behind on knowledge-intensive benchmarks like TriviaQA (59.5 vs. 65.2), primarily due to the relatively smaller amount of web data used during pre-training.
Moving on to the 236B models, DeepSeek-Coder-V2 Instruct exhibits greater strength in reasoning benchmarks, particularly in Arena-Hard (65.00 vs. 41.60), which comprises a substantial proportion of code, math, and reasoning questions. On the other hand, DeepSeek-V2 Chat demonstrates slightly better results in benchmarks such as MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023c). This advantage can be attributed to the general-purpose alignment stage of DeepSeek-V2 Chat.
5 Conclusion
In this paper, we introduce DeepSeek-Coder-V2 to further advance the field of code intelligence. The model is continually pre-trained from DeepSeek-V2 on 6 trillion tokens sourced from a high-quality, multi-source corpus. Through this continued pre-training, we find that DeepSeek-Coder-V2 significantly enhances the model’s capabilities in coding and mathematical reasoning while maintaining general language performance comparable to DeepSeek-V2. Compared to DeepSeek-Coder, DeepSeek-Coder-V2 supports a significantly larger number of programming languages, increasing from 86 to 338, and extends the maximum context length from 16K to 128K tokens. Experimental results demonstrate that DeepSeek-Coder-V2 achieves performance comparable to state-of-the-art closed-source models such as GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro in code and math-specific tasks.
Although DeepSeek-Coder-V2 achieves impressive performance on standard benchmarks, we find that there is still a significant gap in instruction-following capabilities compared to current state-of-the-art models like GPT-4 Turbo. This gap leads to poor performance in complex scenarios and tasks such as those in SWE-bench. Therefore, we believe that a code model needs not only strong coding abilities but also exceptional instruction-following capabilities to handle real-world complex programming scenarios. In the future, we will focus more on improving the model’s instruction-following capabilities to better handle such scenarios and enhance the productivity of the development process.
References
- Allal et al. (2023) L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988, 2023.
- Anthropic (2024) A. Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
- Austin et al. (2021a) J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021a.
- Austin et al. (2021b) J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021b.
- Bavarian et al. (2022) M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
- Chen et al. (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Clark et al. (2018) P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
- Cobbe et al. (2021) K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- DeepSeek-AI (2024) DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Dubois et al. (2024) Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- Guo et al. (2024) D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
- Hendrycks et al. (2020) D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Hendrycks et al. (2021) D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Huang et al. (2023) Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
- Jain et al. (2024) N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.
- Jimenez et al. (2023) C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- Joshi et al. (2017) M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
- Joulin et al. (2016) A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
- Kocetkov et al. (2022) D. Kocetkov, R. Li, L. Jia, C. Mou, Y. Jernite, M. Mitchell, C. M. Ferrandis, S. Hughes, T. Wolf, D. Bahdanau, et al. The stack: 3 tb of permissively licensed source code. Transactions on Machine Learning Research, 2022.
- Kwiatkowski et al. (2019) T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466, 2019. 10.1162/tacl_a_00276. URL https://doi.org/10.1162/tacl_a_00276.
- Li et al. (2023a) H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023a.
- Li et al. (2023b) R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023b.
- Li et al. (2024) T. Li, W.-L. Chiang, E. Frick, L. Dunlap, B. Zhu, J. E. Gonzalez, and I. Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024. URL https://lmsys.org/blog/2024-04-19-arena-hard/.
- Liu et al. (2023a) J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
- Liu et al. (2023b) T. Liu, C. Xu, and J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2023b.
- Liu et al. (2023c) X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking chinese alignment of large language models. CoRR, abs/2311.18743, 2023c. 10.48550/ARXIV.2311.18743. URL https://doi.org/10.48550/arXiv.2311.18743.
- Loshchilov and Hutter (2019) I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.
- Lozhkov et al. (2024) A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
- MAA (2024) MAA. American invitational mathematics examination - aime. American Invitational Mathematics Examination - AIME 2024, 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.
- Meta (2024) Meta. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/, April 2024.
- MistralAI (2024) MistralAI. Codestral. https://mistral.ai/news/codestral/, 2024. Accessed: 2024-05-29.
- Netmind.AI (2024) Netmind.AI. Odyssey-math. https://github.com/protagolabs/odyssey-math/tree/main, 2024. Accessed: April 22, 2024.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Peng et al. (2023) B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- Reid et al. (2024) M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Roziere et al. (2023) B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Shao et al. (2024) Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Shi et al. (2024) Q. Shi, M. Tang, K. Narasimhan, and S. Yao. Can language models solve olympiad programming? arXiv preprint arXiv:2404.10952, 2024.
- Suzgun et al. (2022) M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Touvron et al. (2023) H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Xu et al. (2020) L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762–4772. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.419. URL https://doi.org/10.18653/v1/2020.coling-main.419.
- Zheng et al. (2023) L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Zhong et al. (2023) W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
Appendix A Supported Programming Languages
ABAP, ActionScript, Ada, Agda, AGS Script, Alloy, AmbientTalk, AMD GPU, AMPL, ANSYS Parametric Design Language, ANTLR, Apache Configuration, APL, AppleScript, Arc, Arduino, ASP, AspectJ, Assembly, Asymptote, Augeas, AutoHotkey, AutoIt, AWK, BC, Berry, BitBake, BlitzBasic, BlitzMax, Bluespec, BNF, Boo, Boogie, Brainfuck, BrightScript, Bro, BST, C, C#, C2HS Haskell, CADL, CapDL, Ceylon, Chapel, ChucK, Cirru, Click, Clojure, CMake, COBOL, COBOLFree, CoffeeScript, ColdFusion CFC, Common Lisp, C++, Crystal, Csound, Csound Score, CSS, CUDA, Cypher, Cython, Darcs Patch, Dart, DASM16, Debian Control File, DeviceTree, Diff, DM, Docker, Dockerfile, Dylan, EBNF, eC, Eiffel, Elixir, Elm, ELPi, Emacs Lisp, EmberScript, Erlang, Execline, F#, Factor, Fancy, Fantom, Felix, Fennel, Fish, Flux, Fortran, Fortran Fixed Form, FoxPro, FreeFem, FreeMarker, F*, Futhark, G-Code, GAP, GAS, GDScript, Genshi, Gentoo Ebuild, Gentoo Eclass, Gettext Catalog, GLSL, Glyph, Gnuplot, Go, Gosu, Grace, Gradle, Grammatical Framework, GraphQL, Graphviz DOT, Groff, Groovy, Groovy Server Pages, GSQL, Handlebars, Haskell, Haxe, HCL, HLSL, HTML, HTML Django, HTML ERB, HTML PHP, HTTP, Hy, Idris, IGOR Pro, Inform 6 Template, Inno Setup, Io, Isabelle, J, Jade, JAGS, Jasmin, Java, Java Server Pages, JavaScript, JavaScript MozPreproc, JCL, JFlex, JSON, JSONiq, JSX, Julia, Jupyter Notebook, K, Kconfig, Koka, Kotlin, KRL, Lean, Less, Lex, LFE, Lighttpd Configuration File, LilyPond, Limbo, Linker Script, Liquid, Literate Agda, Literate CoffeeScript, LLVM, Logtalk, LSL, Lua, M4, Makefile, Mako, Mason, MATLAB, Maxima, Meson, Metal, MiniScript, Mirah, Mizar, Modelica, Modula-2, Monkey, MooCode, MoonScript, Mosel, MQL, MUF, MuPAD, NASM, NCL, NetLinx, Nginx Configuration File, Nimrod, Ninja, Nit, Nix, NSIS, Nu, NuSMV, Objdump, Objective-C, Objective-C++, OCaml, Octave, Odin, OMG Interface Definition Language, ooc, Opa, OpenCL, OpenEdge ABL, OpenSCAD, Ox, Oz, Papyrus, Parrot Internal Representation, Pascal, PAWN, PEG, 
Perl, Perl 6, PHP, Pike, PkgConfig, POD, Pony, POV-Ray, PowerShell, Praat, Processing, Propeller Spin, Protocol Buffer, Pug, Puppet, PureBasic, PureScript, Python, Q, QML, QVTO, R, Racket, Ragel in Ruby Host, RAML, RConsole, Rd, REALbasic, ReasonML, Red, RenderScript, Ren’Py, REXX, RHTML, Ride, Robot Framework, Rouge, Ruby, Rust, S, Sage, SARL, SAS, Sass, Scala, Scheme, Scilab, SCSS, Self, Shell, ShExC, Sieve, Silver, Singularity, Slim, Smali, Smarty, Smithy, SMT, Solidity, SourcePawn, SPARQL, SQF, SQL, Squirrel, Stan, Standard ML, Stata, Stylus, SuperCollider, Swift, SWIG, SystemVerilog, Tcl, Tcsh, Tea, Terminfo, TeX, Thrift, Transact-SQL, Treetop, Turing, Twig, TypeScript, TypoScript, Unity3D Asset, Uno, UnrealScript, UrWeb, USD, Vala, VBScript, VCL, Velocity, Verilog, VHDL, VimL, Visual Basic, Vue, WebAssembly, Web IDL, Whiley, X10, XBase, XC, XML, XML Lasso, XQuery, XS, XSLT, Xtend, Xtlang, YANG, Zeek, Zephir, Zig, Zimpl