TACO: Topics in Algorithmic COde generation dataset
Abstract
We introduce TACO, an open-source, large-scale code generation dataset focused on algorithmic topics, designed to provide a more challenging training dataset and evaluation benchmark for code generation models. TACO consists of challenging, competition-level programming problems intended to train or evaluate problem understanding and reasoning in real-world programming scenarios. The training and test sets contain 25,443 and 1,000 coding problems respectively, with up to 1.55 million diverse solutions. Moreover, each TACO problem carries several fine-grained labels, such as task topics, algorithms, programming skills, and difficulty levels, providing a more precise reference for training and evaluating code generation models. The dataset and evaluation scripts are available on the Hugging Face Hub (https://huggingface.co/datasets/BAAI/TACO) and GitHub (https://github.com/FlagOpen/TACO).
1 Introduction
Code capability is one of the core competencies of foundation models, crucial for enhancing key skills such as reasoning and planning. In recent years, large language models (LLMs) based on the Transformer architecture Vaswani et al. (2017); Brown et al. (2020) have made significant advances in generating code from natural language descriptions, establishing themselves as the backbone for a myriad of code-related downstream tasks Chen et al. (2021); Nijkamp et al. (2022); Li et al. (2023); Rozière et al. (2023). By treating natural-language-to-code generation as a sequence transformation task, these models offer solutions to basic programming problems Austin et al. (2021). Nevertheless, they encounter substantial difficulties when addressing complex and novel programming challenges Hendrycks et al. (2021); Li et al. (2022); Liu et al. (2023). Recent evaluations indicate that GPT-4 performs remarkably poorly on programming challenges compared to other human-level benchmarks, achieving only a 7% success rate on LeetCode (hard) tasks OpenAI (2023).
With the rapid development of large language models and code generation models, mainstream code evaluation benchmarks have begun to reveal their limitations, struggling to comprehensively reflect the models’ performance and potential in real-world scenarios.
• Low Difficulty of Evaluation Tasks: Current mainstream benchmarks such as HumanEval/HumanEval-X and MBPP/MBXP focus primarily on basic programming problems, requiring the model to complete a function or a Text2Code task rather than to solve a real-world problem. Moreover, performance on these benchmarks is already high, with HumanEval's state-of-the-art (SOTA) model achieving 94.4 (pass@1) and MBPP's SOTA model at 81.1 (Acc). Consequently, the reference value of these evaluation results is gradually decreasing.
• Test Set Quality Issues, Doubts on Validity: Code execution and test cases are key to verifying the correctness of code. However, benchmarks like APPS and CodeContest still face issues with their test sets, such as the absence of manual answers and non-de-duplicated problem and answer sets. Additionally, the DeepMind paper "Competition-Level Code Generation with AlphaCode" pointed out that, due to insufficient test cases, code evaluation datasets can suffer from false positives: all test cases pass, but a manual review reveals an incorrect implementation, indicating that the model merely passed the few given test cases.
• Lack of Fine-Grained Indicators: Current code evaluation datasets lack fine-grained indicators for assessing coding ability, such as difficulty dimensions and algorithm dimensions (sorting, dynamic programming, etc.), and thus fail to provide targeted guidance for improving model code generation capabilities.
Therefore, we need a more challenging, higher quality, and more finely-grained code generation evaluation scheme to assess models’ code generation capabilities and to provide guidance for enhancing these capabilities.
Programming tasks usually exhibit pronounced heterogeneity; they possess disparate requirements, necessitate various problem-solving strategies, and employ diverse algorithms and data structures. In this paper, we propose a novel perspective by modeling program synthesis as a combination of multiple tasks Caruana (1997), treating different programming challenges as instance-level tasks. The foundation for this is the detailed annotation of code data with respect to algorithm topics. Existing code datasets either lack comprehensive annotations or provide them in an excessively generalized form. Yet algorithms are like the basic alphabet of programming, influencing every coding decision and laying the groundwork for higher-level abstractions. We suggest that enhanced algorithmic annotations have the potential to advance the domain of code understanding and generation, thereby benefiting the community. To this end, we have curated and publicly released a large code dataset named TACO (Topics in Algorithmic COde), collected from a variety of sources. TACO consists of 26,443 programming tasks spanning a wide range of subjects, including mathematics, data structures, and graph theory. Notably, TACO offers a rich correlation between programming tasks and their respective algorithms, and also exhibits richness in algorithm-related annotations, including aspects such as time complexity. Drawing upon the insights offered by Laaksonen (2017), we have meticulously revised and systematically categorized the raw tags from the original datasets into 36 primary algorithm topics, including Complete Search and Dynamic Programming. Additionally, we have supplemented these categories with annotations denoting foundational programming techniques, herein referred to as skills, to help models master the essential competencies required for solving a wide range of programming challenges. Given its comprehensive scope and diverse set of tasks, TACO serves as a valuable benchmark for evaluating a model's ability to address a wide array of programming challenges across various algorithmic topics.
2 Overview and Characteristics
The TACO dataset comprises problems sourced from publicly accessible datasets and a curated set of manually verified open-source problems. While C++ predominates in programming competitions, our data collection centered on Python 3 solutions. This decision is underpinned by Python's more restrictive syntactic rules relative to C++, which lead to cleaner and more discernible code structures. The TACO dataset encompasses not only algorithmic competencies but also multi-faceted metadata, including time and space constraints and timestamps, representing its most salient point of differentiation from extant datasets.
Following an exhaustive examination, we identified multiple shortcomings in existing open-source code generation datasets. These limitations include a restricted range of problems and an absence of metadata that elucidates the algorithmic skills required to solve the tasks. Within the realm of programming challenges, problems are often addressed by employing a confined yet essential repertoire of strategies, encompassing algorithms like dynamic programming, exhaustive search, and sorting techniques. Generally speaking, solutions to these challenges are an amalgamation of established methodologies and innovative insights.
To facilitate the more effective incorporation of algorithmic competencies into code generation tasks, we have developed the TACO dataset. This dataset offers the following distinct contributions and advantages in comparison to existing open-source code generation datasets:
Figure 1: An example of a classic algorithm problem (e.g., problem, solutions, and tags)
• More Extensive: The TACO dataset comprises 26,443 problems (25,443 in the training set and 1,000 in the test set) and 1,539,152 verified Python 3 code solutions. Initially, we amassed 2,045,502 code examples, a number reduced by a rigorous code deduplication process. On average, each problem features 58.21 correct code solutions (each problem in the training set has an average of 51.6 test cases; in the test set, 202.3).
• Algorithmic Categorization: The TACO dataset emphasizes the algorithmic labeling of problems encountered in programming challenges. Across all problems, we thoroughly summarized the 968 categories of algorithmic labels, ultimately consolidating them into 36 distinct algorithmic categories.
• Data Quality Verification: We performed a manual correctness assessment on a subset of the data collected for the TACO dataset. To ensure the accuracy of test cases and solutions within the dataset, we carried out comprehensive verification using unit tests.
• Performance Benchmarks: The benchmarks we propose primarily serve to assess the code generation efficacy of nl2code models. The test set offers exhaustive coverage of disparate algorithms and problems, aiming to gauge a model's performance across the spectrum of algorithmic skills pivotal to problem-solving.
3 Dataset Construction
The development of the TACO dataset is meticulously divided into two distinct phases: data collection and data processing. In the data collection phase, we focused on collecting raw data primarily from a diverse range of programming competition platforms. This phase was further enriched by integrating additional open-source datasets, thereby expanding the dataset’s scope and variety. Following the collection, the data processing phase began, involving a series of intricate steps. Initially, this phase entailed code decommenting and deduplication to ensure data quality and uniqueness. Subsequently, we augmented the test samples to enhance the robustness of the dataset. In this phase, we also engaged in the classification and generalization of tasks and skill sets, using insights gained from a variety of programming contest websites and open source code generation datasets. This comprehensive approach allowed us to establish a well-defined and detailed format schema for the TACO data set.
3.1 Data collection phase
During the assembly of the TACO dataset, we conducted an exhaustive review of various programming contest platforms as well as existing code generation datasets. Due to legal restrictions and the intricate nature of platform-specific API interfaces, our primary data sources were platforms such as CodeChef, CodeForces, HackerRank, and GeeksforGeeks. We also integrated existing datasets, including APPS, CodeContest, and Description2code. To construct and optimize the TACO dataset, we employed two primary methodologies: First, we utilized web scraping techniques to extract problem information from programming competition websites; second, we augmented existing open-source datasets with algorithmic skill labels and other relevant tags.
To guarantee the dataset's integrity and the comprehensiveness of its tagging metadata, we devised bespoke HTML parsers tailored to each of the four major programming contest platforms: CodeChef, CodeForces, HackerRank, and GeeksforGeeks. These parsers are engineered to accommodate the diverse syntactic layouts present in the problem texts of their source websites. Throughout the parser development phase, we confirmed the accuracy of each specialized parser via extensive manual sampling. On exceptional occasions, we also employed OCR tool APIs (e.g., SimpleTex) to facilitate the conversion of SVG-formatted information into LaTeX format. For a detailed discussion of the validation procedures and data scraping techniques used for these platforms, refer to Appendix A. Additionally, with regard to the algorithmic labeling designed to augment the APPS and CodeContest datasets, our HTML parsers similarly crawl algorithmic skill labels from sources such as LeetCode, Kattis, CodeWars, and CodeForces.
Figure 2: Composition analysis of the TACO dataset
Upon completing the assembly of the TACO dataset, we performed a rigorous analysis to assess the degree of overlap between TACO, APPS, and CodeContest in terms of problem counts. Initially, we matched problems based on their corresponding URLs. Subsequently, we expanded the matching to plaintext comparison, which entailed removing whitespace and newline characters. Our calculations revealed 3,392 duplicate problems shared between APPS and CodeContest. The degree of overlap between TACO and CodeContest is 0.4617, while the overlap coefficient with APPS is 0.3778. As depicted in Figure 2, 29.5% of the problems constituting the TACO dataset were sourced from our recent web crawling efforts; among the remaining problems, 24.3% are exclusive to the APPS dataset, 32.7% are unique to CodeContest, and the residual 13.4% are featured in both APPS and CodeContest.
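The reported coefficients are consistent with measuring the share of TACO problems that also appear in each source dataset (24.3% + 13.4% ≈ 0.3778 for APPS; 32.7% + 13.4% ≈ 0.4617 for CodeContest). Below is a minimal sketch of this matching, under the assumption that problems are keyed by normalized plaintext; the exact normalization and helper names are ours.

```python
import re

def normalize(problem_text: str) -> str:
    # Strip all whitespace and newlines for plaintext matching,
    # as described above.
    return re.sub(r"\s+", "", problem_text)

def overlap_share(taco: set, other: set) -> float:
    # Share of TACO problems also present in the other dataset.
    return len(taco & other) / len(taco)
```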
As detailed in Table 1, the TACO dataset includes data from multiple platform programming contests, listing the diverse data sources corresponding to each website. For websites with multiple sources of data, we have executed data merging and deduplication procedures. It is noteworthy that certain platforms, such as Aizu, AtCoder, and HackerEarth, already feature data integrated from CodeContests and Description2code, and include the requisite algorithmic tags; therefore, additional crawling efforts were deemed unnecessary for these sites.
Table 1: Websites and data sources in TACO
| Site | URL | Source | Web Crawler Details |
| --- | --- | --- | --- |
| Aizu | https://judge.u-aizu.ac.jp | CodeContests | None |
| AtCoder | https://atcoder.jp | APPS, CodeContests | None |
| CodeChef | https://www.codechef.com | CodeContests, CodeChef | Problem & solution; algo tags; timestamp; time, memory limit |
| Codeforces | https://codeforces.com | APPS, CodeContests, Codeforces | Problem & solution; algo tags; timestamp; time, memory limit |
| CodeWars | https://www.codewars.com | APPS | Algo tags |
| GeeksforGeeks | https://www.geeksforgeeks.org | GeeksforGeeks | Problem & solution; algo tags; time, space complexity |
| HackerEarth | https://www.hackerearth.com | Description2code | None |
| HackerRank | https://www.hackerrank.com | HackerRank | Problem & solution; algo tags |
| Kattis | https://open.kattis.com | APPS | Algo tags |
| LeetCode | https://leetcode.com | APPS | Algo tags |
3.2 Data processing phase
While we exerted considerable effort during the data collection phase to ensure the high fidelity of acquired problem information and Python code to their respective source websites, it is crucial to acknowledge that the majority of the code found on programming competition platforms was manually authored and submitted by participants. This inherently introduces the possibility of encountering annotated or duplicated code. To mitigate this challenge, we implemented a robust series of correctness processing and functional processing on the dataset.
3.2.1 Correctness processing
• Unit Test Validation: While the code harvested from programming contest platforms was ostensibly marked as "correct" by those sites, our examination revealed instances where some code was incorrect or failed to meet predefined criteria, such as time or space complexity, time or space limitations, or adherence to coding standards. To guarantee that the code within our dataset met a robust standard of correctness, we executed each solution against a comprehensive set of predefined unit tests assessing these criteria, omitting any solution that failed this evaluation.
• Conversion from Python 2 to Python 3: Upon incorporating a set of problems from HackerEarth, sourced from the Description2code dataset, we observed that the provided solutions were implemented in Python 2, which was incongruent with our focus on Python 3 code. To rectify this discrepancy, we employed Python's 2to3 library for automated code translation (see the sketch after this list). Solutions that could not be successfully converted to Python 3 were excluded, retaining only those that could be transitioned to the Python 3 format as part of the final dataset.
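As a concrete illustration of the conversion step, the sketch below uses the standard-library lib2to3 machinery behind the 2to3 tool; the compile check and the decision to drop failing solutions mirror the filtering described above, while the helper name is ours.

```python
from lib2to3.refactor import RefactoringTool, get_fixers_from_package

# Build a refactoring tool with the full set of standard 2to3 fixers.
tool = RefactoringTool(get_fixers_from_package("lib2to3.fixes"))

def convert_to_py3(source: str):
    """Return the Python 3 translation of a Python 2 solution,
    or None if it cannot be converted (such solutions are dropped)."""
    try:
        # refactor_string expects the source to end with a newline.
        tree = tool.refactor_string(source + "\n", "<solution>")
        converted = str(tree)
        compile(converted, "<solution>", "exec")  # sanity check: valid Python 3
        return converted
    except Exception:
        return None
```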
3.2.2 Functional processing
• Code De-duplication: Given that our data collection draws on solutions submitted by multiple users and integrates various open-source datasets, the likelihood of encountering similar or duplicate code within individual problems is elevated. To mitigate this issue, we employed a targeted deduplication strategy: we utilized the MinHash algorithm in conjunction with the Jaccard similarity index to achieve near-deduplication. A Jaccard similarity threshold was set at 0.85, MinHash-assisted clusters of duplicate code were generated, and these clusters were then condensed into unique code files based on the Jaccard similarity metric (a sketch follows this list).
• Code De-commenting: In the code submissions retrieved from programming contest platforms such as CodeChef, we observed a significant prevalence of comments. These comments, while ostensibly redundant, not only inflate the code's length but also introduce potential noise into subsequent analyses. To address this, we employed Abstract Syntax Tree (AST) parsing to remove them: the collected code is first converted to AST form, comment nodes are identified and removed, and the AST is then rendered back to source code (a sketch follows this list). Notably, the AST round trip also affords additional code formatting, such as the elimination of superfluous whitespace.
• Supplementation of Test Cases: Owing to the unavailability of hidden unit tests for crawling on programming websites, most problems within the APPS dataset are restricted to the one or two sets of unit tests provided in the problem description. This constraint not only undermines the precise evaluation of code correctness but also risks incorrect code passing these limited unit tests. According to AlphaCode, the APPS dataset exhibited a false positive (FP) rate of up to 60% when assessed with an average of 20 unit tests, signaling a lack of stringent evaluation criteria. To rectify this limitation, we augmented the problems with additional unit tests, elevating the average number of unit tests to over 200; AlphaCode's research indicates that the FP rate drops to approximately 4% once the average number of unit tests exceeds 200. Specifically, we used OpenAI's GPT-4 API to generate the input components of unit test pairs. Subsequently, 30 verified correct code samples were executed on these inputs to generate the corresponding outputs, which were then checked for consistency; input-output pairs with consistent outputs were considered valid unit tests (a sketch follows this list). This process was iterated until the total number of unit tests reached the minimum threshold of 200.
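The near-deduplication step can be sketched with the datasketch library; the threshold matches the 0.85 Jaccard cutoff above, while the permutation count and whitespace tokenization are our assumptions.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256  # number of MinHash permutations (assumed setting)

def minhash_of(code: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in code.split():  # simple whitespace tokenization (assumption)
        m.update(token.encode("utf8"))
    return m

def near_dedup(solutions):
    """Keep one representative per cluster of near-duplicate solutions."""
    lsh = MinHashLSH(threshold=0.85, num_perm=NUM_PERM)  # Jaccard threshold from the text
    kept = []
    for i, code in enumerate(solutions):
        m = minhash_of(code)
        if not lsh.query(m):  # no near-duplicate kept so far
            lsh.insert(str(i), m)
            kept.append(code)
    return kept
```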
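For the de-commenting step, a minimal sketch: Python's ast module discards `#` comments during parsing, so a parse/unparse round trip plus removal of free-standing string constants (docstrings) approximates the procedure described above (ast.unparse requires Python 3.9+).

```python
import ast

def strip_comments(source: str) -> str:
    """Remove comments via an AST round trip: '#' comments vanish at
    parse time, and free-standing string constants (docstrings) are
    filtered out explicitly. The round trip also normalizes whitespace."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        body = getattr(node, "body", None)
        if isinstance(body, list):
            node.body = [
                stmt for stmt in body
                if not (isinstance(stmt, ast.Expr)
                        and isinstance(stmt.value, ast.Constant)
                        and isinstance(stmt.value.value, str))
            ] or [ast.Pass()]  # keep syntactically required bodies non-empty
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```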
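The consistency check for GPT-4-generated inputs can be sketched as follows; the function name and the per-run timeout are our assumptions.

```python
import subprocess

def validate_input(candidate_input: str, verified_solutions, timeout=10.0):
    """Accept a generated input as a unit test only if all verified
    solutions run successfully and agree on the output."""
    outputs = set()
    for code in verified_solutions:  # e.g. 30 verified correct solutions
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                input=candidate_input,
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return None
        if result.returncode != 0:
            return None
        outputs.add(result.stdout)
    # A single distinct output means the solutions are consistent.
    return (candidate_input, outputs.pop()) if len(outputs) == 1 else None
```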
To meet our research objectives, we employed the aforementioned methodology and executed meticulous data processing on the multiple components of the TACO dataset, as delineated in Table 2. Specifically, we undertook a rigorous data-cleaning process for the newly acquired problems from CodeChef, CodeForces, HackerRank, and GeeksforGeeks. Likewise, we conducted an exhaustive cleaning procedure on problems obtained from open-source datasets—namely APPS, CodeContest, and Description2code—with the objective of eliminating any invalid information from the solutions, thereby enhancing the accuracy and reliability of the TACO dataset.
Table 2: Data processing details for each component of the TACO dataset
| Processing Step | Newly Crawled Data | APPS | CodeContest | Description2code |
| --- | --- | --- | --- | --- |
| Unit Test Validation | Yes | No | No | No |
| Python 2 to 3 Conversion | No | No | No | Yes |
| Code De-duplication | Yes | Yes | Yes | Yes |
| Code De-commenting | Yes | Yes | Yes | Yes |
Table 3: Comparative analysis of TACO and other datasets
| Dataset | APPS | CodeContests | TACO |
| --- | --- | --- | --- |
| Problems | 10,000 | 13,610 | 26,443 |
| Solutions | 232,421 | 1,502,532 | 1,539,152 |
| Algorithmic labeling | No | No | Yes |
| Test problems without Python solutions | 1,235/5,000 (24.7%) | 43/165 (26.1%) | 0/1,000 (0%) |
| Proportion of valid problems | 87.65% | 61% | 78% |
3.2.3 Algorithm and skill categorization
In the source programming problems dataset, we retained 968 original category labels spanning various domains. These labels were manually annotated by domain experts and offer the most granular level of categorization. Specifically, a label may describe either the subject matter of the problem (e.g., mathematics, geometry, graphs, or strings) or feasible solution strategies (e.g., dynamic programming, brute-force approaches, or binary search); a single problem may carry multiple such labels simultaneously. For a comprehensive view of the original label distribution across all problems in the dataset, refer to Figure 3. As the figure shows, the distribution of these raw labels is highly irregular and diverse. Consequently, it is crucial to systematically categorize and generalize these algorithmic labels.
Figure 3: Statistical distribution of the original source labels in the dataset
A usability analysis was undertaken to elucidate both the quantity and distribution of the raw labels. Our findings suggest that an excessive and intricate array of algorithmic labels could impede the model's training and inference. Consequently, we conducted a thorough generalization of the initial 968 category labels and, guided by three key considerations, ultimately reclassified all labels into 36 discrete categories.
• Coverage: The chosen tags must comprehensively encapsulate both the programming techniques and the algorithmic design methods pertinent to programming challenges.
• Balance Between Specificity and Generality: Overly specific labels could impede a comprehensive grasp of programming problems, whereas excessively general labels might fall short of furnishing detailed insights into the problem. Thus, judicious abstraction and generalization of the original tags are imperative.
• Label Quality: Following meticulous organization and generalization, the resulting set of labels should accurately and profoundly capture the quintessential attributes of programming problems and their corresponding solutions.
Figure 4: Statistical distribution of the 36 algorithm labels obtained by generalizing the dataset
As illustrated in Figure 4, we have consolidated the fine-grained labels into 36 coarse-grained algorithmic labels. To enhance the adaptability of the TACO dataset across various customized application scenarios, we keep each problem's original labels, algorithm labels, and skill labels in the dataset's metadata file; users may consult this content to reclassify or merge labels according to their specific requirements. Moreover, we use the eight basic techniques enumerated in the Competitive Programmer's Handbook as a benchmark for statistically analyzing the distribution of these foundational skills; these eight skill labels are a subset of the 36 algorithm labels. Quantitative details can be found in Figure 5.
Figure 5: Statistical distribution of the 8 skill labels obtained by generalizing the dataset
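For reference, the labels can be read directly from the Hugging Face release; this is a minimal sketch, and the field names used below (tags, skill_types, difficulty) are assumptions about the schema rather than confirmed identifiers.

```python
from datasets import load_dataset

# Dataset location is given in the abstract.
taco = load_dataset("BAAI/TACO", split="train")

sample = taco[0]
# Assumed field names; adjust to the actual schema if it differs.
print(sample.get("tags"))         # original / algorithm labels
print(sample.get("skill_types"))  # skill labels
print(sample.get("difficulty"))   # difficulty level
```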
4 Comparisons with Code Datasets
The principal code generation datasets commonly used in programming competitions include APPS and CodeContest. In the development of the TACO dataset, we utilized these two existing datasets as reference standards and implemented a series of enhancements and extensions to mitigate their limitations. Specifically, our cardinal contribution was the integration of systematic algorithmic annotations for each problem within the dataset. Concurrently, we observed that a notable proportion of problems within the APPS and CodeContest test sets lacked human-submitted Python solutions, a deficiency that hampers effective validation of model inference.
Regarding model validation, APPS principally employs GPT-2 and GPT-Neo as benchmark models, thereby neglecting an evaluation of the fine-tuning capabilities of contemporary code models, specifically in the context of programming challenges. In contrast, our TACO dataset comprises a suite of benchmarks generated through the fine-tuning of state-of-the-art models; notably, the most extensive among these employs the StarCoder model, which boasts 15 billion parameters (refer to Section 5.2 for further details).
For the three datasets TACO, APPS, and CodeContests, we counted the number of problems, the number of solutions, the presence or absence of algorithmic labels, and the number of test-set problems without Python solutions; these differences are enumerated in detail in Table 3. In addition, we provide more details on each dataset below:
APPS: The APPS dataset primarily originates from programming competition platforms, including CodeForces, LeetCode, and CodeChef, among others. Both the training and test sets comprise 5,000 problems. Notably, existing academic literature, particularly the AlphaCode study, highlights that the APPS dataset uses only the samples drawn from problem descriptions as hidden test cases, leading to an elevated false positive rate.
CodeContests: The CodeContests dataset encompasses an extensive array of popular programming languages, such as C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript. The dataset is partitioned into training, validation, and test sets, containing 13,328, 117, and 165 problems, respectively. In contrast to datasets that exclusively feature correct code, CodeContests incorporates a variety of erroneous code types.
TACO: The TACO benchmark suite encompasses a wide array of algorithmic programming challenges. Its primary objective is to rigorously assess a model's proficiency across the algorithmic competencies integral to problem-solving. For each problem within the TACO dataset, we offer comprehensive algorithmic labels. In contrast to APPS and CodeContests, TACO supports not only the Standard Input Format (as found in APPS) but also introduces a Call-Based Format. Furthermore, code examples are accompanied by detailed annotations, including original labels, algorithmic tags, skill categorization, time and space constraints, timestamps, and so forth.
5 Code Generation Evaluation
Table 4: Distribution of difficulty and programming skills in the TACO training and test sets
| Category | Type | Train | Test |
| --- | --- | --- | --- |
| Difficulty | Easy | 8904 | 200 |
| | Medium | 3244 | 200 |
| | Medium_Hard | 2745 | 200 |
| | Hard | 3162 | 200 |
| | Very_Hard | 2374 | 200 |
| | Unknown | 5014 | 0 |
| Skill | Range Queries | 220 | 73 |
| | Amortized Analysis | 554 | 70 |
| | Bit Manipulation | 814 | 90 |
| | Complete Search | 1886 | 154 |
| | Sorting | 2234 | 197 |
| | Dynamic Programming | 2585 | 179 |
| | Greedy Algorithms | 2649 | 209 |
| | Data Structures | 4695 | 241 |
| | Unknown | 14907 | 337 |
5.1 Experimental Setup
The fine-grained labels of the TACO dataset include four dimensions: task theme, algorithm tag, programming skill, and difficulty level. Considering the data volume in the test set and application scenarios, the evaluation of code generation uses two dimensions: programming skill and difficulty level. The distribution of the questions in the TACO dataset across different difficulties and programming skills is shown in Table 4.
Similar to standard code generation evaluations, for each programming problem we allow the model to generate 200 attempts, using pass@k (k = 1, 10, 100) as the evaluation metric. The hyperparameters of the generation model must therefore be set during evaluation. Through extensive generation experiments, we found that the model is highly sensitive to problem difficulty: more challenging problems require more diverse generated results to potentially produce code that passes the tests. Consequently, during testing we use a top_p setting of 0.95 and differentiate the temperature by difficulty level: 0.2 for easy problems, 0.6 for medium and medium_hard problems, and 0.8 for hard and very_hard problems.
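For reference, pass@k over n = 200 samples can be computed with the unbiased estimator popularized by the HumanEval work (Chen et al., 2021); a minimal sketch, where c is the number of samples that pass all unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    evaluated as a numerically stable product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 generations for one problem, 3 of which pass all tests.
print([round(pass_at_k(200, 3, k), 4) for k in (1, 10, 100)])
```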
For general code models, the model is first fine-tuned on the whole training set. Subsequently, code generation is run on the specified test set with varying random seeds. The generated code undergoes a common truncation scheme as post-processing, and the processed code is executed as a program against all test cases; a program that passes every case counts as a correct generation.
Since GPT-4 is not a purely code generation model, and considering cost factors, in our evaluation of GPT-4, we use the default settings of top_p=0.95 and temperature=0.7. We append ’Please write a python program.’ to the prompt after the question, generate only once, and extract the Python code block from the markdown as the generated program to calculate pass@1.
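The code-extraction step for GPT-4 replies can be sketched with a regular expression over the markdown fencing; the helper name is ours.

```python
import re

def extract_python_block(markdown_reply: str):
    """Return the first fenced Python code block from a markdown reply,
    or None if the reply contains no code block."""
    match = re.search(r"```(?:python)?\n(.*?)```", markdown_reply, re.DOTALL)
    return match.group(1).strip() if match else None
```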
5.2 Results
We selected several classic coding models for evaluation, including codellama-7b, codegen25-mono, starcoder-1b and starcoder-15.5b, and we also conducted tests on GPT-4 for comparison. Evaluations were carried out on 200 questions at five different difficulty levels, and the results are presented in Table 5:
Table 5: Code generation performance on TACO coding problems at different difficulty levels
| Model Name | Level | Temperature | Pass@1 | Pass@10 | Pass@100 |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | easy | 0.7 | 31.50 | - | - |
| | medium | 0.7 | 19.00 | - | - |
| | medium_hard | 0.7 | 13.00 | - | - |
| | hard | 0.7 | 4.50 | - | - |
| | very_hard | 0.7 | 2.00 | - | - |
| codellama-7b-python | easy | 0.2 | 9.32 | 15.15 | 19.99 |
| | medium | 0.6 | 2.38 | 8.28 | 17.51 |
| | medium_hard | 0.6 | 0.60 | 2.93 | 7.53 |
| | hard | 0.8 | 0.31 | 1.93 | 5.51 |
| | very_hard | 0.8 | 0.18 | 1.05 | 2.25 |
| codegen25-mono | easy | 0.2 | 9.06 | 16.12 | 22.52 |
| | medium | 0.6 | 2.42 | 7.83 | 17.23 |
| | medium_hard | 0.6 | 0.74 | 3.58 | 8.01 |
| | hard | 0.8 | 0.57 | 2.66 | 5.31 |
| | very_hard | 0.8 | 0.22 | 1.00 | 1.47 |
| starcoder-1b | easy | 0.2 | 8.95 | 12.01 | 15.51 |
| | medium | 0.6 | 2.22 | 5.32 | 11.16 |
| | medium_hard | 0.6 | 0.60 | 2.74 | 6.11 |
| | hard | 0.8 | 0.20 | 1.29 | 3.44 |
| | very_hard | 0.8 | 0.04 | 0.32 | 1.13 |
| starcoder-15.5b | easy | 0.2 | 11.60 | 18.44 | 23.92 |
| | medium | 0.6 | 3.23 | 8.96 | 19.32 |
| | medium_hard | 0.6 | 1.43 | 5.43 | 9.41 |
| | hard | 0.8 | 0.58 | 2.79 | 6.39 |
| | very_hard | 0.8 | 0.18 | 0.85 | 1.75 |
The TACO test set poses a significant challenge, with GPT-4 scoring only 31.5 in the pass@1 category for the easy level. Apart from GPT-4, the pass@1 scores of other coding models across all five difficulty levels are generally below 10, and their pass@100 scores are even lower than the pass@1 score of GPT-4.
To further explore the effects on different programming skills, as well as the role of the TACO training set, we conducted experiments on the starcoder-1b model. We used the starcoder-1b model to perform LoRA fine-tuning on the entire training set, resulting in the baseline LoRA model. Additionally, we separately conducted LoRA fine-tuning on subsets of the training set corresponding to different programming skills, resulting in 8 distinct skilled LoRA models. We evaluated both the baseline LoRA and skilled LoRA models, and the evaluation results are presented in the Table 6:
Table 6: Evaluation results (pass@1) for different skill types
| Skill Type | GPT-4 | Baseline LoRA | Skilled LoRAs |
| --- | --- | --- | --- |
| Amortized Analysis | 11.43 | 0 | 0 |
| Bit Manipulation | 10.00 | 1.1 | 1.11 |
| Complete Search | 14.29 | 0.32 | 0.32 |
| Data Structures | 13.28 | 1.4 | 1.55 |
| Dynamic Programming | 8.94 | 0 | 0 |
| Greedy Algorithms | 9.09 | 0 | 0.36 |
| Range Queries | 5.48 | 0 | 0 |
| Sorting | 10.66 | 0.7 | 1.5 |
Firstly, GPT-4 demonstrates comparable abilities across various programming skills, with pass@1 scores ranging between 5 and 10, which to some extent indicates the rationality of the skill categorization in the TACO dataset. Secondly, utilizing the TACO training set, which has fine-grained labels, can specifically enhance the performance of code generation models. For example, in Data Structures, Greedy Algorithms, and Sorting, starcoder-1b shows a clear advantage in performance when fine-tuned on specific skills using the TACO training set, compared to being fine-tuned on the entire training set.
6 Limitations
During the construction of the TACO dataset, some programming problems were collected from the APPS and CodeContest datasets and were subsequently reclassified. Therefore, after training code models using the TACO dataset, it is not suitable to use APPS and CodeContest as benchmarks to evaluate the code generation capabilities of the models. For assessing programming challenge capabilities, it is recommended to directly use the test set of the TACO dataset.
7 Other Application Scenarios
This section provides an investigation into the versatile utility of the TACO dataset across diverse application domains. The TACO dataset incorporates labels associated with algorithmic skills and is designed to assimilate these skills coherently into code generation frameworks. Specifically, we examine the dataset's prospective contributions to several areas, including code comprehension, education, algorithmic code recommendation, and applications rooted in Large Language Models (LLMs).
• Code understanding: The TACO dataset is well suited for training models specialized in code interpretation. Within the dataset, each programming challenge is paired with an associated algorithmic competency label. Through the established correlation between programming challenges and their algorithmic labels, a model can achieve a more nuanced understanding and interpretation of source code semantics. As an illustration, a model could autonomously generate annotations that encapsulate pertinent algorithmic metadata within the source code.
• Education and learning: In the educational domain, the TACO dataset holds promise for a diverse range of applications, including the creation of programming textbooks, curriculum development, and the assembly of problem repositories for practice. Learners can enhance their programming proficiency by tackling the intricate programming challenges in the dataset, each annotated with algorithmic skill labels; these labels facilitate a more profound understanding of the relationship between specific problems and their affiliated algorithms. Moreover, capitalizing on the metadata provided through these labels, educators can streamline the assessment of student assignments and projects, ensuring adherence to recommended algorithmic methodologies and established best practices.
• Code Algorithm Recommendations: The TACO dataset offers promising prospects for the establishment of an algorithmic recommendation system for coding, devised to expedite algorithm selection for software developers. Using the dataset's array of programming challenges and their associated skill labels, developers can efficiently identify algorithmic exemplars that align with the objectives of their ongoing projects, thereby resolving specific coding conundrums. Furthermore, code editing platforms can exploit the metadata embedded in the TACO dataset to furnish developers with more precise code auto-completion suggestions, encompassing a broad spectrum of algorithmic and data-structural options.
• Potential for application on LLMs: The TACO dataset holds considerable promise for applications grounded in Large Language Models (LLMs). The dataset serves multiple functions: it can expedite the software development process through the automated generation of task-specific code samples, and it can enhance the capability of LLMs to synthesize code annotations and documentation, bolstering both code comprehensibility and maintainability. Additionally, LLMs that leverage the TACO dataset can address queries on programming and algorithmic topics, providing developers and learners with comprehensive and accurate responses. Collectively, these applications contribute to improvements in programming efficiency, code quality, and the acquisition of programming knowledge, thereby stimulating ongoing advances in the fields of programming and algorithms.
References

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938, 2021.

Antti Laaksonen. Competitive Programmer's Handbook. Preprint, 5, 2017.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.

OpenAI. GPT-4 technical report. arXiv, abs/2303.08774, 2023.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, I. Evtimov, Joanna Bitton, Manish P Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code Llama: Open foundation models for code. 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Appendix A More Implementation Details
The principal challenge encountered during the construction of the TACO dataset involved the acquisition of high-quality open-source information. This information predominantly comprised problem descriptions, algorithmic skill labels, Python 3-compliant solutions, and associated test cases from various programming contest platforms. This section elaborates on the particular challenges and informational nuances we encountered during data acquisition across each of these competitive programming platforms.
• CodeChef: While crawling problem information from this platform, we discovered that identical content could be represented in multiple HTML syntax formats. To standardize parsing across these disparate formats, we conducted an extensive series of manual reviews and subsequently developed a specialized HTML parser for each format. These parsers manage an array of text formatting elements, such as line breaks and text styles, in addition to converting the HTML source for mathematical formulas into LaTeX syntax. In the end, we successfully crawled all of the website's problems up to April 2023.
• CodeForces: Given that the APPS and CodeContests datasets already encompass a broad spectrum of problems from this platform, we employed a time-split crawling strategy to maximize efficiency. For problems dated before January 2020, we restricted our crawl to URL, tag, and difficulty-level information; this subset served two purposes: matching against existing problem data and enriching the algorithmic labeling. Conversely, for problems released between January 2020 and April 2023, we conducted a comprehensive data crawl that included problems, solutions, input-output samples, timestamps, time limits, and memory limits. This newly acquired data was fully integrated into the TACO dataset.
• HackerRank: On the HackerRank platform, we faced multifaceted challenges, including crafting a platform-specific HTML parser and contending with a substantial volume of images in SVG format. These SVG images often depict textual characters and mathematical formulas, which complicates the direct extraction of textual data. To surmount this, we employed a two-stage processing approach: first, SVG vector images were transformed into PNG images featuring black text on a white background; second, the SimpleTex API was harnessed to transcribe the image content into LaTeX syntax (a sketch of the first stage follows this list). During our preliminary review of the TACO dataset, we identified approximately 30,000 SVG images within the four categories we needed, namely "Algorithm", "Data Structures", "Mathematics", and "Python", that required conversion to LaTeX format on the HackerRank platform. After weighing factors such as cost and accuracy across different OCR tools (e.g., Mathpix, Pix2Pix, SimpleTex), we selected SimpleTex's API for OCR deployment. It is pertinent to note that errors may occur in SimpleTex's transcriptions, particularly when the SVG images on HackerRank represent overly intricate mathematical formulas. To maintain the dataset's integrity and rigor, we manually verified, revised, and corrected the SVG segments of each problem in the HackerRank data. We crawled the problems from these four categories as of April 2023.
• GeeksforGeeks: GeeksforGeeks serves as a prominent platform for programming education, characterized by meticulously categorized problem content and a well-organized overall structure. However, the website employs various anti-crawling mechanisms, such as request frequency limitations, CAPTCHA validations, and IP address blocks, which presented significant impediments to our data extraction efforts. Moreover, we encountered frequent parsing errors due to inconsistent formatting and typographical mistakes in the website's problem text; these irregularities hampered our ability to accurately parse the time- and space-complexity metadata stored in the metadata.json file, as well as the input-output samples contained in the input_output.json file. To navigate these obstacles, we developed a suite of versatile regular-expression parsing templates engineered to extract maximal problem-related information while maintaining a high degree of accuracy. For data that resisted automated parsing, we conducted manual review and supplementation to ensure maximal fidelity to the original website content. We ultimately crawled 2,680 problems from the site through May 2023.
• CodeWars, Kattis, LeetCode: The problems presented on these three educational programming platforms are substantially covered by two pre-existing datasets, APPS and CodeContests. However, those datasets fall short of providing specific algorithmic labels ("tags"), a critical requirement for our research objectives. Consequently, we executed targeted data crawling on each of these websites, focusing primarily on acquiring the algorithmic tags and URL information for the challenges, and then integrated this newly acquired information into the existing open-source datasets. This augmentation allows a systematic algorithmic categorization of these challenges and thereby constitutes a pivotal component of the TACO dataset.
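The first stage of the HackerRank image pipeline (SVG to black-on-white PNG) can be sketched as below; the paper does not name the rasterizer, so the use of the cairosvg library here is an assumption, and the second OCR stage is only outlined because SimpleTex API credentials and payloads are account-specific.

```python
import cairosvg

def svg_to_png(svg_bytes: bytes, out_path: str) -> None:
    """Rasterize an SVG formula to a PNG with a white background,
    producing black text on white as described above."""
    cairosvg.svg2png(bytestring=svg_bytes, write_to=out_path,
                     background_color="white")

# Stage two would then POST the PNG to the SimpleTex OCR API and
# collect the returned LaTeX; request details are omitted here.
```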
While analyzing various programming websites, we observed that certain problems employed images to convey key information, such as graphs or tree structures in a pictorial format. Such information is challenging to accurately represent using LaTeX syntax and could result in incomplete or misinterpreted data when directly inputted into the model. To mitigate this issue, we introduced a new metadata attribute, picture_num, to the metadata.json file during the data crawling process for the websites CodeChef, CodeForces, HackerRank, and GeeksforGeeks. We tailored the HTML parser for each site to capture this attribute, which records the number of images present in the problem. The objective of incorporating this parameter is to preemptively address potential gaps in information attributable to image-based content during subsequent application phases.
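A minimal sketch of how picture_num could be populated during parsing; counting img and inline svg tags with BeautifulSoup is our assumption about the implementation.

```python
from bs4 import BeautifulSoup

def picture_num(problem_html: str) -> int:
    """Count the images embedded in a problem statement, to be stored
    as the `picture_num` attribute in metadata.json."""
    soup = BeautifulSoup(problem_html, "html.parser")
    return len(soup.find_all(["img", "svg"]))
```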