OpenWebMath: An Open Dataset of
High-Quality Mathematical Web Text
Abstract
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known publicly released web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
1 Introduction
Advances in large language models have opened up new opportunities in numerous fields, providing a transformative shift in our approach to a wide range of complex problems (Brown et al., 2020; Raffel et al., 2020). Among these problems, mathematical reasoning has drawn the attention of several researchers in recent years, becoming both a common benchmark to judge the performance of large language models and inspiring new approaches to improve their reasoning capabilities in the hope that they will one day be able to solve complex mathematical problems. One of the biggest advancements in mathematical reasoning in recent years has been the Minerva model (Lewkowycz et al., 2022), which achieved state-of-the-art results on quantitative reasoning benchmarks such as MATH (Hendrycks et al., 2021). Minerva was trained by finetuning PaLM (Chowdhery et al., 2022) on a curated dataset consisting of billions of tokens of high quality technical content sourced from both scientific papers and the web.
Minerva and the datasets used for its training were not released publicly, and the current capabilities of open-source models (e.g., Touvron et al. (2023b; c; a); Geng & Liu (2023); Biderman et al. (2023)) in quantitative reasoning lag behind. We believe that there are important research directions that can only be enabled through open access to such models and datasets, such as work on memorization and generalization, reinforcement learning, the development of new reasoning benchmarks, and advancement in the reasoning capabilities of language models.
In our work, we produce an open alternative to the Math Web Pages dataset used to train Minerva (Lewkowycz et al., 2022). We extract documents from Common Crawl (https://commoncrawl.org/), applying our pipeline to extract text while preserving mathematical content in the form of LaTeX equations. We then filter the documents, ensuring that only high-quality English mathematical documents are kept. Finally, we deduplicate the dataset, resulting in 14.7B tokens of high-quality mathematical content suitable for both pretraining and finetuning large language models. The key contributions of this work are as follows:
- We publicly release OpenWebMath, a dataset of 14.7B tokens of high-quality mathematical web text. Our dataset can be found at https://huggingface.co/datasets/open-web-math/open-web-math on the Hugging Face Hub.
- We extensively document our pipeline, sharing our findings with the NLP community. We open-source the code needed to reproduce our results.
- We analyze the quality of OpenWebMath. First, we analyze the contents of our dataset, providing statistics on the types of webpages, subjects, and top domains. Then, we train several language models on our dataset to show that, per token, it is more effective than existing mathematical pretraining datasets, and is most effective when combined with other datasets.
2 Related Work
2.1 Mathematics datasets and benchmarks
Mathematics datasets
Over the past couple of years, several datasets of mathematics have been introduced. AMPS, a dataset of informal mathematics, was introduced alongside the MATH dataset (Hendrycks et al., 2021). AMPS includes more than 100,000 Khan Academy problems with step-by-step solutions in LaTeX and over 5 million problems generated using Mathematica scripts. In total, AMPS contains 23GB of problems and solutions. Another notable example is NaturalProofs (Welleck et al., 2021), which encompasses 32,000 theorem statements and proofs, 14,000 definitions, and 2,000 other types of pages (e.g., axioms, corollaries) derived from ProofWiki, the Stacks project, and data from mathematics textbooks. Proof-Pile (Azerbayev et al., 2023) is a dataset of mathematical text that contains more than 14.5GB of informal mathematics texts obtained from arXiv, Stack Exchange, ProofWiki, Wikipedia, openly licensed books, and the MATH dataset. There are also many proprietary datasets for mathematics. WebMath is a large-scale dataset mentioned by OpenAI researchers (Polu & Sutskever, 2020) that contains a 35B token mix of content from GitHub, arXiv, and Math StackExchange, adding up to 35GB of informal mathematics. MathMix is another OpenAI dataset used to finetune GPT-4 (Lightman et al., 2023) that contains 1B high-quality mathematical tokens comprising both natural and synthetic data. The proprietary web dataset used to train Minerva, called Math Web Pages (Lewkowycz et al., 2022), was compiled by collecting 17.5B tokens from web pages that contain LaTeX code.
Mathematics benchmarks
Several popular benchmarks have been used by researchers to assess the capabilities of language models on both formal and informal mathematics. The MATH dataset (Hendrycks et al., 2021) comprises 12,500 challenging competition problems in informal language. Each problem is also accompanied by a step-by-step informal proof. Answers are delimited by the \boxed environment, allowing for easier answer verification. GSM8k (Cobbe et al., 2021) is another popular multi-step informal mathematics reasoning benchmark. It contains 8,500 grade school math problems that are intended to be solvable by a bright middle school student. Lewkowycz et al. (2022) also introduce a benchmark based on OpenCourseWare. OCWCourses includes a set of 272 automatically-verifiable solutions at the undergraduate level, covering chemistry, information theory, differential equations, special relativity, and more. Lewkowycz et al. (2022) also evaluate on a subset of MMLU (Hendrycks et al., 2020) called MMLU-STEM, which focuses on science, technology, engineering, and mathematics.
2.2 Web Data Processing Pipelines
The pretraining of large language models requires large, diverse datasets. Data scraped from the web is one of the primary sources for such data. However, sources such as Common Crawl, which contains over 200 billion web pages, are known to have significant amounts of low-quality and duplicate content, requiring extensive filtering and deduplication to be suitable for training. Prior works such as C4 (Raffel et al., 2020), RefinedWeb (Penedo et al., 2023), CCNet (Wenzek et al., 2019), The Pile (Gao et al., 2020), and GPT-3 (Brown et al., 2020) introduce various pipelines for extracting quality data from Common Crawl for the purposes of language model training. These pipelines typically consist of three primary steps: text extraction, filtering, and deduplication.
Text extraction
Extracting plain text from HTML files is a critical step in the creation of Common Crawl-based datasets. The easiest way to extract text from Common Crawl documents is to use the WET file corresponding to each webpage, which contains pre-extracted plain text. CCNet and C4 both use Common Crawl’s WET files. However, the text extracted in WET files may contain too much boilerplate or miss important content such as LaTeX equations. It is also possible to extract text directly from the raw HTML found in Common Crawl WARC files. The Pile uses an open source library called jusText (Endrédy & Novák, 2013) to extract text from HTML, while RefinedWeb uses a library called Trafilatura (Barbaresi, 2021). These text extraction approaches differ in terms of extraction speed, customization, and their precision and recall for removing boilerplate content.
Filtering
The first layer of filtering often involves language identification (Wenzek et al., 2019). Language filtering is used because certain other parts of the pipeline only work for specific languages, and it is often done with simple linear classifiers such as those from fastText (Joulin et al., 2016). Quality filtering can be done with a combination of perplexity-, classifier-, and rule-based methods. CCNet uses a 5-gram Kneser-Ney language model implemented in the KenLM library (Heafield, 2011) and trained on the target domain. The documents in the dataset are then sorted and filtered by their perplexity under this model. Other datasets, such as the one used to train GPT-3 (Brown et al., 2020), use a classifier-based approach. This involves training a classifier with known high-quality documents, such as those from Wikipedia, as positive examples and unfiltered documents from Common Crawl as negative examples. The classifier scores are used to filter low-quality documents from the dataset. Finally, rule-based approaches such as those used in C4 (Raffel et al., 2020) and MassiveWeb (Rae et al., 2021) involve removing pages with certain characters, too many or too few characters, too high a proportion of symbols, or an abnormal average word length. OpenWebMath uses a mixture of these three approaches.
Deduplication
Given the periodic nature of Common Crawl snapshots and a general redundancy in web-sourced text, deduplication is an important processing step. Document-level near-deduplication (e.g., in (Brown et al., 2020; Penedo et al., 2023)) often employs MinHashLSH, an efficient algorithm for estimating the Jaccard similarity of documents. CCNet (Wenzek et al., 2019) uses paragraph-level deduplication, which can help to remove common boilerplate content found in WET text-extractions.
3 Building OpenWebMath
3.1 Objectives
Our aim with OpenWebMath is to build a dataset of as many mathematical documents sourced from the web as possible while preserving the formatting of mathematical content such as LaTeX equations as in Lewkowycz et al. (2022). For the purposes of this work, we define a mathematical document as a document containing either core mathematical contents such as theorems, definitions, proofs, questions and answers, and formal mathematics, or interdisciplinary documents featuring mathematical formulas within fields like physics, chemistry, biology, economics, and finance. We source our documents from Common Crawl, which is a large open-access crawl of the web containing petabytes of raw HTML files. Due to the high variance in the quality of documents from Common Crawl, we additionally use several methods for filtering and boilerplate reduction. Throughout the creation of OpenWebMath, we iteratively refined these methods to ensure that we do not remove too many relevant documents, optimizing for high recall whenever possible. Since we expect that OpenWebMath will be used primarily as an additional source of pretraining data for large language models, we prefer having a small percentage of non-mathematical but high quality documents in the dataset rather than removing them and potentially losing relevant mathematical content. Finally, due to the limited amount of mathematical data available on the web, we use significantly more manual inspection and tuning of our processing pipeline than other web-based datasets. We document our processing choices and pipeline in the section that follows.
3.2 High-level overview of the pipeline
As shown in Figure 1, the processing pipeline for OpenWebMath falls into five stages. First, we apply a prefilter to all HTML documents in Common Crawl to quickly judge whether they have mathematical content, skipping those that do not before doing the extensive processing needed to extract text and equations and remove boilerplate. Second, we extract the text, including mathematical content, from the HTML documents. Third, we apply language identification filters, perplexity-based quality filtering, and a mathematical content classifier filter. Fourth, we deduplicate the dataset using SimHash (Manku et al., 2007). Finally, we manually inspect the documents gathered in the previous steps and view documents from the most popular domains by document-count and character-count, removing domains that are not high quality. We describe each of these steps in detail in the following sections.
3.3 Prefiltering
Since there are over 200B HTML documents in Common Crawl, applying our processing over each document would require a significant amount of compute. To improve the efficiency of the pipeline, we first apply a stack of pre-filters optimized for high recall to reduce the number of documents that need to be processed. Our first filters check for common mathematical strings as in Lewkowycz et al. (2022), such as the presence of tex classes, <math> tags, and the word “mathjax”. See Table 8 for a full list of terms. If none of these terms are present, we search for the presence of the top 100 most-popular LaTeX symbols in the text. This is done by first filtering for documents containing a backslash command using a simple regular expression and then searching specifically for these LaTeX symbols in the plain text from the HTML document. If none of these symbols are found, we run the plain text through our MathScore classifier (see section 3.5.1) and keep documents that exceed a confidence threshold of 0.8. By tuning these filters and using hierarchical layers of progressively more accurate but more expensive filters, we were able to reduce the compute needed to process the dataset by several times while retaining a high recall of relevant documents.
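A minimal sketch of such a hierarchical prefilter is shown below, assuming the plain text has already been extracted and a `math_score` callable wraps the classifier; the string and symbol lists are abbreviated stand-ins (the full list of terms appears in Table 8).

```python
import re

# Abbreviated cues; the full list of terms appears in Table 8.
MATH_STRINGS = ["mathjax", "<math", 'class="tex"', "class='tex'"]
# Stand-in for the top-100 most popular LaTeX symbols.
POPULAR_LATEX = [r"\frac", r"\sum", r"\int", r"\sqrt", r"\alpha", r"\cdot"]
HAS_BACKSLASH_COMMAND = re.compile(r"\\[a-zA-Z]+")

def prefilter(html: str, plain_text: str, math_score) -> bool:
    """Hierarchical prefilter: cheap checks first, classifier last."""
    lowered = html.lower()
    # Stage 1: common mathematical strings in the raw HTML.
    if any(s in lowered for s in MATH_STRINGS):
        return True
    # Stage 2: a backslash command, then specific popular LaTeX symbols.
    if HAS_BACKSLASH_COMMAND.search(plain_text):
        if any(sym in plain_text for sym in POPULAR_LATEX):
            return True
    # Stage 3: the (more expensive) MathScore classifier.
    return math_score(plain_text) > 0.8
```

Ordering the stages from cheapest to most expensive is what yields the several-fold reduction in compute described above.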
3.4 Text extraction
In contrast with prior works that extract text from Common Crawl such as C4 (Raffel et al., 2020), The Pile (Gao et al., 2020), and RefinedWeb (Penedo et al., 2023), we chose to make a mostly custom pipeline for extracting the main content from HTML documents. This is because we found that while other tools get decent performance on average over many documents on the internet, they do not work optimally on many of the most common sources of mathematical content on the web. We instead opted to build on top of Resiliparse (Bevendorff et al., 2018; 2021), a fast and efficient library built in Cython that includes performant tools for parsing HTML pages, processing their DOMs, and extracting the main content. As shown in Table 5 in the appendix, Resiliparse is significantly more efficient than alternative libraries such as jusText. Another notable part of our text extraction pipeline is that we randomize the parameters of the extraction to add diversity to the dataset. This includes randomizing whether we use a plain text or Markdown format for the documents and randomizing the number of boilerplate terms required to trigger a line being removed.
| Training Dataset | GSM8k | MATH Prealgebra | MATH Algebra | MATH Intermediate Algebra | MATH Counting & Probability | MATH Number Theory | MATH Precalculus | MATH Geometry |
|---|---|---|---|---|---|---|---|---|
| The Pile (14.7B tokens) | 2.2032 | 1.9127 | 1.9751 | 1.8420 | 1.8193 | 1.9227 | 1.6847 | 1.9499 |
| ProofPile (14.7B tokens) | 2.2350 | 1.7370 | 1.7214 | 1.5739 | 1.6462 | 1.7291 | 1.4838 | 1.7229 |
| OpenWebMath (14.7B tokens) | 1.9075 | 1.6285 | 1.6503 | 1.5949 | 1.6002 | 1.6894 | 1.4542 | 1.5748 |
| Mixture (14.7B tokens) | 1.8968 | 1.6055 | 1.6190 | 1.5301 | 1.5719 | 1.6607 | 1.4119 | 1.5599 |
| The Pile (300B tokens; Pythia 1.4B) | 1.9430 | 1.7117 | 1.7560 | 1.6358 | 1.6359 | 1.7460 | 1.5191 | 1.7252 |

Table 1: We trained 1.4B parameter models for 14.7B tokens on various datasets and measured their perplexity on different mathematical benchmarks. OpenWebMath and a 50/50 mixture of ProofPile (Azerbayev et al., 2023) and OpenWebMath perform well, outperforming Pythia 1.4B (Biderman et al., 2023), which was trained on 300B tokens of The Pile (Gao et al., 2020).
| Training Dataset | MATH Algebra-Easy | MATH Algebra-Easy maj@16 | LILA multiarith |
|---|---|---|---|
| The Pile (14.7B tokens) | 2.81% | 3.93% | 9.77% |
| ProofPile (14.7B tokens) | 2.81% | 3.93% | 8.04% |
| OpenWebMath (14.7B tokens) | 5.62% | 9.55% | 16.67% |
| Mixture (14.7B tokens) | 5.06% | 10.11% | 13.22% |
| The Pile (300B tokens; Pythia 1.4B) | 3.93% | 5.62% | 21.80% |

Table 2: Accuracies on different mathematical benchmarks.
Our text extraction pipeline consists of four stages: LaTeX extraction, text extraction, DOM processing, and line processing.
LaTeX Extraction
Lewkowycz et al. (2022) employ a relatively simple LaTeX extraction pipeline that extracts equations from <script type="math/latex">, <script type="math/asciimath">, and <math> blocks with <annotation encoding="application/x-tex"> blocks within them and replaces these tags with the extracted equations. When we applied these filters to documents from Common Crawl, we noticed an extremely low number of these tags compared to what was reported. We suspect that this is due to a difference between the HTML files available within Google (Lewkowycz et al., 2022) and those available on Common Crawl. The majority of the LaTeX on the internet is written using MathJax, where developers write equations delimited by dollar signs or other delimiters in their HTML pages and then the included javascript code replaces these equations with properly rendered LaTeX equations within the above script tags when the page is loaded. HTML documents on Common Crawl do not include the changes to the HTML that result from running javascript, requiring that we instead extract the LaTeX equations by finding delimiters ourselves. This is a significant challenge since we need to detect whether the page contains the required MathJax javascript code, which delimiters were chosen by the user to denote equations, and then match and extract the equations from the text on the page. See Appendix B for a more detailed discussion.
In order to extract MathJax, we first determine whether the page is importing the MathJax javascript code by searching for the word MathJax on the page. If it is not found, we additionally search for common LaTeX symbols, and if they are found, we treat the page as though it is running MathJax. We use regular expressions to search for code that calls the configuration function for MathJax to extract the delimiters used for equations. We add these delimiters to an extensive list of default delimiters and treat any content between these delimiters as LaTeX equations.
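A simplified sketch of this delimiter discovery follows; the configuration regex covers only the inlineMath key and is an illustrative stand-in for the more involved patterns used in our pipeline.

```python
import re

# MathJax's out-of-the-box inline delimiters; user configs extend these.
DEFAULT_INLINE_DELIMITERS = [(r"\(", r"\)")]

# Matches e.g. inlineMath: [['$','$'], ['\\(','\\)']] inside a config call.
INLINE_CONFIG = re.compile(
    r"inlineMath\s*:\s*\[\s*(\[.*?\](?:\s*,\s*\[.*?\])*)\s*\]", re.DOTALL)
DELIMITER_PAIR = re.compile(
    r"\[\s*['\"](.+?)['\"]\s*,\s*['\"](.+?)['\"]\s*\]")

def inline_delimiters(page_source: str):
    """Return inline math delimiter pairs configured on the page."""
    pairs = list(DEFAULT_INLINE_DELIMITERS)
    match = INLINE_CONFIG.search(page_source)
    if match:
        pairs.extend(DELIMITER_PAIR.findall(match.group(1)))
    return pairs
```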
In addition to extracting equations from MathJax, we found several more ways that LaTeX is encoded on the internet. These methods were discovered by filtering small portions of Common Crawl for documents that contain \frac, one of the most popular LaTeX commands, and making sure that our processing code supports all the different ways that math could be encoded. We found that LaTeX on the internet is encoded in the following ways:
1. equation and align environments.
2. The alttext of elements with special classes like tex.
3. Images from domains like latex.codecogs.com often include equations encoded in the URL.
4. Special wordpress plugins.
5. <math> tags with <annotation encoding="application/x-tex"> blocks within them.
6. <math> tags with MathML content. We use a style sheet to convert these equations into LaTeX.
7. MathJax equations encoded in the text of the page.
The relative frequencies of the different ways math is encoded can be found in Table 6 in the appendix.
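As a concrete example of case 3 in the list above, the LaTeX behind an equation image can often be recovered directly from its URL; a minimal sketch (the function name is ours, for illustration):

```python
from urllib.parse import unquote, urlparse

def latex_from_codecogs(img_src: str):
    """Recover LaTeX from image URLs such as
    https://latex.codecogs.com/gif.latex?%5Cfrac%7Ba%7D%7Bb%7D."""
    parsed = urlparse(img_src)
    if parsed.netloc.endswith("codecogs.com") and parsed.query:
        return unquote(parsed.query)  # the query string is the equation
    return None
```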
DOM Processing
After extracting the LaTeX equations from the HTML, we do several processing steps on the DOM-tree of the HTML document. This includes removing invisible elements based on their styles, removing buttons and link clusters, annotating code, tables, and headers, and removing known problematic elements based on class or ID.
Text Extraction
We use the extract_plain_text(main_content=True) method in Resiliparse (Bevendorff et al., 2018) to extract the main content text from the DOM, after several preprocessing steps that work around issues in its implementation which can make it overly aggressive when removing boilerplate.
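A condensed sketch of the DOM-processing and text-extraction steps, assuming Resiliparse's DOM API (HTMLTree.parse, query_selector_all, decompose); the selectors here are illustrative stand-ins for our iteratively refined element blocklist.

```python
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text

# Illustrative stand-ins for the real tag/class/id blocklist.
REMOVE_SELECTORS = ("nav", "footer", ".share-buttons", "#comments")

def extract_main_text(html: str) -> str:
    tree = HTMLTree.parse(html)
    # DOM processing: drop known-problematic elements by tag, class, or id.
    for selector in REMOVE_SELECTORS:
        for node in tree.body.query_selector_all(selector):
            node.decompose()
    # Text extraction: Resiliparse's boilerplate-aware main-content mode.
    return extract_plain_text(tree, main_content=True)
```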
Line Processing
After extracting the plain text on the page using Resiliparse, we apply our own processing to remove boilerplate lines based on an iteratively-refined set of common boilerplate phrases, remove empty headers, and escape dollar signs that are not part of LaTeX equations.
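A minimal sketch of this line-processing pass; the boilerplate phrases and the dollar-sign heuristic are simplified stand-ins for the iteratively refined versions we use.

```python
import re

# Illustrative entries; the real phrase list was refined iteratively.
BOILERPLATE_PHRASES = ("log in to reply", "share this post", "skip to content")
EMPTY_HEADER = re.compile(r"^#{1,6}\s*$")  # Markdown header with no title

def process_lines(text: str) -> str:
    kept = []
    for line in text.splitlines():
        low = line.strip().lower()
        if any(phrase in low for phrase in BOILERPLATE_PHRASES):
            continue
        if EMPTY_HEADER.match(line):
            continue
        kept.append(line)
    out = "\n".join(kept)
    # Escape dollar signs that are not LaTeX delimiters (e.g. "$5"),
    # a simplification of the real heuristic.
    return re.sub(r"\$(?=\d)", r"\\$", out)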
3.5 Filtering
We apply filtering with the goal of removing non-English documents (since our filtering pipeline is optimized for English), removing documents that are not mathematical, and removing low-quality documents that would be harmful to train a language model on. We apply the following filters in order (a sketch of the combined chain appears after the list):
1. We use a FastText language identification model (Joulin et al., 2016) to remove documents that are not in English.
2. We use our MathScore classifier (see section 3.5.1) to get a probability that the document is mathematical. If our previous extraction step found LaTeX equations, we keep documents with a probability of over 0.17. If no LaTeX equations were found, we keep documents with a probability of over 0.8.
3. We use a KenLM language model (Heafield, 2011) trained on ProofPile (Azerbayev et al., 2023) to get a perplexity score for each document. We remove documents with a perplexity score of more than 15,000.
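The sketch below strings these three filters together; it assumes fastText's publicly released lid.176.bin language-ID model and a KenLM model trained on ProofPile are available locally (both file paths are hypothetical), and takes the MathScore probability as an input.

```python
import fasttext
import kenlm

lang_id = fasttext.load_model("lid.176.bin")   # public fastText LID model
proofpile_lm = kenlm.Model("proofpile.arpa")   # hypothetical local path

def passes_filters(text: str, math_prob: float, found_latex: bool) -> bool:
    # 1. English only: fastText returns labels like '__label__en'.
    labels, _ = lang_id.predict(text.replace("\n", " "))
    if labels[0] != "__label__en":
        return False
    # 2. MathScore threshold depends on whether LaTeX was extracted.
    if math_prob <= (0.17 if found_latex else 0.8):
        return False
    # 3. Perplexity under the ProofPile-trained KenLM model.
    return proofpile_lm.perplexity(text) <= 15_000
```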
3.5.1 Math Score
During our filtering process, we train a model to predict the probability that a document is mathematical, which we call MathScore. We first gather a dataset of hundreds of thousands of documents extracted by our pipeline at an early stage of the project, and label them depending on whether they contain one of the top-100 most common LaTeX commands. We then remove any LaTeX code from the documents and train a classifier to predict whether the documents contain one of these common LaTeX commands. The training process for MathScore is depicted in Figure 4. Since we remove all LaTeX code from the features fed into the model, the model needs to learn the words and phrases most commonly associated with LaTeX content. We use FastText (Joulin et al., 2016) to train this model, and find based on manual inspection that content with a score under 0.2 is very unlikely to contain useful mathematical content.
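A minimal sketch of this training setup in fastText's supervised mode; the LaTeX-stripping regex, label names, and file names are simplified stand-ins for our actual setup.

```python
import re
import fasttext

LATEX = re.compile(r"\\[a-zA-Z]+|\$[^$]*\$")  # crude LaTeX/equation stripper

def to_example(text: str, has_common_latex_command: bool) -> str:
    """Build one fastText training line with the LaTeX removed."""
    label = "__label__math" if has_common_latex_command else "__label__other"
    stripped = LATEX.sub(" ", text).replace("\n", " ")
    return f"{label} {stripped}"

# 'mathscore_train.txt' holds one "__label__... text" example per line.
model = fasttext.train_supervised(input="mathscore_train.txt")
model.save_model("mathscore.bin")

# At inference, the probability of __label__math acts as the MathScore.
labels, probs = model.predict("we take the derivative of the polynomial")
```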
3.6 Deduplication
Due to the large amount of duplicate documents in Common Crawl, we apply a deduplication step to remove near-duplicate documents. We use the SimHash implementation from text-dedup (Mou et al., 2023) to deduplicate the dataset using a threshold of 0.7. We find that this threshold is high enough to remove most duplicate documents even if they have slight differences in their texts.
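For illustration, a self-contained word-unigram SimHash is sketched below; text-dedup's implementation differs in its tokenization and indexing, and here the 0.7 threshold is interpreted as the fraction of matching bits between two fingerprints.

```python
import hashlib
import re

BITS = 64

def simhash(text: str) -> int:
    """64-bit SimHash over lowercased word unigrams."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def bit_similarity(a: int, b: int) -> float:
    """Fraction of matching bits; pairs above the threshold are near-dupes."""
    return 1.0 - bin(a ^ b).count("1") / BITS
```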
3.7 Manual Inspection
Finally, we manually inspect the top domains by document count, the top domains by character count, and the longest documents in the dataset to ensure that the documents are high quality. We remove domains that are low quality or clearly not mathematical by adding them to a blacklist, and we add domain-specific filters, such as removing user profile pages, abstract-hosting websites (as in Lewkowycz et al. (2022)), and search result pages.
4 Dataset Analysis
Token count
At 14.7B tokens, OpenWebMath is just below the size of Minerva’s Math Web Pages (17.5B tokens; Lewkowycz et al., 2022) and significantly larger than the web part of any other dataset. OpenWebMath has around the same number of LLaMA tokens as ProofPile (14.2B; Azerbayev et al., 2023), but we note that there is very little overlap between the two datasets. As a result, OpenWebMath brings a large number of new mathematical tokens that were previously unavailable to the open-source community. Due to differences in data curation strategies, it is hard to compare these datasets other than by training models on them. Since not much is known about how to properly filter a dataset, we opted to keep as much relevant content as possible. However, future work could explore filtering OpenWebMath more aggressively to further improve its quality.
Data Composition
We measured the distribution of domains in OpenWebMath both by document and by character count. Table 3 and Table 4 show the top twenty most common domains by document and character count, respectively. The most common sources of data tend to be discussion forums, blog posts, and scientific papers. We find that the characters in the dataset are spread over 131,206 domains, with 46% of the characters appearing in the top 100 domains.
| Domain | # Documents | % Documents |
|---|---|---|
| stackexchange.com | 1,136,407 | 17.99% |
| physicsforums.com | 300,044 | 4.75% |
| mathhelpforum.com | 170,721 | 2.70% |
| socratic.org | 133,983 | 2.12% |
| mathoverflow.net | 120,755 | 1.91% |
| gradesaver.com | 96,100 | 1.52% |
| zbmath.org | 91,939 | 1.46% |
| wordpress.com | 87,876 | 1.39% |
| github.io | 81,125 | 1.28% |
| brilliant.org | 68,573 | 1.09% |
| gamedev.net | 50,560 | 0.80% |
| openstudy.com | 49,041 | 0.78% |
| gmatclub.com | 48,812 | 0.77% |
| blogspot.com | 48,036 | 0.76% |
| wikipedia.org | 46,606 | 0.74% |
| ac.uk | 41,342 | 0.65% |
| nature.com | 37,403 | 0.59% |
| aimsciences.org | 36,368 | 0.58% |
| libretexts.org | 32,216 | 0.51% |
| readthedocs.io | 31,455 | 0.50% |

Table 3: Most common domains by document count.
| Domain | # Characters | % Characters |
|---|---|---|
| stackexchange.com | 4,655,132,784 | 9.55% |
| nature.com | 1,529,935,838 | 3.14% |
| wordpress.com | 1,294,166,938 | 2.66% |
| physicsforums.com | 1,160,137,919 | 2.38% |
| github.io | 725,689,722 | 1.49% |
| zbmath.org | 620,019,503 | 1.27% |
| wikipedia.org | 618,024,754 | 1.27% |
| groundai.com | 545,214,990 | 1.12% |
| blogspot.com | 520,392,333 | 1.07% |
| mathoverflow.net | 499,102,560 | 1.02% |
| gmatclub.com | 442,611,169 | 0.91% |
| gamedev.net | 426,478,461 | 0.88% |
| ac.uk | 402,111,665 | 0.83% |
| aimsciences.org | 344,716,386 | 0.71% |
| mathhelpforum.com | 319,215,756 | 0.65% |
| deepai.org | 313,512,520 | 0.64% |
| libretexts.org | 282,014,149 | 0.58% |
| readthedocs.io | 269,816,413 | 0.55% |
| tib.eu | 199,714,017 | 0.41% |
| mit.edu | 198,487,362 | 0.41% |

Table 4: Most common domains by character count.
In order to get a sense of the types of documents found in the dataset, we analyzed 100,000 randomly sampled documents. First, we created embeddings of this data using all-MiniLM-L12-v2 (Wang et al., 2020) in SentenceTransformers (Reimers & Gurevych, 2019). Then, we clustered these embeddings using k-Means. Finally, we took the five closest documents to each cluster center and asked gpt-3.5-turbo (https://platform.openai.com/docs/api-reference) to classify each cluster as Math, Physics, Statistics, Chemistry, Economics, Computer Science, or Other. We then aggregated these statistics, using the size of each cluster to estimate the final number of documents in each category. We note several potential issues with this methodology, including inaccuracies stemming from using an LLM for classification, and the potential that not every document within a cluster belongs to the predicted category. Figure 2 shows the results of this analysis. The majority of the documents in the dataset are directly related to mathematics, while the rest are spread throughout physics, computer science, statistics, chemistry, and economics, with 12% of documents not falling neatly into any of these categories.
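A condensed sketch of this embedding-and-clustering step; the two sample documents and the cluster count are placeholders (the value of k used in the analysis is not restated here).

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "Prove that the sum of two even integers is even.",
    "The equilibrium constant is computed from the reaction quotient.",
]  # placeholder sample; the real analysis used 100,000 documents

embeddings = SentenceTransformer("all-MiniLM-L12-v2").encode(docs)
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)  # placeholder k

# For each cluster, the five documents nearest the centroid would then be
# sent to gpt-3.5-turbo for subject classification.
for c in range(kmeans.n_clusters):
    dists = np.linalg.norm(embeddings - kmeans.cluster_centers_[c], axis=1)
    print(c, np.argsort(dists)[:5])  # indices of the five nearest documents
```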
We also used GPT to analyze the types of websites found in OpenWebMath. To do this, we took a sample of 200 documents and asked gpt-3.5-turbo to classify each as a Forum, Paper, Blog, Reference, Educational, or Other. We also gave the document URL as a feature, since we found GPT is often able to judge the topic from the URL alone. We validated our analysis by asking GPT to do this classification on the top 100 domain names and got similar results. Figure 2 shows the results. The highest proportion of documents are forum pages, where users ask and answer questions related to mathematical subjects. There is also a large proportion of educational and reference content.
Downstream Performance
We ran experiments to find out how our dataset compares to other language modeling datasets. We compare models trained on OpenWebMath for a single epoch (14.7B tokens) with models trained for the same number of tokens on The Pile (Gao et al., 2020), a general language modeling dataset, and ProofPile (Azerbayev et al., 2023), a dataset of both formal and informal mathematics. We also train on a 50/50 mixture of ProofPile and OpenWebMath to evaluate the performance of OpenWebMath when included in a mixture of other datasets, as would be common in practice.
We train randomly initialized models with the same architecture as Pythia 1.4B (Biderman et al., 2023). We use a batch size of 1M tokens and the same hyperparameters as Pythia otherwise. These models are evaluated on a collection of mathematics benchmarks which show signal on models of this size. This includes the subset of level-1 algebra questions from MATH, LILA-multiarith to test coding ability, and GSM8k and MATH perplexities, which scale more smoothly than accuracies. We also compare to Pythia 1.4B (Biderman et al., 2023), which was trained on 300B tokens of The Pile (Gao et al., 2020) with the same architecture.
Table 1 shows the results of our perplexity evaluations. There is a clear performance lead for models trained on OpenWebMath, and the mixture performs best. Despite Pythia being trained on over 20x the number of tokens, our models far exceed its performance on the perplexity benchmarks, showing the potential of domain-specific models for mathematics. Similarly, Table 2 shows the performance of the models on MATH-Algebra-Easy and LILA-multiarith (Mishra et al., 2022). OpenWebMath models outperform models that were not trained on it by a significant margin.
5 Conclusion
In this paper, we describe OpenWebMath, an open dataset of 14.7B tokens of high-quality mathematical documents from the web. We extensively document our pipeline, including several novel methodologies for extracting LaTeX formulas, reducing boilerplate, and filtering the dataset. OpenWebMath consists of high quality Q&A forum posts, educational documents, blogs, and more spread across mathematics, physics, computer science, and other technical domains. We also train several models on OpenWebMath and other language modeling datasets to compare the downstream performance achievable by training on our dataset. Notably, we find that models trained on OpenWebMath outperform models trained on 20x more general-domain tokens in mathematics. We hope that OpenWebMath can lead to the creation of language models with improved mathematical reasoning capabilities.
Acknowledgements
JB is supported by NSERC Grant [2020-06904], CIFAR AI Chairs program, Google Research Scholar Program, and Amazon Research Award. KP is supported by an NSERC PGS-D award. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, Fujitsu Limited, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners). Computing resources for model training were provided by EleutherAI and Brigham Young University. We thank Finn Paster for the graphic design for the logo. We additionally thank Ziming Chen, Yuhuai Wu, Stella Biderman, Aviya Skowron, Hailey Schoelkopf, and Sean Welleck for their helpful comments.
References
- Andonian et al. (2023) Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Jason Phang, Shivanshu Purohit, Hailey Schoelkopf, Dashiell Stander, Tri Songz, Curt Tigges, Benjamin Thérien, Phil Wang, and Samuel Weinbach. GPT-NeoX: Large scale autoregressive language modeling in PyTorch. GitHub Repo, 9 2023. URL https://www.github.com/eleutherai/gpt-neox.
- Azerbayev et al. (2023) Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023.
- Barbaresi (2021) Adrien Barbaresi. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131. Association for Computational Linguistics, 2021. URL https://aclanthology.org/2021.acl-demo.15.
- Bevendorff et al. (2018) Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In Leif Azzopardi, Allan Hanbury, Gabriella Pasi, and Benjamin Piwowarski (eds.), Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Berlin Heidelberg New York, March 2018. Springer.
- Bevendorff et al. (2021) Janek Bevendorff, Martin Potthast, and Benno Stein. FastWARC: Optimizing Large-Scale Web Archive Analytics. In Andreas Wagner, Christian Guetl, Michael Granitzer, and Stefan Voigt (eds.), 3rd International Symposium on Open Search Technology (OSSYM 2021). International Open Search Symposium, October 2021.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022. doi: 10.48550/arXiv.2204.02311. URL https://doi.org/10.48550/arXiv.2204.02311.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Collins et al. (2023) Katherine M Collins, Albert Q Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B Tenenbaum, William Hart, et al. Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694, 2023.
- Endrédy & Novák (2013) István Endrédy and Attila Novák. More effective boilerplate removal-the goldminer algorithm. Polibits, 48:79–83, 12 2013. doi: 10.17562/PB-48-10.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Gebru et al. (2021) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets, 2021.
- Geng & Liu (2023) Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama.
- Heafield (2011) Kenneth Heafield. Kenlm: Faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation, pp. 187–197, 2011.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. CoRR, abs/2103.03874, 2021. URL https://arxiv.org/abs/2103.03874.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
- Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. CoRR, abs/2305.20050, 2023. doi: 10.48550/arXiv.2305.20050. URL https://doi.org/10.48550/arXiv.2305.20050.
- Manku et al. (2007) Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 141–150, New York, NY, USA, 2007. Association for Computing Machinery. ISBN 9781595936547. doi: 10.1145/1242572.1242592. URL https://doi.org/10.1145/1242572.1242592.
- Mishra et al. (2022) Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, et al. Lila: A unified benchmark for mathematical reasoning. arXiv preprint arXiv:2210.17517, 2022.
- Mou et al. (2023) Chenghao Mou, Chris Ha, Kenneth Enevoldsen, and Peiyuan Liu. Chenghaomou/text-dedup: Reference snapshot, September 2023. URL https://doi.org/10.5281/zenodo.8364980.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data, and web data only. CoRR, abs/2306.01116, 2023. doi: 10.48550/arXiv.2306.01116. URL https://doi.org/10.48550/arXiv.2306.01116.
- Polu & Sutskever (2020) Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew J. Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/arXiv.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023b.
- Touvron et al. (2023c) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023c. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
- Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020.
- Welleck et al. (2021) Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. Naturalproofs: Mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112, 2021.
- Wenzek et al. (2019) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.
Appendix A Limitations and Future Work
Despite the high quality of OpenWebMath, we note several limitations and avenues for future works. First, due to the high cost of extracting data from all shards on Common Crawl, we were only able to run our pipeline once. Therefore, many of our choices are without empirical justification and we provide no ablation study. We also note that the nature of this particular type of dataset means that there are many subjective choices to be made. For instance, what counts as a mathematical document? What is a high-quality document? How do we choose the threshold for near-deduplication? For each of these, we chose several values and manually inspected a few examples to choose. Due to the cost constraints, there are also practical challenges with balancing cost with accuracy when filtering and extracting text. For instance, our prefilter reduces the number of HTML documents processed to under 1% of the documents in Common Crawl, which may be too aggressive. We also note that OpenWebMath is an English-only dataset, which limits its applications for researchers and users who speak other languages. Finally, we note that OpenWebMath only contains the text from math on the web, not associated figures, which can be important for solving mathematical problems (OpenAI, 2023). Future work should focus on finding empirical answers to the questions of what constitutes good data, creating new, efficient filtering methodologies, and extracting images inline with math text.
Appendix B Text Extraction
Table 5: Runtime comparison of HTML text-extraction libraries.
Method | Runtime (s) | Source Code Link |
Resiliparse | 3.99 | https://github.com/chatnoir-eu/chatnoir-resiliparse |
HTML-Text | 10.75 | https://github.com/TeamHG-Memex/html-text |
Inscriptis | 19.14 | https://github.com/weblyzard/inscriptis |
BoilerPy | 24.94 | https://github.com/jmriebold/BoilerPy3 |
jusText | 31.17 | https://github.com/miso-belica/jusText |
HTML2Text | 37.17 | https://github.com/Alir3z4/html2text/ |
BeautifulSoup | 38.42 | https://code.launchpad.net/beautifulsoup |
Trafilatura | 63.90 | https://github.com/adbar/trafilatura |
ExtractNet | 299.67 | https://github.com/currentslab/extractnet |
Choice of Base Text Extractor
When considering which HTML text-extraction library to use, we weighed the efficiency, customizability, and built-in boilerplate removal of each option. The most common approach, using the WET files provided by Common Crawl, was ruled out because they do not handle LaTeX correctly and offer no customization. Other options, such as jusText (Endrédy & Novák, 2013), used in The Pile (Gao et al., 2020), removed boilerplate too aggressively, causing sections containing math to be discarded. Likewise, Trafilatura (Barbaresi, 2021), which was used in RefinedWeb (Penedo et al., 2023), had poor efficiency. We decided on Resiliparse (Bevendorff et al., 2018) due to its balanced boilerplate removal, fast runtime, and efficient Common Crawl parsing tools. Table 5 shows the full results of our comparison.
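As a point of reference, the snippet below sketches how Resiliparse can be invoked for this kind of extraction. It is a minimal sketch rather than our pipeline code; the option names follow the resiliparse.extract.html2text documentation, and the flags we actually used may differ.

```python
# Minimal sketch of HTML-to-text extraction with Resiliparse.
# The flags here are illustrative; the actual pipeline adds custom
# handling for LaTeX before boilerplate removal.
from resiliparse.extract.html2text import extract_plain_text

def extract_text(html: str) -> str:
    return extract_plain_text(
        html,
        main_content=True,         # apply built-in boilerplate removal
        preserve_formatting=True,  # keep paragraph and list structure
        alt_texts=True,            # keep alt text (e.g., math images)
        links=False,               # drop hyperlink targets
        comments=False,            # drop HTML comments
    )
```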
Table 6: Breakdown of math formats found in HTML documents.
Math Format | Percentage of Documents |
Found at least one instance of math | 91.42% |
MathJax with delimiters (inline) | 50.27% |
MathJax with delimiters (display) | 23.37% |
Math found in images | 6.96% |
.math-container | 3.94% |
MathML code | 3.28% |
<annotation> within <math> tags | 2.35% |
<mathjax> tags | 2.24% |
align environments | 1.72% |
equation environments | 1.18% |
within <script> tags | 1.01% |
alttext property of <math> tags | 0.24% |
LaTeX Extraction
LaTeX code appears in many forms throughout Common Crawl HTML files, so we employed an iterative process to refine our extraction rules. First, we filtered shards of Common Crawl for documents containing the string \frac. We then inspected the documents for which our extraction code found no extractable LaTeX, and refined our rules to cover additional sources of math until we were confident that we had reasonable support for all formats of LaTeX in HTML documents. Table 6 shows the breakdown of the common types of LaTeX found in HTML documents.
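To make the flavor of these rules concrete, the sketch below checks a parsed page for several of the formats listed in Table 6. It is an illustrative approximation using BeautifulSoup rather than our actual extraction code; the selectors simply mirror the table entries.

```python
# Rough sketch of per-format math detection, mirroring the categories
# in Table 6. The real extraction rules are more involved.
import re
from bs4 import BeautifulSoup

def detect_math_formats(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text()
    return {
        "mathjax_inline": bool(re.search(r"\\\((.+?)\\\)", text, re.S)),
        "mathjax_display": bool(re.search(r"\\\[(.+?)\\\]", text, re.S)),
        "math_container": bool(soup.select(".math-container")),
        "mathml": bool(soup.find("math")),
        "annotation_in_math": bool(soup.select("math annotation")),
        "mathjax_tags": bool(soup.find("mathjax")),
        "align_env": r"\begin{align" in text,
        "equation_env": r"\begin{equation" in text,
        "script_math": any("math" in (s.get("type") or "")
                           for s in soup.find_all("script")),
        "alttext": bool(soup.find("math", attrs={"alttext": True})),
    }
```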
We note that most of the LaTeX in OpenWebMath and across the internet is encoded using MathJax, which presents a challenge. The majority of MathJax documents use dollar sign delimiters, but most dollar signs on the web do not delimit LaTeX equations. This leaves us with a few options:
• Detect the use of the MathJax script in the HTML file. If the script is imported, treat dollar signs as LaTeX code.
• Detect common LaTeX commands between dollar signs. If they are present, treat dollar signs as LaTeX code.
• Use the MathScore classifier to determine whether the page appears to be about math. If so, treat dollar signs as LaTeX code.
The first option is not always accurate, since the MathJax JavaScript code may be nested inside another import or named differently depending on the website. The latter two options cover many of these cases, but can fail to detect edge cases where math equations are present yet the surrounding text does not indicate that the document is mathematical. We suspect Minerva (Lewkowycz et al., 2022) avoids this issue by using HTML documents where the JavaScript has already been executed, in which case MathJax is converted from delimited text into explicit HTML tags that are easy to detect.
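A condensed sketch of how these three signals can be combined is shown below. The math_score argument stands in for the output of the MathScore classifier, and the 0.8 threshold is illustrative rather than the value used in our pipeline.

```python
import re

# Common LaTeX commands that rarely appear in non-math dollar spans.
LATEX_COMMANDS = re.compile(
    r"\\(frac|sum|int|alpha|beta|sqrt|mathbb|begin|end)\b")
DOLLAR_SPAN = re.compile(r"\$(.+?)\$", re.S)

def treat_dollars_as_latex(html: str, text: str,
                           math_score: float,
                           threshold: float = 0.8) -> bool:
    # 1. MathJax script imported on the page (crude check).
    if "mathjax" in html.lower():
        return True
    # 2. A dollar-delimited span containing a common LaTeX command.
    for span in DOLLAR_SPAN.findall(text):
        if LATEX_COMMANDS.search(span):
            return True
    # 3. Fall back to the page-level math classifier score.
    return math_score >= threshold
```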
Table 7: Model hyperparameters.
Model Size | Layers | Model Dim | Heads | Learning Rate | Batch Size |
1.4B | 24 | 2048 | 16 | | 1M |
Math Keywords |
MathJax |
mathjax |
<math |
math-container |
katex.min.css |
latex.php |
codecogs |
tex.cgi |
class="tex" |
class=’tex’ |
Appendix C Interplay Between Extraction and Filtering
In prior works, we noticed many cases where suboptimal HTML text extractors were used and yet text quality remained high in the resulting dataset. This is due to the interplay between extraction and filtering: if a text extractor fails to extract the main text, gets the formatting wrong, or includes too much boilerplate, then both the classification and perplexity filters can remove such examples. This can introduce subtle biases into the dataset, where specific poorly-extracted websites are excluded entirely even though they contain high-quality content. When building a mathematical dataset, failure to extract and handle inline LaTeX code properly can hurt perplexity scores and cause these documents to be filtered out. We suggest practitioners tune their text-extraction pipeline on a diverse set of documents before applying filtering to avoid this bias.
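One simple diagnostic for this failure mode, assuming the filtering pipeline logs each document's URL and filter decision, is to look for domains with unusually high rejection rates, which often point to extraction failures rather than genuinely low-quality content. A minimal sketch:

```python
# Per-domain rejection rates as an extraction-bias diagnostic.
# Assumes (url, passed_filters) pairs were logged by the pipeline.
from collections import Counter
from urllib.parse import urlparse

def rejection_rates_by_domain(records):
    seen, rejected = Counter(), Counter()
    for url, passed in records:
        domain = urlparse(url).netloc
        seen[domain] += 1
        if not passed:
            rejected[domain] += 1
    # Ignore rare domains where the rate estimate is noisy.
    return {d: rejected[d] / seen[d]
            for d in seen if seen[d] >= 100}
```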
Appendix D Model Hyperparameters
We trained models on 14.7B tokens using the LLaMA (Touvron et al., 2023c) tokenizer and the architecture described in Pythia (Biderman et al., 2023). We trained the models using the GPT-NeoX library (Andonian et al., 2023) on 8 A100 80GB GPUs. Exact hyperparameters can be found in Table 7.
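For a rough sense of the run length these settings imply, assuming the 1M batch size in Table 7 denotes tokens per optimizer step:

```python
# Back-of-envelope run length from Table 7, assuming "1M" batch
# size means one million tokens per optimizer step.
total_tokens = 14_700_000_000  # 14.7B training tokens
batch_tokens = 1_000_000       # tokens per step (assumed)
print(total_tokens // batch_tokens)  # 14700 optimizer steps
```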
Appendix E Datasheet
We provide a datasheet for OpenWebMath, following the framework in Gebru et al. (2021).
Motivation | |
For what purpose was the dataset created? | The dataset was created to enable the training of large language models on mathematical texts, in order to improve their mathematical reasoning capabilities. |
Who created the dataset and on behalf of which entity? | The dataset was created by the authors of this work. |
Who funded the creation of the dataset? | Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, Fujitsu Limited, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners). Computing resources for model training were provided by EleutherAI and Brigham Young University. |
Any other comment? | None. |
Composition | |
What do the instances that comprise the dataset represent? | The instances are text documents extracted from mathematics-related webpages from Common Crawl. |
How many instances are there in total? | In total, OpenWebMath contains 6.3 million documents. |
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? | OpenWebMath does not contain all text extracted from mathematics-related webpages in Common Crawl, as our filters can miss some such webpages. However, we expect OpenWebMath to contain most of them. |
What data does each instance consist of? | Each instance consists of plain text and metadata including the source URL, the snapshot date, and other extraction parameters. |
Is there a label or target associated with each instance? | No. |
Is any information missing from individual instances? | No. |
Are relationships between individual instances made explicit? | No. |
Are there recommended data splits? | No. |
Are there any errors, sources of noise, or redundancies in the dataset? | Yes, a small portion of the documents from OpenWebMath are not related to mathematics, or contain bad quality content. |
Is the dataset self-contained, or does it link to or otherwise rely on external resources? | The dataset is entirely self-contained. |
Does the dataset contain data that might be considered confidential? | No. |
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? | The data is filtered for quality and we do not expect that this content will be offensive, but since our filters may be imperfect we make no guarantees. |
Collection | |
How was the data associated with each instance acquired? | The data was acquired by processing data from Common Crawl. |
What mechanisms or procedures were used to collect the data? | We refer to the CommonCrawl website (commoncrawl.org) for details on how they collect data. |
If the dataset is a sample from a larger set, what was the sampling strategy? | We use all data from Common Crawl that was available before May 2023. |
Who was involved in the data collection process and how were they compensated? | Keiran Paster and Marco Dos Santos collected the data and were compensated by their respective graduate programs. |
Over what timeframe was the data collected? | OpenWebMath uses shards of CommonCrawl gathered between 2013 and 2023. |
Were any ethical review processes conducted? | No. |
Preprocessing | |
Was any preprocessing/cleaning/labeling of the data done? | Yes. See section 3.5 for details. |
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? | Yes. |
Is the software that was used to preprocess/clean/label the data available? | Yes. See supplementary materials. |
Uses | |
Has the dataset been used for any tasks already? | Yes, the data was used to train 1.4B parameter language models in Section 4. |
Is there a repository that links to any or all papers or systems that use the dataset? | No. |
What (other) tasks could the dataset be used for? | We primarily envision that OpenWebMath could be useful for language model pretraining, finetuning, and evaluation. |
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? | It is possible that the filtering stage of the project discarded valuable documents, such as those not written in English. This makes OpenWebMath suboptimal for creating mathematical models in other languages. |
Are there tasks for which the dataset should not be used? | Any tasks that may be considered irresponsible or harmful. |
Distribution | |
Will the dataset be distributed to third parties outside of the entity on behalf of which the dataset was created? | Yes, the dataset will be available on the Hugging Face Hub for NLP practitioners. |
How will the dataset be distributed? | We will distribute the dataset on the Hugging Face Hub. |
When will the dataset be distributed? | The dataset will be available when the paper is made public. |
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? | The public extract is made available under an ODC-By 1.0 license; users should also abide by the Common Crawl ToU: https://commoncrawl.org/terms-of-use/. |
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? | Not to our knowledge. |
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? | Not to our knowledge. |
Maintenance | |
Who will be supporting/hosting/maintaining the dataset? | The dataset will be hosted on the Hugging Face Hub. |
How can the owner/curator/manager of the dataset be contacted? | keirp@cs.toronto.edu |
Is there an erratum? | No. |
Will the dataset be updated? | No. |
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? | No. |