这是用户在 2024-3-22 17:21 为 https://huggingface.co/blog/cosmopedia 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Cosmopedia: how to create large-scale synthetic data for pre-training

Published March 20, 2024 发布于2024年3月20日
Update on GitHub

In this blog post, we outline the challenges and solutions involved in generating a synthetic dataset with billions of tokens to replicate Phi-1.5, leading to the creation of Cosmopedia. Synthetic data has become a central topic in Machine Learning. It refers to artificially generated data, for instance by large language models (LLMs), to mimic real-world data.

Traditionally, creating datasets for supervised fine-tuning and instruction-tuning required the costly and time-consuming process of hiring human annotators. This practice entailed significant resources, limiting the development of such datasets to a few key players in the field. However, the landscape has recently changed. We've seen hundreds of high-quality synthetic fine-tuning datasets developed, primarily using GPT-3.5 and GPT-4. The community has also supported this development with numerous publications that guide the process for various domains, and address the associated challenges [1][2][3][4][5].
传统上,为监督微调和指令微调创建数据集需要雇用人工标注员的昂贵和耗时的过程。这种做法需要大量资源,限制了这类数据集的开发仅限于该领域的少数主要参与者。然而,最近情况发生了变化。我们看到了数百个高质量的合成微调数据集的开发,主要使用GPT-3.5和GPT-4。该社区还通过众多出版物支持这一发展,为各个领域指导这一过程,并解决相关挑战[ 1][ 2][ 3][ 4][ 5]。

number of datasets with synthetic tag
Figure 1. Datasets on Hugging Face hub with the tag synthetic.
图1. 在Hugging Face hub上带有合成标签的数据集。

However, this is not another blog post on generating synthetic instruction-tuning datasets, a subject the community is already extensively exploring. We focus on scaling from a few thousand to millions of samples that can be used for pre-training LLMs from scratch. This presents a unique set of challenges.

Why Cosmopedia?  为什么是Cosmopedia?

Microsoft pushed this field with their series of Phi models [6][7][8], which were predominantly trained on synthetic data. They surpassed larger models that were trained much longer on web datasets. Phi-2 was downloaded over 617k times in the past month and is among the top 20 most-liked models on the Hugging Face hub.
微软通过他们的一系列 Phi 模型[6][7][8]推动了这个领域,这些模型主要是在合成数据上训练的。它们超越了在网络数据集上训练时间更长的更大型模型。Phi-2 在过去一个月内被下载了超过 617,000 次,是 Hugging Face 中最受欢迎的前 20 个模型之一。

While the technical reports of the Phi models, such as the “Textbooks Are All You Need” paper, shed light on the models’ remarkable performance and creation, they leave out substantial details regarding the curation of their synthetic training datasets. Furthermore, the datasets themselves are not released. This sparks debate among enthusiasts and skeptics alike. Some praise the models' capabilities, while critics argue they may simply be overfitting benchmarks; some of them even label the approach of pre-training models on synthetic data as « garbage in, garbage out». Yet, the idea of having full control over the data generation process and replicating the high-performance of Phi models is intriguing and worth exploring.
虽然 Phi 模型的技术报告,如"Textbooks Are All You Need"论文,阐明了这些模型的出色性能和创造,但它们省略了关于合成训练数据集策划的大量细节。此外,这些数据集本身也未发布。这引发了热衷者和怀疑者之间的争论。一些人赞扬模型的能力,而批评者则认为它们可能只是过度拟合基准;他们甚至将在合成数据上预训练模型的方法称为"垃圾进,垃圾出"。然而,完全控制数据生成过程并复制 Phi 模型的高性能的想法很引人入胜,值得探索。

This is the motivation for developing Cosmopedia, which aims to reproduce the training data used for Phi-1.5. In this post we share our initial findings and discuss some plans to improve on the current dataset. We delve into the methodology for creating the dataset, offering an in-depth look at the approach to prompt curation and the technical stack. Cosmopedia is fully open: we release the code for our end-to-end pipeline, the dataset, and a 1B model trained on it called cosmo-1b. This enables the community to reproduce the results and build upon them.
这就是开发 Cosmopedia 的动机,它旨在复制用于 Phi-1.5 的训练数据。在这篇文章中,我们分享了我们的初步发现,并讨论了改善当前数据集的一些计划。我们深入探讨了创建数据集的方法论,详细介绍了提示策划和技术栈的方法。Cosmopedia 完全开放:我们发布了我们的端到端管道代码、数据集和在此基础上训练的 1B 模型 cosmo-1b。这使得社区能够复制结果并在此基础上构建。

Behind the scenes of Cosmopedia’s creation

Besides the lack of information about the creation of the Phi datasets, another downside is that they use proprietary models to generate the data. To address these shortcomings, we introduce Cosmopedia, a dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.

Heads up: If you are anticipating tales about deploying large-scale generation tasks across hundreds of H100 GPUs, in reality most of the time for Cosmopedia was spent on meticulous prompt engineering.
注意:如果您期待听到关于在数百个H100 GPU上部署大规模生成任务的故事,事实上,Cosmopedia大部分时间都花在了精心设计提示上。

Prompts curation  提示词策划

Generating synthetic data might seem straightforward, but maintaining diversity, which is crucial for optimal performance, becomes significantly challenging when scaling up. Therefore, it's essential to curate diverse prompts that cover a wide range of topics and minimize duplicate outputs, as we don’t want to spend compute on generating billions of textbooks only to discard most because they resemble each other closely. Before we launched the generation on hundreds of GPUs, we spent a lot of time iterating on the prompts with tools like HuggingChat. In this section, we'll go over the process of creating over 30 million prompts for Cosmopedia, spanning hundreds of topics and achieving less than 1% duplicate content.

Cosmopedia aims to generate a vast quantity of high-quality synthetic data with broad topic coverage. According to the Phi-1.5 technical report, the authors curated 20,000 topics to produce 20 billion tokens of synthetic textbooks while using samples from web datasets for diversity, stating:

We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity.
我们仔细选择了 20K 个主题来生成这个新的合成数据。在我们的生成提示中,我们使用来自网络数据集的样本来增加多样性。

Assuming an average file length of 1000 tokens, this suggests using approximately 20 million distinct prompts. However, the methodology behind combining topics and web samples for increased diversity remains unclear.

We combine two approaches to build Cosmopedia’s prompts: conditioning on curated sources and conditioning on web data. We refer to the source of the data we condition on as “seed data”.
我们结合两种方法来构建 Cosmopedia 的提示:以经过策划的来源为条件,以及以网络数据为条件。我们将我们以此为条件的数据来源称为"种子数据"。

piecharts of data sources
Figure 2. The distribution of data sources for building Cosmopedia prompts (left plot) and the distribution of sources inside the Curated sources category (right plot).
图2. 用于构建Cosmopedia提示的数据源分布(左图)以及Curated sources类别内部的源分布(右图)。

Curated Sources  精选来源

We use topics from reputable educational sources such as Stanford courses, Khan Academy, OpenStax, and WikiHow. These resources cover many valuable topics for an LLM to learn. For instance, we extracted the outlines of various Stanford courses and constructed prompts that request the model to generate textbooks for individual units within those courses. An example of such a prompt is illustrated in figure 3.
我们使用来自斯坦福课程、可汗学院、OpenStax 和 WikiHow 等知名教育资源的主题。这些资源涵盖了许多有价值的主题,供<b1001>学习。例如,我们提取了斯坦福课程的各种大纲,并构建了要求模型生成这些课程单元教科书的提示。图 3 中展示了这样一个提示示例。

Although this approach yields high-quality content, its main limitation is scalability. We are constrained by the number of resources and the topics available within each source. For example, we can extract only 16,000 unique units from OpenStax and 250,000 from Stanford. Considering our goal of generating 20 billion tokens, we need at least 20 million prompts!

Leverage diversity in audience and style

One strategy to increase the variety of generated samples is to leverage the diversity of audience and style: a single topic can be repurposed multiple times by altering the target audience (e.g., young children vs. college students) and the generation style (e.g., academic textbook vs. blog post). However, we discovered that simply modifying the prompt from "Write a detailed course unit for a textbook on 'Why Go To Space?' intended for college students" to "Write a detailed blog post on 'Why Go To Space?'" or "Write a textbook on 'Why Go To Space?' for young children" was insufficient to prevent a high rate of duplicate content. To mitigate this, we emphasized changes in audience and style, providing specific instructions on how the format and content should differ.

Figure 3 illustrates how we adapt a prompt based on the same topic for different audiences.

comparison of prompts
Figure 3. Prompts for generating the same textbook for young children vs for professionals and researchers vs for high school students.
图3. 为幼儿、专业人士和研究人员以及高中生生成相同教科书的提示。

By targeting four different audiences (young children, high school students, college students, researchers) and leveraging three generation styles (textbooks, blog posts, wikiHow articles), we can get up to 12 times the number of prompts. However, we might want to include other topics not covered in these resources, and the small volume of these sources still limits this approach and is very far from the 20+ million prompts we are targeting. That’s when web data comes in handy; what if we were to generate textbooks covering all the web topics? In the next section, we’ll explain how we selected topics and used web data to build millions of prompts.
通过针对四个不同的受众(幼儿、高中生、大学生、研究人员)并利用三种生成方式(教科书、博客文章、wikiHow 文章),我们可以获得高达 12 倍的提示数量。然而,我们可能需要包括这些资源中未涵盖的其他主题,而这些来源的小量仍然限制了这种方法,远远达不到我们的 20 多万个提示的目标。这时网络数据就派上用场了;如果我们生成涵盖所有网络主题的教科书会怎么样?在下一节中,我们将解释我们如何选择主题并使用网络数据来构建数百万个提示。

Web data  网络数据

Using web data to construct prompts proved to be the most scalable, contributing to over 80% of the prompts used in Cosmopedia. We clustered millions of web samples, using a dataset like RefinedWeb, into 145 clusters, and identified the topic of each cluster by providing extracts from 10 random samples and asking Mixtral to find their common topic. More details on this clustering are available in the Technical Stack section.

We inspected the clusters and excluded any deemed of low educational value. Examples of removed content include explicit adult material, celebrity gossip, and obituaries. The full list of the 112 topics retained and those removed can be found here.

We then built prompts by instructing the model to generate a textbook related to a web sample within the scope of the topic it belongs to based on the clustering. Figure 4 provides an example of a web-based prompt. To enhance diversity and account for any incompleteness in topic labeling, we condition the prompts on the topic only 50% of the time, and change the audience and generation styles, as explained in the previous section. We ultimately built 23 million prompts using this approach. Figure 5 shows the final distribution of seed data, generation formats, and audiences in Cosmopedia.

web prompt
Figure 4. Example of a web extract and the associated prompt.
图4. 网页摘录示例及相关提示。

Figure 5. The distribution of seed data, generation format and target audiences in Cosmopedia dataset.
图5. Cosmopedia数据集中种子数据、生成格式和目标受众的分布。

In addition to random web files, we used samples from AutoMathText, a carefully curated dataset of Mathematical texts with the goal of including more scientific content.
除了随机网页文件外,我们还使用了 AutoMathText 的样本,这是一个精心策划的数学文本数据集,目的是包含更多科学内容。

Instruction datasets and stories

In our initial assessments of models trained using the generated textbooks, we observed a lack of common sense and fundamental knowledge typical of grade school education. To address this, we created stories incorporating day-to-day knowledge and basic common sense using texts from the UltraChat and OpenHermes2.5 instruction-tuning datasets as seed data for the prompts. These datasets span a broad range of subjects. For instance, from UltraChat, we used the "Questions about the world" subset, which covers 30 meta-concepts about the world. For OpenHermes2.5, another diverse and high-quality instruction-tuning dataset, we omitted sources and categories unsuitable for storytelling, such as glaive-code-assist for programming and camelai for advanced chemistry. Figure 6 shows examples of prompts we used to generate these stories.

stories prompts
Figure 6. Prompts for generating stories from UltraChat and OpenHermes samples for young children vs a general audience vs reddit forums.
图6. 用于从UltraChat和OpenHermes样本为幼儿、普通观众和Reddit论坛生成故事的提示。

That's the end of our prompt engineering story for building 30+ million diverse prompts that provide content with very few duplicates. The figure below shows the clusters present in Cosmopedia, this distribution resembles the clusters in the web data.
这就是我们构建 30 多万个多样化提示的提示工程故事的结尾。下图显示了 Cosmopedia 中存在的聚类,这种分布与网络数据中的聚类相似。

Figure 7. The clusters of Cosmopedia, annotated using Mixtral.
图7. 使用Mixtral注释的Cosmopedia集群。

You can use the dataset viewer to investigate the dataset yourself:

dataset viewer
Figure 8. Cosmopedia's dataset viewer.
图8. Cosmopedia的数据集查看器。

Technical stack  技术栈

We release all the code used to build Cosmopedia in: https://github.com/huggingface/cosmopedia

In this section we'll highlight the technical stack used for text clustering, text generation at scale and for training cosmo-1b model.

Topics clustering  主题聚类

We used text-clustering repository to implement the topic clustering for the web data used in Cosmopedia prompts. The plot below illustrates the pipeline for finding and labeling the clusters. We additionally asked Mixtral to give the cluster an educational score out of 10 in the labeling step; this helped us in the topics inspection step. You can find a demo of the web clusters and their scores in this demo.
我们使用文本聚类存储库为 Cosmopedia 提示中使用的网络数据实现了主题聚类。下图说明了查找和标记聚类的管道。在标记步骤中,我们还要求 Mixtral 给聚类一个 10 分制的教育评分;这有助于我们在主题检查步骤中。您可以在此演示中找到网络聚类及其评分的演示。

Figure 9. The pipleline of text-clustering.
图9. 文本聚类的流程。

Textbooks generation at scale

We leverage the llm-swarm library to generate 25 billion tokens of synthetic content using Mixtral-8x7B-Instruct-v0.1. This is a scalable synthetic data generation tool using local LLMs or inference endpoints on the Hugging Face Hub. It supports TGI and vLLM inference libraries. We deployed Mixtral-8x7B locally on H100 GPUs from the Hugging Face Science cluster with TGI. The total compute time for generating Cosmopedia was over 10k GPU hours.
我们利用 llm-swarm 库使用 Mixtral-8x7B-Instruct-v0.1 生成了 250 亿个合成令牌。这是一个可扩展的合成数据生成工具,可以使用 Hugging Face Hub 上的本地 LLMs 或推理端点。它支持 TGI 和 vLLM 推理库。我们在 Hugging Face Science 集群的 H100 GPU 上本地部署了 Mixtral-8x7B,并使用了 TGI。生成 Cosmopedia 的总计算时间超过 10,000 GPU 小时。

Here's an example to run generations with Mixtral on 100k Cosmopedia prompts using 2 TGI instances on a Slurm cluster:
以下是在 Slurm 集群上使用 2 个 TGI 实例对 100k 个 Cosmopedia 提示进行生成的示例:

# clone the repo and follow installation requirements 
cd llm-swarm
python ./examples/textbooks/generate_synthetic_textbooks.py \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --instances 2 \
    --prompts_dataset "HuggingFaceTB/cosmopedia-100k" \
    --prompt_column prompt \
    --max_samples -1 \
    --checkpoint_path "./tests_data" \
    --repo_id "HuggingFaceTB/generations_cosmopedia_100k" \
    --checkpoint_interval 500

You can even track the generations with wandb to monitor the throughput and number of generated tokens.
您可以使用 wandb 跟踪生成的代数,以监控吞吐量和生成的标记数。

Figure 10. Wandb plots for an llm-swarm run.
图10. 一个llm-swarm运行的Wandb图。

Note: We used HuggingChat for the initial iterations on the prompts. Then, we generated a few hundred samples for each prompt using llm-swarm to spot unusual patterns. For instance, the model used very similar introductory phrases for textbooks and frequently began stories with the same phrases, like "Once upon a time" and "The sun hung low in the sky". Explicitly asking the model to avoid these introductory statements and to be creative fixed the issue; they were still used but less frequently.
注意:我们在提示的初始迭代中使用了HuggingChat。然后,我们使用 llm-swarm 为每个提示生成了几百个样本,以发现异常模式。例如,该模型对教科书使用了非常相似的引言短语,并且经常以相同的短语开始故事,如"从前有一次"和"太阳低悬在天空中"。明确要求该模型避免这些引言陈述并保持创意解决了这个问题;它们仍然被使用,但频率较低。

Benchmark decontamination

Given that we generate synthetic data, there is a possibility of benchmark contamination within the seed samples or the model's training data. To address this, we implement a decontamination pipeline to ensure our dataset is free of any samples from the test benchmarks.

Similar to Phi-1, we identify potentially contaminated samples using a 10-gram overlap. After retrieving the candidates, we employ difflib.SequenceMatcher to compare the dataset sample against the benchmark sample. If the ratio of len(matched_substrings) to len(benchmark_sample) exceeds 0.5, we discard the sample. This decontamination process is applied across all benchmarks evaluated with the Cosmo-1B model, including MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-Easy, and ARC-Challenge.
类似于Phi-1,我们使用10克重叠来识别可能受污染的样本。在检索候选项后,我们使用 difflib.SequenceMatcher 来比较数据集样本和基准样本。如果 len(matched_substrings)len(benchmark_sample) 的比率超过0.5,我们会丢弃该样本。这种去污染过程应用于使用Cosmo-1B模型评估的所有基准,包括MMLU、HellaSwag、PIQA、SIQA、Winogrande、OpenBookQA、ARC-Easy和ARC-Challenge。

We report the number of contaminated samples removed from each dataset split, as well as the number of unique benchmark samples that they correspond to (in brackets):

Dataset group 数据集组 ARC BoolQ 布尔问题 HellaSwag 地道的翻译如下: 地狱般的东西 PIQA
web data + stanford + openstax
网络数据 + 斯坦福 + 开放教材
49 (16) 386 (41) 6 (5) 5 (3)
auto_math_text + khanacademy
自动数学文本 + 可汗学院
17 (6) 34 (7) 1 (1) 0 (0)
stories 故事 53 (32) 27 (21) 3 (3) 6 (4)

We find less than 4 contaminated samples for MMLU, OpenBookQA and WinoGrande.

Training stack  训练栈

We trained a 1B LLM using Llama2 architecure on Cosmopedia to assess its quality: https://huggingface.co/HuggingFaceTB/cosmo-1b.
我们使用 Llama2 架构在 Cosmopedia 上训练了一个 10 亿参数的模型以评估其质量:https://huggingface.co/HuggingFaceTB/cosmo-1b。

We used datatrove library for data deduplication and tokenization, nanotron for model training, and lighteval for evaluation.

The model performs better than TinyLlama 1.1B on ARC-easy, ARC-challenge, OpenBookQA, and MMLU and is comparable to Qwen-1.5-1B on ARC-challenge and OpenBookQA. However, we notice some performance gaps compared to Phi-1.5, suggesting a better synthetic generation quality, which can be related to the LLM used for generation, topic coverage, or prompts.
该模型在 ARC-easy、ARC-challenge、OpenBookQA 和 MMLU 上的表现优于 TinyLlama 1.1B,并且在 ARC-challenge 和 OpenBookQA 上与 Qwen-1.5-1B 相当。然而,我们注意到与 Phi-1.5 相比存在一些性能差距,这可能与用于生成的 LLM、主题覆盖范围或提示有关,表明合成生成质量更好。

evaluation results
Figure 10. Evaluation results of Cosmo-1B.
图10. Cosmo-1B的评估结果。

Conclusion & next steps

In this blog post, we outlined our approach for creating Cosmopedia, a large synthetic dataset designed for pre-training models, with the goal of replicating the Phi datasets. We highlighted the significance of meticulously crafting prompts to cover a wide range of topics, ensuring the generation of diverse content. Additionally, we have shared and open-sourced our technical stack, which allows for scaling the generation process across hundreds of GPUs.

However, this is just the initial version of Cosmopedia, and we are actively working on enhancing the quality of the generated content. The accuracy and reliability of the generations largely depends on the model used in the generation. Specifically, Mixtral may sometimes hallucinate and produce incorrect information, for example when it comes to historical facts or mathematical reasoning within the AutoMathText and KhanAcademy subsets. One strategy to mitigate the issue of hallucinations is the use of retrieval augmented generation (RAG). This involves retrieving information related to the seed sample, for example from Wikipedia, and incorporating it into the context. Hallucination measurement methods could also help assess which topics or domains suffer the most from it [9]. It would also be interesting to compare Mixtral’s generations to other open models.
以下是简体中文翻译: 然而,这只是Cosmopedia的初始版本,我们正在积极努力提高生成内容的质量。生成的准确性和可靠性在很大程度上取决于所使用的模型。具体而言,Mixtral有时会产生幻觉并生成不正确的信息,例如在涉及历史事实或AutoMathText和KhanAcademy子集中的数学推理时。缓解幻觉问题的一种策略是使用检索增强生成(RAG)。这涉及检索与种子样本相关的信息(例如来自维基百科),并将其纳入上下文。幻觉测量方法也可能有助于评估哪些主题或领域受其影响最大[9]。比较Mixtral的生成结果与其他开放模型也很有趣。

The potential for synthetic data is immense, and we are eager to see what the community will build on top of Cosmopedia.

References  参考文献

[1] Ding et al. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. URL https://arxiv.org/abs/2305.14233
[1] 丁等人。通过扩展高质量的指导性对话来增强聊天语言模型。URL https://arxiv.org/abs/2305.14233

[2] Wei et al. Magicoder: Source Code Is All You Need. URL https://arxiv.org/abs/2312.02120
[2] 魏等人。Magicoder:源代码就是你所需要的。URL https://arxiv.org/abs/2312.02120

[3] Toshniwal et al. OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset. URL https://arxiv.org/abs/2402.10176
[3] Toshniwal等人。OpenMathInstruct-1:一个180万数学指令调整数据集。URL https://arxiv.org/abs/2402.10176

[4] Xu et al. WizardLM: Empowering Large Language Models to Follow Complex Instructions. URL https://arxiv.org/abs/2304.12244
[4] 徐等人。WizardLM:增强大型语言模型以遵循复杂指令。URL https://arxiv.org/abs/2304.12244

[5] Moritz Laurer. Synthetic data: save money, time and carbon with open source. URL https://huggingface.co/blog/synthetic-data-save-cost
[5] Moritz Laurer. 合成数据:使用开源节省金钱、时间和碳排放。URL https://huggingface.co/blog/synthetic-data-save-cost

[6] Gunasekar et al. Textbooks Are All You Need. URL https://arxiv.org/abs/2306.11644
[6] Gunasekar 等人。教科书就是你所需要的。URL https://arxiv.org/abs/2306.11644

[7] Li et al. Textbooks are all you need ii: phi-1.5 technical report. URL https://arxiv.org/abs/2309.05463
[7] 李等人。教科书就是你所需要的ii:phi-1.5技术报告。网址 https://arxiv.org/abs/2309.05463

[8] Phi-2 blog post. URL https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
[8] Phi-2博客文章。网址 https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

[9] Manakul, Potsawee and Liusie, Adian and Gales, Mark JF. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. URL https://arxiv.org/abs/2303.08896
[9] Manakul, Potsawee 和 Liusie, Adian 以及 Gales, Mark JF. Selfcheckgpt: 面向生成式大型语言模型的零资源黑盒幻觉检测. URL https://arxiv.org/abs/2303.08896