License: CC BY 4.0
arXiv:2407.06645v3 [cs.LG] 11 Jul 2024


Entropy Law: The Story Behind Data Compression and LLM Performance

Mingjia Yin, Chuhan Wu*†, Yufei Wang, Hao Wang*†,

Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

State Key Laboratory of Cognitive Intelligence & University of Science and Technology of China

Noah’s Ark Lab, Huawei

{mingjia-yin, wanghao3, cheneh}@ustc.edu.cn
{wuchuhan1, tangruiming}@huawei.com
* Corresponding authors. † Equal contribution.
Abstract

Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on evaluating the quality of individual samples in data selection, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combinations may be suboptimal in teaching LLMs due to their intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationships between LLM performance and data selection. Inspired by the information compression nature of LLMs, we uncover an “entropy law” that connects LLM performance with data compression ratio and first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of inherent knowledge encoded in this dataset, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose an efficient and universal data selection method named ZIP for training LLMs, which aims to prioritize data subsets exhibiting a low compression ratio. Based on a multi-stage algorithm that selects diverse data in a greedy manner, we can obtain a good data subset with satisfactory diversity. Extensive experiments have been conducted to validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training.
[1] Code can be found at https://github.com/USTC-StarTeam/ZIP.

1 Introduction

In recent years, Large Language Models (LLMs) have gained significant attention from both academia and industry, applied in various domains, such as chatbots (Ouyang et al., 2022; Achiam et al., 2023), chemistry tools (M. Bran et al., 2024), and programming assistants (GitHub, 2020). The great success of LLMs depends on their general intelligence obtained from a vast amount of data collected from various sources (Albalak et al., 2024; Wang et al., 2023c). Through pretraining on trillions of tokens to master diverse knowledge and tuning on smaller instruction data to align models with human preference, LLMs can effectively utilize their knowledge to follow user instructions, do commonsense reasoning, and solve real-world problems (Zhao et al., 2023).

However, not all data are useful for teaching LLMs, especially when computational resources are limited (Albalak et al., 2024). For example, we can better elicit the capability of LLMs by fine-tuning them on carefully curated samples rather than a large but noisy data collection (Ouyang et al., 2022; Chowdhery et al., 2023; Meta, 2020; Zhou et al., 2023). However, selecting the proper data for LLM training is quite complicated and abstruse, since the space of data preprocessing and combination is almost unlimited. Due to the huge computational overhead of LLM training, manual or empirical data selection based on trial-and-error feedback is rather cumbersome and even impractical. Therefore, automatic data selection methods are necessary for LLM development under limited computational budgets.

Intuitively, high-quality samples are expected to have better efficiency in teaching LLMs. For example, the successful practice of LIMA (Zhou et al., 2023) shows the powerful effect of data quality on LLM performance that can surpass the amount of data. Therefore, existing methods usually focus on quality-oriented data selection, based either on heuristic rules (Raffel et al., 2020; Rae et al., 2021; Xie et al., 2023; Chowdhery et al., 2023; Li et al., 2023) or evaluation models (Wettig et al., 2024; Chen et al., 2023; Lu et al., 2023; Liu et al., 2023; Cui et al., 2023). Heuristic methods typically involve hand-crafted rules (e.g., sentence number (Raffel et al., 2020), word count (Rae et al., 2021), length (Shen, 2024)) to evaluate data across multiple dimensions. Model-based approaches, on the contrary, rely on well-established LLMs such as GPT-4 (Achiam et al., 2023) to provide quality assessments of training samples in different views, such as direct scoring (Chen et al., 2023), task tagging (Lu et al., 2023), and pairwise scoring (Liu et al., 2023). However, most of these approaches evaluate different data samples independently, which neglects the intricate combinatorial effects among samples. As illustrated in Figure 1, even if each sample is in perfect quality, their combinations may still be suboptimal due to their mutual information redundancy or inconsistency. Although the quality-based subset is composed of all three good samples, the knowledge they encode is actually redundant and conflicting. In contrast, another data subset composed of several relatively lower-quality but diverse samples may convey more information than the above subset in the teaching of LLMs. Therefore, quality-based data selection does not fully align with the goal of maximizing the knowledge mastery of LLMs.

Figure 1: An illustrative example describing different data selection paradigms. Quality-based data selection relies on sample-level data quality measurements while overlooking combinatorial effects among samples. Information-amount-based selection aims to select samples maximizing the overall information amount.

In many recent studies, researchers have shown that the basic mechanism of auto-regressive language modeling in LLMs is information compression (Delétang et al., 2023; Huang et al., 2024). Thus, the knowledge condensed by LLMs actually depends on the effective information encoded by training data. This intuition opens another direction of data selection, i.e., selection based on the effective information amount of data. In this paper, we uncover the underlying relations between LLM performance and data homogeneity, which can be measured by various canonical lossless compression algorithms (e.g., DEFLATE in ZIP). Through both theoretical analysis and empirical experiments, we formulate the “entropy law”, which shows that the compression ratio of training data is a decisive factor affecting model performance, if the overall quality and consistency of selected samples remain unchanged. Motivated by the entropy law, we propose an effective and efficient data selection algorithm called ZIP to select heterogeneous data with a low compression ratio, which aims to maximize the amount of effective information for LLM learning. Specifically, we devise a multi-stage greedy strategy to find an approximate solution that guarantees a low compression ratio without exhausting all possible combinations, and it iterates continuously until we obtain a predetermined number of samples. In each iteration, ZIP performs preliminary filtering to choose a smaller pool of candidates, and then selects a few samples from the reduced pool that minimize the compression ratio of the selected dataset in a cascaded manner. By training LLMs on a collection of diverse samples that encode heterogeneous and complementary information, the capabilities of LLMs can be better elicited. Extensive experiments on different LLM backbones at different stages of LLM alignment demonstrate the superiority of ZIP over various quality-based baselines. We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training, which can effectively reduce the computational overhead in LLM development.

2 Related Works

2.1 Language Modeling and Information Compression

The relationship between language modeling and data compression has long intrigued researchers (Shannon, 1948, 1951). Pandey (2024) has identified a data-dependent scaling law that takes data’s gzip compressibility into consideration. Besides, recent empirical studies have confirmed that language models can act as versatile data compressors (Delétang et al., 2023), and the intelligence of LLMs can be quantified by their capacity for text compression (Huang et al., 2024). Let a text corpus be generated from an underlying distribution $\rho$. A lossless compression algorithm $\mathcal{C}$ is then expected to encode a text sequence $x_{1:n}$ into a bitstream $\mathcal{C}(x_{1:n})$ of minimal length, ensuring that $x_{1:n}$ can be perfectly recovered from $\mathcal{C}(x_{1:n})$. The expected number of bits of an optimal $\mathcal{C}$ is equal to $\mathbb{E}_{x\sim\rho}[-\log_2\rho(x)]=\mathbb{E}_{x\sim\rho}[-\sum_{i=1}^{n}\log_2\rho(x_i|x_{1:i-1})]$ (Shannon, 1948). The underlying distribution $\rho$ is usually unknown in reality, but it can be estimated by a language model $\rho_{\text{model}}$. Then the expected number of bits of an optimal $\mathcal{C}$ can be updated:

$\mathbb{E}_{x\sim\rho}\big[-\sum_{i=1}^{n}\log_2\rho_{\text{model}}(x_i|x_{1:i-1})\big].$  (1)

Equation 1 is the cross-entropy loss employed in training LLMs, thereby establishing a coherent relationship between LLMs and information compression. This foundational insight paves the way for this work.
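To make Eq. 1 concrete, the following is a minimal illustrative Python sketch (our addition, not part of the original paper): a toy Laplace-smoothed character bigram model stands in for $\rho_{\text{model}}$, and the code length it assigns to a string is $-\sum_i \log_2\rho_{\text{model}}(x_i|x_{1:i-1})$ bits; averaging this quantity over sequences drawn from the corpus estimates the expectation in Eq. 1, and a better model yields a shorter code.

```python
import math
from collections import defaultdict

def train_bigram(corpus, alpha=1.0):
    """Fit a Laplace-smoothed character bigram model p(x_i | x_{i-1})."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for text in corpus:
        padded = "\x00" + text  # "\x00" marks the start of a sequence
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1.0
    V = len(vocab)

    def prob(prev, cur):
        c = counts[prev]
        return (c.get(cur, 0.0) + alpha) / (sum(c.values()) + alpha * V)

    return prob

def code_length_bits(text, prob):
    """-sum_i log2 p_model(x_i | x_{1:i-1}): the ideal code length of one sequence."""
    padded = "\x00" + text
    return -sum(math.log2(prob(prev, cur)) for prev, cur in zip(padded, padded[1:]))

corpus = ["the cat sat on the mat", "the dog sat on the log"]
prob = train_bigram(corpus)
for s in ["the cat sat", "zq xv wj kp"]:
    bits = code_length_bits(s, prob)
    print(f"{s!r}: {bits:.1f} bits ({bits / len(s):.2f} bits/char)")
```

An in-distribution string receives far fewer bits per character than an out-of-distribution one, mirroring the fact that the cross-entropy loss in Eq. 1 is exactly the expected code length under the model.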

2.2 Alignment of Large Language Models

Large Language Models (LLMs) have recently gained significant attention from academia and industry. LLM alignment, which includes supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF), has emerged as a crucial technique for adapting LLMs to end tasks using natural language instructions (Zhao et al., 2023; Wang et al., 2023c). Alignment is performed using instruction datasets consisting of multiple (Instruction, Output) pairs, which require LLMs to follow the instructions and generate corresponding outputs. Early explorations have focused on constructing or expanding instruction datasets through methods such as crowd-sourcing (Wang et al., 2022; Köpf et al., 2024), self-instruction (Taori et al., 2023; Peng et al., 2023; Wang et al., 2023b), or the combination of existing datasets (Wang et al., 2023a; Ivison et al., 2023). Fine-tuned LLMs on these datasets have demonstrated promising capabilities to adhere to instructions across various contexts and align with human expectations.

2.3 Data Selection for LLM Alignment

A growing body of research has emphasized the importance of selecting appropriate data for LLM alignment, which can prevent potential quality issues and optimize computational resource allocation. As a prominent example, LIMA (Zhou et al., 2023) has demonstrated superior performance by carefully crafting only 1,000 high-quality samples for SFT, highlighting the crucial importance of data quality. The current literature on selecting alignment data has focused on selecting samples according to individual sample quality, which can be categorized into heuristic methods (Shen, 2024) and model-based methods (Chen et al., 2023; Lu et al., 2023; Liu et al., 2023; Li et al., 2023, 2024; Du et al., 2023). Heuristic methods typically employ specific criteria, such as response length (Shen, 2024), to guide data selection. On the other hand, model-based methods adopt various strategies to leverage the capabilities of established language models for evaluating sample quality. For example, IFD (Li et al., 2023) measures the change in response loss when instructions are removed, and selects those with the most significant changes. Building upon IFD, SuperFiltering (Li et al., 2024) introduces a lightweight proxy model for a more efficient calculation of the IFD score. In addition, other model-based methods employ proprietary LLMs to assess data quality. In a pioneering work, AlpaGasus (Chen et al., 2023) uses ChatGPT directly to assign data quality scores to samples, while #InsTag (Lu et al., 2023) proposes assigning tags to each sample using ChatGPT and evaluates sample quality based on the number of tags. DEITA (Liu et al., 2023) uses ChatGPT-generated data to train two Llama-based scorers, assigning complexity and quality scores to each sample, and ultimately selecting samples with the highest hybrid scores. However, existing methods are mainly designed to pick data based on sample-wise quality measurements, which are usually weak in reflecting the overall dataset quality. In this paper, we focus on the relation between performance and dataset quality, which can be efficiently measured by data compression metrics.

3 Entropy Law: Connecting Model Performance with Data Compression

In this section, we provide some theoretical analysis of the relations between data compression and LLM performance. Intuitively, the correctness and diversity of the training data would affect the performance of the final model. Meanwhile, the performance of an LLM may be suboptimal if the data have severe intrinsic conflicts or the model has poor mastery of the information encoded by the data, which can be indicated by the training loss. Based on these assumptions, we denote the performance of an LLM as $Z$, which is expected to be influenced by the following factors:

• Data compression ratio $R$: This metric can be derived by dividing the pre-compression data size by the post-compression size, which can be computed by various off-the-shelf compression algorithms. Intuitively, a dataset with a lower compression ratio indicates a higher information density (a minimal sketch of computing $R$ with a standard compressor is given after this list).


• Training loss $L$: Indicates whether the data are hard for the model to memorize. Given the same base model, a high training loss is usually due to noisy or inconsistent information in the dataset. In practice, the average loss of a small number of training steps in the first training epoch is sufficient to produce an indicative $L$ value so that the model does not overfit the data.


• Data consistency $C$: The consistency of data is reflected by the entropy of the probability of the next token given the previous contexts. Higher data consistency usually yields a lower training loss. The performance of LLMs is usually suboptimal if the dataset has poor consistency.[2]
[2] Assume we have two question-answer pairs $(q_1,a_1)$ and $(q_2,a_2)$. When an LLM updates on each QA pair, it learns the mutual information (MI) between the question and answer, i.e., $I(q_1;a_1)$ and $I(q_2;a_2)$. When it updates on both, it learns the joint MI $I(q_1q_2;a_1a_2)$. If $q_1$ and $q_2$ are not independent, we have $I(q_1q_2;a_1a_2) < I(q_1;a_1) + I(q_2;a_2)$, whose detailed derivation can be found in Appendix A. This implies that the total knowledge learned by LLMs is narrowed if the answers to similar questions are highly inconsistent.


• Average data quality $Q$: This reflects the average sample-level quality of the data, which can be measured through various objective and subjective aspects.


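As referenced in the first factor above, the compression ratio $R$ can be computed with any off-the-shelf lossless compressor. The snippet below is a minimal sketch (our illustration, not the paper's code) using Python's built-in zlib, whose DEFLATE algorithm is the same one used in ZIP archives; since the bits-per-byte factor cancels, the byte ratio equals the bit ratio, and a more redundant dataset yields a larger $R$.

```python
import random
import string
import zlib

def compression_ratio(samples, level=9):
    """R = Bits(D) / Bits(C(D)) for a list of text samples, using DEFLATE.
    The x8 bits-per-byte factor cancels, so the byte ratio equals the bit ratio."""
    raw = "\n".join(samples).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level))

random.seed(0)
redundant = ["The quick brown fox jumps over the lazy dog."] * 50          # highly redundant
low_redundancy = ["".join(random.choices(string.ascii_lowercase + " ", k=45))
                  for _ in range(50)]                                       # near-random text

print(f"redundant corpus:      R = {compression_ratio(redundant):.2f}")       # large R
print(f"low-redundancy corpus: R = {compression_ratio(low_redundancy):.2f}")  # close to 1
```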

Given a certain amount of training data, the model performance can be estimated by the above factors:

$Z \propto f(R, L, C, Q),$  (2)

where $f$ is a hidden function. Given a specific base model, the scale of $L$ usually depends on $R$ and $C$, which can be formulated as:

$L \propto g(R, C).$  (3)

$L$ is expected to be monotonic in $R$ and $C$, since a dataset with higher homogeneity or better data consistency is easier for a model to learn. Thus, we can rewrite the above formula as follows:

$C \propto g'(R, L),$  (4)

where $g'$ is an inverse function. By combining the three above equations, we have:

$Z \propto f(R, L, g'(R, L), Q) \propto h(R, L, Q),$  (5)

where $h$ is another hidden function. If a data selection method does not substantially change the average data quality $Q$, we can approximately regard the variable $Q$ as a constant. Therefore, the final performance can be roughly formulated as follows:

$Z \propto h'(R, L),$  (6)

which means that the model performance is correlated with the data compression ratio and training loss. We name this relationship as “Entropy Law”.

We can draw two deductions from the entropy law:

• If we further regard the data consistency as a constant, the training loss is directly influenced by the compression ratio (Eq. 3). Thus, the model performance is controlled by the compression ratio: $Z$ is usually worse if the data compression ratio $R$ is higher, which will be validated by our experiments.


• Given the same compression ratio $R$, a higher training loss means a lower data consistency. Thus, the effective knowledge learned by the model may be more limited. This can be used to predict the performance of an LLM on different data with similar compression ratios and sample qualities. We will show later the application of this deduction in our practice.



Notably, entropy law reveals a coherent connection between downstream model performance and data compression ratio, setting it apart from the previously proposed data-dependent scaling law by Pandey (2024). Building upon the entropy law, we derive a data selection algorithm in Section 4 and demonstrate its application in practical large-scale LLM development in Section 5.3.

4 ZIP: Lightweight Data Selection for LLM Alignment

Guided by the findings of the entropy law, we propose an effective and efficient method named ZIP to select data samples based on data compression ratios, which aims to maximize the amount of effective information given a limited training data budget. Although there exists a subset with the lowest compression ratio, it is impractical to find it due to the huge combination space of data samples. Thus, we propose an iterative multi-stage greedy algorithm to efficiently obtain an approximate solution with a relatively low compression ratio. In each iteration, we first use a global selection stage to choose a pool of candidate samples that have low compression ratios, which aims to find samples with high information density. We then employ a coarse-grained local selection stage incorporating a smaller set of samples with the lowest redundancy with already selected samples. Finally, we use a fine-grained local selection stage that minimizes the similarity between samples to add. The above process is conducted until we obtain a sufficient size of data. The workflow of our method is summarized in Algorithm 1, whose details are introduced as follows.

4.1 Global Selection

In general, we maintain an information redundancy state $\pi_{\mathcal{D}}$ that evaluates the “information gain” of each sample. Intuitively, data with high intra-sample information redundancy are unlikely to have good global diversity. For example, a sample with repeated patterns or echoed conversation turns usually has low educational value in LLM training. Thus, we initialize this state by calculating the sample-level compression ratio for the entire dataset $\mathcal{D}$. In each iteration, we select $K_1$ samples with the lowest scores in $\pi_{\mathcal{D}}$ to form an initial candidate pool $\mathcal{D}_{K_1}$, which provides a good set for subsequent local selection.

4.2 Local Coarse-grained Selection

Since the global selection does not well consider the mutual relations among samples, we further conduct local selection to pick diverse samples. To ensure good computational efficiency, we introduce a coarse-grained selection phase to narrow the candidate pool into a smaller one with $K_2$ samples. We first compute the compression ratio of a merged set that adds each sample in $\mathcal{D}_{K_1}$ to the selected set $\mathcal{D}'$. We use this score to update the information redundancy state $\pi_{\mathcal{D}}$ to better indicate the current information gain of these samples. Based on the scores of the samples in $\mathcal{D}_{K_1}$, we select $K_2$ samples with the lowest scores. These samples form a small subset for final fine-grained selection, where each sample has good diversity with the selected dataset $\mathcal{D}'$.

Algorithm 1: Pseudocode of ZIP

Require: The original dataset $\mathcal{D}$ of size $N$; a data compressor $\mathcal{C}$; the number of selected samples $m$; the number of compression ratios to be updated in the global selection $K_1$; the data pool size of local selection $K_2$; the data budget of local selection $K_3$; the compression ratio calculation function $g(\mathcal{C}(D)) = \text{Bits}(D) / \text{Bits}(\mathcal{C}(D))$, which measures the compression ratio by the number of bits.
Ensure: A data subset $\mathcal{D}'$ of size $m$ with a relatively low compression ratio.

1:  Init $\mathcal{D}'$ as an empty set $\Phi$
2:  Init $\pi_{\mathcal{D}} = \{g(d_0), g(d_1), \dots, g(d_N)\}$ by calculating the compression ratio of each sample in $\mathcal{D}$
3:  while $|\mathcal{D}'| < m$ do
4:      // Stage 1: Global Selection
5:      $\mathcal{D}_{K_1}$ = Bottom-K($\mathcal{D} \setminus \mathcal{D}'$, $\pi_{\mathcal{D} \setminus \mathcal{D}'}$, $K_1$)   // Select $K_1$ samples with the lowest scores
6:      // Stage 2: Local Coarse-grained Selection
7:      $\pi_{K_1} = \{g(\mathcal{D}' \cup \{d\}) \mid d \in \mathcal{D}_{K_1}\}$   // Compression ratios of the union of $\mathcal{D}'$ and each candidate $d$
8:      Update the corresponding values in $\pi_{\mathcal{D}}$ with $\pi_{K_1}$
9:      $\mathcal{D}_{K_2}$ = Bottom-K($\mathcal{D}_{K_1}$, $\pi_{K_1}$, $K_2$)   // Select $K_2$ samples with the lowest compression ratios
10:     // Stage 3: Local Fine-grained Selection
11:     Init $\mathcal{D}_{K_3}$ as an empty set $\Phi$
12:     while $|\mathcal{D}_{K_3}| < K_3$ do
13:         $\pi_{K_2} = \{g(\mathcal{D}_{K_3} \cup \{d\}) \mid d \in \mathcal{D}_{K_2}\}$
14:         $d = \operatorname{arg\,min}_d \pi_{K_2}$
15:         $\mathcal{D}_{K_3} = \mathcal{D}_{K_3} \cup \{d\}$
16:         $\mathcal{D}_{K_2} = \mathcal{D}_{K_2} \setminus \{d\}$
17:     end while
18:     // Update the selected dataset
19:     $\mathcal{D}' = \mathcal{D}' \cup \mathcal{D}_{K_3}$
20: end while

4.3 Local Fine-grained Selection

Although the above stage ensures that the candidate pool has distinct information from the selected set, the information redundancy among the samples within this pool is not measured. Thus, we aim to pick further samples from this subset that are diverse from each other. Concretely, we initialize a local selected set $\mathcal{D}_{K_3}$, and compute the compression ratio of the union of $\mathcal{D}_{K_3}$ and each sample in $\mathcal{D}_{K_2}$. We add the sample with the lowest compression ratio into $\mathcal{D}_{K_3}$, and remove it from $\mathcal{D}_{K_2}$. By repeating this process, we obtain a small subset that contains samples not only different from the selected set $\mathcal{D}'$ but also distinct from each other. We conduct the three stages above until the size of $\mathcal{D}'$ reaches a predefined data budget $m$. In our method, the entire selection process is quite easy to implement since it is model-free, and can be accelerated using multiple threads. It can select data efficiently and effectively from a large candidate pool for high-quality LLM training.
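Putting the three stages together, the following Python sketch illustrates the selection loop of Algorithm 1. It is an illustration under simplifying assumptions rather than the released implementation: samples are plain strings, zlib's DEFLATE stands in for the compressor $\mathcal{C}$, $g$ follows the Bits(D)/Bits($\mathcal{C}$(D)) form defined in Algorithm 1, and the default pool sizes k1, k2, k3 are arbitrary placeholders rather than the paper's settings.

```python
import zlib

def g(samples):
    """Compression ratio of a set of text samples: Bits(D) / Bits(C(D))."""
    raw = "\n".join(samples).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def bottom_k(indices, scores, k):
    """Indices of the k samples with the lowest scores."""
    return sorted(indices, key=lambda i: scores[i])[:k]

def zip_select(data, m, k1=1000, k2=200, k3=100):
    """Greedy multi-stage selection of m samples with a low joint compression ratio."""
    selected = []                                  # D'
    pi = {i: g([d]) for i, d in enumerate(data)}   # information redundancy state pi_D
    remaining = set(range(len(data)))

    while len(selected) < m and remaining:
        # Stage 1: global selection of K1 candidates with the lowest scores
        cand1 = bottom_k(remaining, pi, min(k1, len(remaining)))

        # Stage 2: coarse-grained selection -- score each candidate jointly with D'
        joint = {i: g(selected + [data[i]]) for i in cand1}
        pi.update(joint)                           # refresh the global state
        cand2 = bottom_k(cand1, joint, min(k2, len(cand1)))

        # Stage 3: fine-grained selection -- greedily build a mutually diverse batch
        batch, pool = [], list(cand2)
        while len(batch) < k3 and pool:
            best = min(pool, key=lambda i: g([data[j] for j in batch] + [data[i]]))
            batch.append(best)
            pool.remove(best)

        selected.extend(data[i] for i in batch)    # D' = D' U D_K3
        remaining.difference_update(batch)

    return selected[:m]

# Toy usage with arbitrary pool sizes.
corpus = [f"instruction {i % 7}: a response about item {i}" for i in range(500)]
subset = zip_select(corpus, m=50, k1=200, k2=60, k3=20)
print(f"selected {len(subset)} samples")
```

All calls to g here are CPU-only and independent across candidates, which is consistent with the note above that ZIP is model-free, runs entirely on CPUs, and can be accelerated with multiple threads.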

5 Experiments

ZIP is content-agnostic and model-free, making it suitable for various stages of LLM alignment. We systematically evaluate the effectiveness of ZIP through experiments conducted in the SFT and RLHF stages, as described in Sections 5.1 and 5.2, respectively. Subsequently, Section 5.3 presents an in-depth analysis to empirically support the proposed entropy law, including a practical application guided by this law.

5.1 Data Selection for SFT

5.1.1 Setup

Data Pool & Data Selection We follow DEITA (Liu et al., 2023) to establish a large-scale data pool comprising 300K high-quality samples obtained from WizardLM (Xu et al., 2023), ShareGPT (Chiang et al., 2023), and UltraChat (Ding et al., 2023). Subsequently, various data selection techniques are employed to extract a subset of this pool for LLM instruction tuning. Notably, previous studies controlled the data budget by limiting the number of instances, whereas we managed the total token count to ensure a fair allocation of the compute budget among all methods. To achieve this, we initially select 10,000 samples using ZIP and calculate the corresponding token count. Then, we apply other methods to continue data selection until the required token count is reached.
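The token-count budgeting described above reduces to a simple accumulation loop. The sketch below is illustrative only: whitespace splitting stands in for the actual model tokenizer, and the ranking passed to fill_token_budget is a placeholder for whatever ordering a given selection method produces; samples are taken in that order until the budget defined by the ZIP-selected set is exhausted.

```python
def count_tokens(sample, tokenize=str.split):
    """Token count of an (instruction, output) pair; whitespace splitting is a
    stand-in for the model tokenizer used in the actual experiments."""
    return len(tokenize(sample["instruction"])) + len(tokenize(sample["output"]))

def fill_token_budget(ranked_pool, token_budget):
    """Take samples in a method's preferred order until the token budget is reached."""
    chosen, used = [], 0
    for sample in ranked_pool:
        n = count_tokens(sample)
        if used + n > token_budget:
            break
        chosen.append(sample)
        used += n
    return chosen, used

# Toy usage: the budget comes from a (pretend) ZIP-selected subset, and a baseline
# then fills that same budget from its own ranking of the pool.
pool = [{"instruction": f"task {i}", "output": "text " * (i % 40 + 1)} for i in range(1000)]
budget = sum(count_tokens(s) for s in pool[:100])        # pretend these are ZIP's picks
ranked = sorted(pool, key=count_tokens, reverse=True)    # placeholder baseline ranking
subset, used = fill_token_budget(ranked, budget)
print(f"budget={budget} tokens, used={used} tokens, kept {len(subset)} samples")
```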

Training & Evaluation We fine-tune Mistral-7B (Jiang et al., 2023) and LLama-3-8B (Meta, 2020) on the selected dataset. Other training details can be found in Appendix B. As for evaluation, we adopt MT-bench(Zheng et al., 2023) as our benchmark. Specifically, MT-bench is a challenging multi-turn question set with LLM judgements to evaluate model responses, which exhibits a high-level human preferences alignment.

Baselines We select baselines from two aspects. The first group includes heuristic methods: (1) Random, which randomly selects instances from the data pool to verify the fundamental effectiveness of other methods; (2) Cluster, which adopts K-means clustering based on the sample representations and selects cluster centroids; (3) Perplexity, which selects the samples with the highest training loss. The second group of baselines includes model-based methods: (1) DEITA (Liu et al., 2023), which employs ChatGPT-generated data to train a Llama-based data complexity evaluator and a quality evaluator, and selects samples with the highest hybrid scores; (2) SuperFiltering (Li et al., 2024), which assesses each sample by calculating the change in response loss upon instruction removal and introduces a lightweight proxy model to calculate the score more efficiently.
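As an illustration of the Perplexity baseline described above, the following sketch (ours, not the authors' code) scores each sample by its mean next-token loss under a small proxy causal language model from Hugging Face transformers and keeps the highest-loss samples; gpt2 is used purely as a convenient stand-in proxy, not the model used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in proxy model for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sample_loss(text, max_length=512):
    """Mean next-token cross-entropy of one sample under the proxy model."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids
    return model(ids, labels=ids).loss.item()

def perplexity_select(samples, k):
    """Keep the k samples the proxy model finds hardest (highest loss)."""
    return sorted(samples, key=sample_loss, reverse=True)[:k]

pool = [  # toy candidate pool
    "The capital of France is Paris.",
    "Explain the difference between a list and a tuple in Python.",
    "Summarize a negotiation between two fictional interstellar trade delegations.",
]
print(perplexity_select(pool, k=2))
```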

5.1.2 Results

Table 1: Model performance comparison between different data selection baselines based on Mistral-7B and Llama-3-8B on the SFT stage. We also provide the computational cost, the average token length of selected data, and the average data quality produced by AlpaGasus (Chen et al., 2023). The best results are bolded, and the second-best numbers are underlined. † denotes CPU-only methods.
Model | MT-bench ↑ | Cost ↓ | Avg. length | Quality
Mistral-7B-based models with SFT
Random † | 6.85 | 10s | 976 | 4.08
Cluster | 6.91 | 15h | 970 | 4.05
Perplexity | 6.89 | 8h | 981 | 4.09
SuperFiltering | 6.12 | 14h | 1579 | 4.10
DEITA | 6.82 | 21h | 2048 | 4.03
ZIP † | 7.08 | 4.5h | 543 | 4.00
Llama-3-8B-based models with SFT
Random † | 7.16 | 10s | 892 | 4.08
Cluster | 7.18 | 16h | 886 | 3.95
Perplexity | 7.09 | 9h | 895 | 3.96
SuperFiltering | 6.59 | 14h | 1481 | 3.99
DEITA | 7.11 | 21h | 2048 | 4.09
ZIP † | 7.28 | 4.5h | 470 | 4.00

Main comparison We compare ZIP with various data selection methods based on Mistral-7B and Llama-3-8B, and the results are presented in Table 1. ZIP outperforms other data selection approaches on all backbones, which can be attributed to ZIP’s ability to model the complex combinatorial effects among samples. Furthermore, our observations indicate that model-based data selection methods often fail to produce satisfactory outcomes when a fixed token number is given. This is because the sample-level evaluations are not updated correspondingly after selecting some samples, leading to biased evaluations for the remaining samples. Additionally, some of these methods adopt strategies to enhance data diversity, such as DEITA, which controls the representation distances of selected samples. However, these strategies only provide a rough assessment of the combinatorial effects within the representation space, since semantic distances do not necessarily reflect information redundancy.

Selection bias in sample length across different strategies We also report the average length of tokenized samples in Table 1. The average token length of Random provides an estimate for the entire data pool, against which the other methods can be analyzed. From the table, we observe that Cluster and Perplexity exhibit selection preferences similar to Random. Additionally, DEITA and SuperFiltering predominantly select lengthy data samples. This bias may stem from LLMs' inclination toward generating longer responses (Saito et al., 2023). However, given the limited budget of selected tokens, choosing excessively lengthy data reduces the information density and degrades the capabilities of models trained on such data. In contrast, ZIP tends to select shorter samples. Furthermore, we plot the token length distribution of these methods in Figure 2 and Figure 6. Consistent with the previous results, we observe similar distributions for Random, Cluster, and Perplexity. The token length distributions of DEITA and SuperFiltering are severely skewed, deviating greatly from the original data distribution. In contrast to these model-based approaches, ZIP exhibits no bias toward selecting lengthy samples.
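For readers who want to reproduce the length statistics above, the following minimal sketch computes the average token length and a bucketed length histogram for a selected subset. The whitespace-based `tokenize` stand-in and the bucket size are illustrative assumptions rather than the paper's setup, which would rely on the backbone's own tokenizer.

```python
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Placeholder tokenizer (whitespace split); swap in the backbone's tokenizer for real counts.
    return text.split()

def length_stats(samples: list[str], bucket: int = 256) -> tuple[float, Counter]:
    lengths = [len(tokenize(s)) for s in samples]
    avg = sum(lengths) / max(len(lengths), 1)
    hist = Counter((n // bucket) * bucket for n in lengths)  # bucketed length histogram
    return avg, hist

if __name__ == "__main__":
    subset = ["short question, short answer", "a much longer multi-turn conversation " * 40]
    avg_len, hist = length_stats(subset)
    print(f"average token length: {avg_len:.1f}")
    print(sorted(hist.items()))
```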

Cost comparison of different strategies We provide a detailed cost analysis of each method in Table 1. Apart from Random, ZIP requires the least time to complete the data selection process, demonstrating greater efficiency than the other methods. Notably, ZIP's computations are executed entirely on CPUs, resulting in significant cost savings. Furthermore, ZIP is independent of the proprietary LLMs used by DEITA and of the proxy models employed by Cluster, Perplexity, and SuperFiltering. This model-free characteristic endows ZIP with notable efficiency and versatility.

Figure 2: The distribution of average token number across datasets selected by different algorithms for Mistral-7B. Panels: (a) ZIP, (b) Random, (c) Diversity, (d) Perplexity, (e) DEITA, (f) SuperFiltering.

Selected data quality of different strategies We follow Alpagasus (Chen et al., 2023) and evaluate the quality of each data sample in the selected datasets by prompting ChatGPT, with quality scores ranging from 0 to 5; the quality score of a multi-turn sample is the average score over its turns. The results are presented in Table 1. Surprisingly, the quality scores of the selected datasets are highly similar, despite the significant differences in selection mechanisms. This may suggest that the average quality distribution is relatively uniform in the original data pool. Notably, even SOTA model-based methods such as DEITA (Liu et al., 2023) and SuperFiltering (Li et al., 2024) select data with similar quality scores, potentially contradicting their original conclusions. We posit that this discrepancy stems from the setting of the data budget, which is controlled by the number of samples in prior studies. Considering the selection bias discussed above, these methods tend to select lengthy samples, resulting in a significantly higher token count than the baselines; for instance, under that setting, data selected by DEITA would contain 2.7 times as many tokens as data selected by ZIP. However, we argue that it is fairer to control the data budget by token count, since this guarantees a similar compute budget for all methods (in practical implementation, the training steps of all methods are almost equal thanks to the packing technique detailed in Axolotl).

5.2 Data Selection for RLHF

Table 2: Model performance comparison between different data selection baselines based on Llama-3-8B on the RLHF stage. We also provide the computational cost of each method.
| Model | MT-bench ↑ | Cost | Avg. length |
|---|---|---|---|
| Base | 7.18 | NA | NA |
| Random† | 7.33 | 5s | 464 |
| Score | 7.30 | NA | 489 |
| ZIP† | 7.42 | 1.1h | 357 |

5.2.1 Setup

Data Pool & Data Selection The data pool used for preference alignment is a cleaned version of UltraFeedback (Cui et al., 2023; Bartolome et al., 2023), which consists of around 60k samples in the form of "chosen-rejected" pairs. As in the SFT stage, we ensure each data selection method selects data with an approximately equal token count. Since a "chosen-rejected" pair encompasses two data points, we first select 5,000 data pairs with ZIP and then apply the other methods to select data under the corresponding token budget.
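As a hedged illustration of the token-budget matching described above, the sketch below sums the token counts of the ZIP-selected pairs and then lets another ranking-based strategy (e.g., the Score baseline introduced later) greedily add its top-ranked pairs until the same budget is exhausted. The `count_tokens` helper and the field names (`chosen`, `rejected`, `chosen_score`) are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable

def count_tokens(pair: dict) -> int:
    # Placeholder: a real run would apply the backbone's tokenizer to both responses.
    return len(pair["chosen"].split()) + len(pair["rejected"].split())

def token_budget(zip_pairs: list[dict]) -> int:
    # Total token count of the 5,000 pairs chosen by ZIP; other methods must match it.
    return sum(count_tokens(p) for p in zip_pairs)

def select_with_budget(pool: list[dict], rank: Callable[[dict], float], budget: int) -> list[dict]:
    """Greedily take the highest-ranked pairs until the shared token budget is used up."""
    selected, used = [], 0
    for pair in sorted(pool, key=rank, reverse=True):
        cost = count_tokens(pair)
        if used + cost > budget:
            break
        selected.append(pair)
        used += cost
    return selected

# Example (hypothetical fields): the Score baseline ranks pairs by their LLM-judged chosen score.
# score_subset = select_with_budget(pool, rank=lambda p: p["chosen_score"], budget=token_budget(zip_pairs))
```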

Figure 3: Entropy law demonstration of Mistral-7B. The entropy law curve is fitted to the results of different methods. Panels: (a) entropy law w.r.t. compression ratio; (b) entropy law w.r.t. training loss.

Figure 4: Entropy law curve of Llama-3-8B. The entropy law curve is fitted to the results of different methods. Panels: (a) entropy law w.r.t. compression ratio; (b) entropy law w.r.t. training loss.

Training & Evaluation Building upon the model previously fine-tuned with SFT, we further refine it with RLHF. In particular, we employ Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024), a novel method that shows promising potential for preference alignment. Additional training details can be found in Appendix B. For evaluation, we continue to use MT-bench (Zheng et al., 2023) as our benchmark to assess the capabilities of LLMs fine-tuned with data selected by the different strategies.

Baselines We compare ZIP with the following baselines: (1) Random, which randomly samples "chosen-rejected" pairs from the data pool; (2) Score, which selects the "chosen-rejected" pairs with the highest chosen scores, obtained through LLM evaluation of response quality (Cui et al., 2023; Bartolome et al., 2023).

5.2.2 Main results

Table 2 presents the results of different data selection strategies at the preference alignment stage of LLMs. Similar to the SFT stage, models aligned with data selected by ZIP yield the best downstream performance, demonstrating the necessity of modeling combinatorial effects. Besides, we find that Score and Random are on par with each other, even though the selection process of Score is far more expensive than that of Random. This is unsurprising, as Score does not consider combinatorial effects, which may limit the amount of knowledge in the selected dataset.

5.3 Empirical Validation of Entropy Law

Figure 5: Practical application of the entropy law in incremental training data updates, where $x_1, x_2, x_3, x_4, x_5$ are five data versions.

In this section, we aim to demonstrate the proposed entropy law. Specifically, we plot the model performance of Mistral-7B and Llama-3-8B with respect to the data compression ratio and the training loss in Figures 3 and 4, respectively. Besides, we plot entropy-law curves by fitting the results. From the two figures, we can draw the following analysis:

Relationship between model performance, data compression ratio, and training loss In Figures 3(a) and 4(a), LLMs trained on data with a lower compression ratio typically exhibit better performance. Since the learning process of LLMs is highly relevant to information compression, we can regard LLMs as data compressors. Data with a lower compression ratio then carries a higher amount of knowledge, which is more beneficial to the compressors. Besides, a lower compression ratio usually corresponds to a higher training loss, as illustrated in Figures 3(b) and 4(b). This is because data resistant to compression carries more knowledge, posing a greater challenge for LLMs to absorb the encapsulated knowledge.
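To make the compression-ratio axis of Figures 3(a) and 4(a) concrete, here is a minimal sketch of how the redundancy of a candidate dataset can be measured with an off-the-shelf compressor. The choice of zlib and the definition of the ratio as raw bytes over compressed bytes are illustrative assumptions, not necessarily the paper's exact measurement setup.

```python
import zlib

def compression_ratio(texts: list[str]) -> float:
    # Ratio of raw bytes to compressed bytes: higher means more redundant,
    # lower means denser knowledge per byte.
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

redundant = ["The capital of France is Paris."] * 200
diverse = [f"Fact {i}: {i} squared is {i * i}." for i in range(200)]
print(compression_ratio(redundant))  # large ratio: highly redundant subset
print(compression_ratio(diverse))    # smaller ratio: more information per byte
```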

Model performance interpretation with entropy law For the three methods with comparable compression ratios and training losses, namely Random, Cluster, and Perplexity, the corresponding model performances are close. This phenomenon may seem counter-intuitive, given the distinct criteria used for data selection. However, it aligns with the predictions of our proposed entropy law: when the average data quality, training loss, and data compression ratio are similar, the model performance is expected to be comparable as well. Thus, the entropy law has the potential to serve as a criterion for predicting a model's performance on given data, thereby guiding the training of LLMs.

Practical application of entropy law Incremental version updates of training data are a common setting in practical LLM development. Usually, the amount of training data remains relatively stable, with only a minor portion undergoing modification. We have conducted incremental training-data update experiments in real scenarios, with results depicted in Figure 5. Due to confidentiality, only the relative order of the results is provided. Guided by the entropy law, assuming the data quality $Q$ does not significantly decay after each incremental update, we can expect model performance to improve as the data compression ratio decreases. This expectation is supported by the results for data versions $x_1$ to $x_4$ in Figure 5, whose compression ratios decrease after each incremental update. However, data version $x_5$ exhibits an abnormal increase in both the loss and the data compression ratio, which serves as an early indicator of potential model performance degradation caused by a decline in training data consistency. This prediction is further confirmed by subsequent post-training model performance evaluations, as illustrated in Figure 5. Thus, the entropy law can be utilized as a guideline for LLM training to identify potential risks of experimental failure without training the model on the full dataset until convergence. This is particularly significant given the substantial costs associated with training an LLM.
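The following is a hedged sketch of the early-warning check described above: recompute a compression ratio for each incremental data version and flag a version whose ratio rises against the trend before committing to a full training run. The zlib-based ratio and the tolerance threshold are illustrative assumptions.

```python
import zlib

def compression_ratio(texts: list[str]) -> float:
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def check_versions(versions: dict[str, list[str]], tolerance: float = 0.02) -> None:
    """Warn when a new data version compresses noticeably more easily than its predecessor."""
    prev_name, prev_ratio = None, None
    for name, data in versions.items():
        ratio = compression_ratio(data)
        if prev_ratio is not None and ratio > prev_ratio * (1 + tolerance):
            print(f"WARNING: {name} is more compressible than {prev_name} "
                  f"({prev_ratio:.3f} -> {ratio:.3f}); inspect before full training.")
        else:
            print(f"{name}: compression ratio {ratio:.3f}")
        prev_name, prev_ratio = name, ratio

# e.g. check_versions({"x1": v1_texts, "x2": v2_texts}) before launching a costly training run
```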

6 Conclusion

In this paper, we delve into the data selection problem from a data compression perspective. Inspired by the insight that language modeling performs information compression, we propose an entropy law delineating the coherent relationship between model performance, data compression ratio, and training loss. Theoretically guided by the entropy law, we propose a new data selection algorithm, ZIP, which selects data with a nearly minimal compression ratio and is model-free and content-agnostic, rendering it significantly lightweight and versatile. Experimental results have demonstrated the effectiveness and efficiency of ZIP across various LLM backbones during the SFT and RLHF stages. Further in-depth analysis provides empirical evidence for the entropy law, which can serve as a criterion for predicting LLM performance on specific data.

References

• Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
• Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. A Survey on Data Selection for Language Models. arXiv:2402.16827 [cs.CL].
• Alvaro Bartolome, Gabriel Martin, and Daniel Vila. 2023. Notus. https://github.com/argilla-io/notus.
• Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. AlpaGasus: Training a better Alpaca with fewer data. arXiv preprint arXiv:2307.08701.
• Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023).
• Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
• Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
• Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2023. Language Modeling Is Compression. CoRR abs/2309.10668.
• Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. In EMNLP. Association for Computational Linguistics, 3029–3051.
• Qianlong Du, Chengqing Zong, and Jiajun Zhang. 2023. MoDS: Model-oriented data selection for instruction tuning. arXiv preprint arXiv:2311.15653.
• Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
• GitHub. 2020. GitHub Copilot. https://github.com/features/copilot/
• Yuzhen Huang, Jinghan Zhang, Zifei Shan, and Junxian He. 2024. Compression Represents Intelligence Linearly. arXiv preprint arXiv:2404.09937.
• Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew E. Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2. CoRR abs/2311.10702.
• Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR abs/2310.06825.
• Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. OpenAssistant Conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems 36 (2024).
• Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. 2024. Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning. CoRR abs/2402.00530.
• Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. CoRR abs/2308.12032.
• Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. 2023. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. CoRR abs/2312.15685.
• Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. In The Twelfth International Conference on Learning Representations.
• Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. 2024. Augmenting large language models with chemistry tools. Nature Machine Intelligence (2024), 1–11.
• Meta. 2020. Llama3. https://ai.meta.com/blog/meta-llama-3/
• Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
• Rohan Pandey. 2024. gzip Predicts Data-dependent Scaling Laws. arXiv preprint arXiv:2405.16684.
• Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction Tuning with GPT-4. CoRR abs/2304.03277.
• Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.
• Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
• Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076.
• Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 3 (1948), 379–423.
• Claude E. Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal 30, 1 (1951), 50–64.
• Ming Shen. 2024. Rethinking Data Selection for Supervised Fine-Tuning. CoRR abs/2402.06094.
• Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model.
• Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023a. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. In NeurIPS.
• Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In ACL (1). Association for Computational Linguistics, 13484–13508.
• Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705.
• Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023c. Data management for large language models: A survey. arXiv preprint arXiv:2312.01700.
• Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for Training Language Models. arXiv preprint arXiv:2402.09739.
• Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S. Liang. 2023. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
• Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering Large Language Models to Follow Complex Instructions. CoRR abs/2304.12244.
• Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRR abs/2303.18223.
• Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS.
• Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less Is More for Alignment. In NeurIPS.

Appendix A Derivations of joint mutual information of two QA pairs

$$
\begin{aligned}
I(q_1 q_2; a_1 a_2) &= H(a_1 a_2) - H(a_1 a_2 \mid q_1 q_2) \\
&= H(a_1 a_2) - H(a_1 \mid q_1 q_2) - H(a_2 \mid a_1 q_1 q_2) \\
&= H(a_1 a_2) - H(a_1 \mid q_1) - H(a_2 \mid q_2) \\
&\le H(a_1) + H(a_2) - H(a_1 \mid q_1) - H(a_2 \mid q_2) \\
&= I(q_1; a_1) + I(q_2; a_2).
\end{aligned}
\tag{7}
$$

The equality is achieved when $a_1$ is independent of $a_2$ (and, similarly, $q_1$ is independent of $q_2$).

Appendix B Training Details

Platform All experiments were conducted on a platform with 64 Intel Xeon Gold 6326 CPU cores @ 2.90GHz, two mainstream high-performance GPUs, and 500GB of memory. The training code is based on the popular open-source framework Axolotl (https://github.com/OpenAccess-AI-Collective/axolotl).

Data preprocessing To format the multi-turn conversation data, we adopt the Vicuna-style template for Mistral-7B and the Llama-3 template for Llama-3-8B. Samples longer than the maximum input sequence length will be truncated. Besides, the data will be packed to speed up training for SFT.
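As a rough illustration of the packing step mentioned above, the sketch below greedily concatenates tokenized samples into blocks of at most 2048 tokens (the SFT input length used here). It is a simplified stand-in for the packing implemented in Axolotl, not its actual code.

```python
def pack_samples(tokenized: list[list[int]], block_size: int = 2048) -> list[list[int]]:
    """Greedily concatenate tokenized samples into blocks of at most `block_size` tokens."""
    blocks, current = [], []
    for ids in tokenized:
        ids = ids[:block_size]  # truncate over-long samples, mirroring the preprocessing above
        if current and len(current) + len(ids) > block_size:
            blocks.append(current)
            current = []
        current = current + ids
    if current:
        blocks.append(current)
    return blocks

# e.g. pack_samples([[1] * 700, [2] * 900, [3] * 600]) yields two blocks instead of three padded ones
```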

Hyper-parameters For ZIP, the selection numbers $K_1$, $K_2$, and $K_3$ are set to 10000, 200, and 100, respectively. As for SFT, we share these hyper-parameters for all backbones: the training batch size is 128, the number of training epochs is 4, the input sequence length is 2048, and the warm-up ratio is 0.1. We adopt different learning rates for each backbone: the learning rate of Mistral-7B is set to 4e-6, and the learning rate of Llama-3-8B is set to 1e-5. As for RLHF, the learning rate for KTO is set to 1e-6, and the batch size is set to 128.

Appendix C Token length distribution of more backbones

The token length distribution of the data selected for Llama-3-8B is depicted in Figure 6 and is similar to that of Mistral-7B.

Figure 6: The distribution of average token number across datasets selected by different algorithms for Llama-3-8B. Panels: (a) ZIP, (b) Random, (c) Diversity, (d) Perplexity, (e) DEITA, (f) SuperFiltering.

Appendix D Hyper-parameter sensitivity

Figure 7: Model performance w.r.t. different hyper-parameters. Panels: (a) model performance w.r.t. $K_1$; (b) model performance w.r.t. $K_2$; (c) model performance w.r.t. $K_3$.

ZIP involves three hyper-parameters $K_1$, $K_2$, and $K_3$ for improved efficiency. We aim to investigate the impact of these hyper-parameters on the model performance, with results depicted in Figure 7.

Perceived sample number in global selection $K_1$  $K_1$ decides the number of samples whose scores are updated in the global selection stage. We set $K_1$ within the range [200, 1000, 10000, 20000], and the results are presented in Figure 7(a). The model performance exhibits an increasing trend as $K_1$ increases. When a smaller $K_1$ is specified, ZIP is only exposed to a limited set of samples, which can cause it to degenerate into a variant that consistently selects samples based on individual compression ratios, neglecting the modeling of combinatorial effects. Furthermore, the compression ratio associated with the currently selected dataset typically increases with each update, whereas the compression ratios of the other samples remain unchanged. Consequently, a large $K_1$ may result in the compression ratios of the un-updated samples being underestimated, leading to the selection of inferior samples. As a result, a model performance degradation can be observed when $K_1$ is set to 20,000.

Data pool size of local selection $K_2$  $K_2$ decides the number of samples selected from the previous $K_1$ samples. We set $K_2$ within the range [100, 200, 500, 1000], and the results are presented in Figure 7(b). The model performance increases with a larger $K_2$, which aligns with intuition since the algorithm can consider the combinatorial effects of more samples. But when $K_2$ exceeds a threshold, the model performance reaches a saturation phase, which indicates similar local selection results even with an increased local data budget.

Data budget of local selection $K_3$  $K_3$ decides the number of samples selected from the previous $K_2$ samples. We set $K_3$ within the range [50, 100, 150, 200], and the results are presented in Figure 7(c). The results exhibit a trend similar to that of $K_1$, yet the underlying causes are inverse. A large $K_3$ makes ZIP degenerate into a trivial variant that consistently selects samples based on individual compression ratios. On the other hand, a small $K_3$ leads to more frequent compression-ratio updates, which can also lead to underestimated compression ratios for some inferior samples.
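Putting the three hyper-parameters together, the sketch below approximates the multi-stage selection loop they parameterize: a global stage that refreshes the scores of $K_1$ candidates, a local pool of the $K_2$ best refreshed candidates, and a greedy local stage that adds up to $K_3$ samples per iteration. It uses zlib as the compressor and counts the budget in samples rather than tokens, so it should be read as an illustrative approximation rather than the exact ZIP implementation.

```python
import zlib

def ratio(texts: list[str]) -> float:
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def zip_select(pool: list[str], budget: int, k1: int = 10000, k2: int = 200, k3: int = 100) -> list[str]:
    selected: list[str] = []
    remaining = set(range(len(pool)))
    scores = {i: ratio([pool[i]]) for i in remaining}  # initially: individual compression ratios
    while len(selected) < budget and remaining:
        # Global stage: refresh the scores of the K1 most promising candidates against the
        # currently selected set; scores of all other samples stay stale (the trade-off K1 controls).
        cand = sorted(remaining, key=scores.__getitem__)[:k1]
        for i in cand:
            scores[i] = ratio(selected + [pool[i]])
        # Local pool: keep the K2 best refreshed candidates.
        local = sorted(cand, key=scores.__getitem__)[:k2]
        # Local greedy stage: add up to K3 samples, each time picking the one that keeps the
        # joint compression ratio of the selected set lowest.
        for _ in range(min(k3, budget - len(selected))):
            if not local:
                break
            best = min(local, key=lambda i: ratio(selected + [pool[i]]))
            selected.append(pool[best])
            local.remove(best)
            remaining.discard(best)
    return selected
```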