License: CC BY 4.0
arXiv:2405.10938v2  02 Jul 2024

Observational Scaling Laws and
the Predictability of Language Model Performance

Yangjun Ruan1,2,3
yjruan@cs.toronto.edu
Chris J. Maddison2,3
cmaddis@cs.toronto.edu
Tatsunori Hashimoto1
thashim@stanford.edu
   1Stanford University  2University of Toronto  3Vector Institute

Abstract

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.

1 Introduction

Language model (LM) scaling plays a central role in discussions of model capabilities and affects everything from the tasks they can perform to the effectiveness of post-training techniques such as Chain-of-Thought [91]. Due to this importance, understanding and predicting LM behaviors across scales, benchmarks, and algorithmic interventions is a major question for many researchers and engineers. Machine learning researchers may wish to understand whether their proposed algorithmic interventions remain effective in the face of future model scaling, while engineers and benchmark builders may wish to understand whether complex capabilities such as agentic abilities will scale predictably in the same way as existing LM benchmarks.

Scaling laws [34, 42, 7, 35, 61] have been powerful tools for understanding the scaling trend of LMs, which have shown that LMs follow a precise power-law relationship between compute measures (such as training FLOPs) and downstream capabilities ranging from perplexity [42, 35] to benchmark performance [32, 33]. This power-law relationship has been used in a variety of ways – including hyperparameter and architecture selection [42, 35, 10] as well as model capability forecasting [24, 62, 63]. Unfortunately, scaling analyses remain uncommon in many benchmarking and post-training studies, as most researchers do not have the compute resources to build scaling laws from scratch, and open models are trained at too few scales (3-5) for reliable scaling predictions.

Although the high costs of compute scaling laws are unavoidable when optimizing pre-training hyperparameters (e.g., Hoffmann et al. [35]), this is not true of all scaling analyses. In this work, we show that many other types of scaling analyses, such as understanding complex model capabilities (e.g., agentic or “emergent” behaviors) and post-training interventions, can be done using a lower-cost, higher-resolution, and broader-coverage alternative to the standard approach of training (or using) a single family of LMs across compute scales.

Figure 1: Observational scaling laws generalize existing compute scaling laws, which directly relate training compute to downstream capabilities (dashed line), by hypothesizing the existence of a low-rank space of LM capabilities that have a log-linear relationship with compute (center) and can be extracted directly from standardized LM benchmarks (left). This enables us to obtain low-cost, high-resolution scaling predictions of LMs’ complex downstream capabilities from their observable standard benchmark metrics using nearly 80 publicly accessible LMs (left to right).

The starting point of our work is the observation that there now exist hundreds of open models spanning a large range of scales and capabilities. While we cannot directly use these models for compute scaling laws (as the training compute efficiency varies widely across model families), we might hope that there exists a more general scaling law that holds across model families. In particular, we hypothesize that the downstream performance of an LM is a function of a low-dimensional space of capabilities (e.g., natural language understanding, reasoning, and code generation), and that model families vary only in the efficiency by which they convert training compute to these capabilities. If such a relationship held, it would imply that there is a log-linear relationship from low-dimensional capabilities to downstream capabilities across model families (which would allow us to build scaling laws that leverage all existing models), as well as a log-linear relationship between training compute and capabilities within each model family (as in standard compute scaling) (Fig. 1).

Through an analysis of existing standardized LM benchmarks (e.g., the Open LLM Leaderboard [9]), we find a few such capability measures that have scaling-law relationships with compute within model families ($R^2 > 0.9$) (Fig. 3), and with downstream metrics across model families. We call such scaling relationships observational scaling laws, as they enable predictions of complex downstream capabilities from simple observable quantities that we expect to scale with compute (like standardized benchmark performance).

The ability to build scaling laws across a large number of existing LMs from their standard benchmark metrics has significant advantages in cost, resolution, and coverage: Observational scaling incurs no training cost, while leveraging models spanning a much larger compute range than any single model family. It also significantly increases the resolution of scaling laws by virtue of using more models, which is useful for studying nearly discontinuous phenomena like “emergent” capabilities. Finally, observational scaling can combine model families from heterogeneous sources with very different scaling properties (e.g., LLaMA [83] vs StarCoder [46]) which allows us to study how different scaling strategies impact downstream performance and algorithmic interventions.

Finally, we show that using observational scaling laws is low-cost and straightforward, as there are a few model families that are sufficiently representative to replicate many of our core findings (Sec. 5). By using these representative families, we find that future works can easily make scaling predictions on benchmarks and post-training interventions by evaluating only 10-20 models.

We demonstrate the utility of observational scaling laws in three settings that are challenging for compute scaling laws but are accurately predicted by the observational approach. While our results are based on systematic holdout validation with currently available models, we preregister our fitted scaling laws and commit to updating their prediction accuracy on future models (Sec. 4).

Emergent capabilities (Sec. 4.1)  There has been an active debate about whether LMs have “emergent” capabilities that discontinuously appear at certain compute thresholds and whether these capabilities can be predicted using small models [90, 74, 56, 36]. The high resolution of observational scaling laws shows that some of these phenomena follow a smooth sigmoid, and can be predicted accurately using small, sub Llama-2 7B models.

Agentic capabilities (Sec. 4.2)  We show that the higher-level, more complex capabilities of LMs as agents, as measured by AgentBench [54] and AgentBoard [57], can be predicted with simple benchmark metrics. Our scaling law precisely predicts the performance of GPT-4 using only weaker models (sub GPT-3.5) and identifies programming capabilities as driving agent performance.

Post-training interventions (Sec. 4.3)  We show that our scaling laws can reliably predict the gains of post-training techniques, such as Chain-of-Thought [91] and Self-Consistency [89] at scale, even when we fit our scaling laws on weak models (sub Llama-2 7B).

The contributions of our work are as follows: our conceptual contribution is to propose observational scaling, which leverages predictable log-linear relationships between compute, simple capability measures, and complex downstream metrics. Our empirical contributions include identifying a small number of capability measures that cover standard LM benchmarks, demonstrating that these measures provide accurate predictions on a number of complex LM capabilities, and selecting a small set of model families that are useful for low-cost observational scaling analyses.

2 Related Work

Compute scaling laws

In standard scaling laws [34, 42, 32, 7, 33, 35, 61], the “scale” is defined by the compute resources allocated to training LMs, such as the number of training FLOPs $C$, model parameters $N$, and training tokens $D$. Scaling laws are typically formulated as a power-law relationship between LMs’ cross-entropy loss $L$ and their compute scale measures. Common functional forms include $L(N,D)=\frac{a}{N^{\alpha}}+\frac{b}{D^{\beta}}+e$ [35, 61] or $L(C)=\frac{c}{C^{\gamma}}+h$ [42, 32], where $C \approx 6ND$ [42] for the Transformer [85]. The parameters $\{\alpha,\beta,a,b,e\}$ or $\{\gamma,c,h\}$ are fitted by training LMs across different compute scales, varying $N$ and/or $D$, and measuring their loss. Our work differs from compute scaling laws in our goals: compute scaling aims to understand the scaling properties of pretraining, and thus focuses on a single model family and relates downstream performance to directly controllable quantities such as training compute. In contrast, we are interested in scaling laws for downstream, post-training performance, which leads us to consider scaling laws across model families and to use more directly observable capability measures than compute.
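As a concrete illustration, the two functional forms above can be written out directly. The default constants below follow the published Chinchilla fit of Hoffmann et al. [35] but should be treated as illustrative rather than authoritative; this is a sketch, not the authors' code:

```python
def chinchilla_loss(N, D, a=406.4, alpha=0.34, b=410.7, beta=0.28, e=1.69):
    """L(N, D) = a / N^alpha + b / D^beta + e (constants approximate)."""
    return a / N**alpha + b / D**beta + e

def train_flops(N, D):
    """C ~ 6 * N * D for a standard Transformer [42]."""
    return 6 * N * D

# e.g., a 70B-parameter model trained on 1.4T tokens
loss = chinchilla_loss(70e9, 1.4e12)
flops = train_flops(70e9, 1.4e12)
```

Note that $e$ acts as an irreducible loss floor, while the two power-law terms decay as parameters and data grow.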

Downstream scaling laws

Scaling laws have been generalized beyond pretraining loss to analyze transfer learning [33, 78, 1] and downstream performance [32, 29, 15] across various domains; see Villalobos [86] for a comprehensive review. In particular, there has been evidence suggesting that the few-shot performance of LMs on downstream benchmarks is closely tied to compute measures like model size [13], but whether this is predictable with scaling laws remains debated. Extensive research has explored the difficulties of predicting benchmark performance due to its apparently rapid “emergence” [90, 76, 26], while recent works argued the discontinuity is due to the metrics used [74, 56] or the lack of data points [36] (see Anwar et al. [5] for a survey on this topic). Finnveden [24] and Owen [63] have investigated the use of linear and sigmoidal scaling laws, derived from pretraining loss or compute measures, to extrapolate benchmark performance. Notably, Owen [63] also utilized publicly available LMs from different families to fit compute scaling laws despite the potential discrepancies in their compute efficiencies. Recent studies have also more extensively investigated the correlations between the pretraining loss and downstream performance of LMs [93, 37], aiding in the understanding of downstream scaling [25] and emergent capabilities [23] of LMs. On the theory front, Arora and Goyal [6] derived a theory characterizing how performance on complex skills of LMs can be derived as a composition of base skills. While our work shares similar goals in that we aim to understand the downstream, post-training performance of models, we differ in our approach in that we aim to build practical, higher-resolution scaling laws using multiple model families and their observable standard benchmark metrics.

Correlations between benchmarks

Numerous works have investigated the correlations between different benchmarks across various contexts. Extensive research has explored the relationship between the out-of-distribution performance and in-distribution performance of machine learning models [77, 59, 68, 96, 69]. In the realm of NLP and LM benchmarks, Qiu et al. [67], Torregrossa et al. [82] found that different evaluations and metrics for word embeddings are highly correlated, and Liu et al. [53] observed a strong correlation between question-answering benchmarks. Moreover, Perlitz et al. [64], Polo et al. [65] observed strong correlations between samples within various LM benchmarks and utilized this observation to develop more efficient benchmarks. Most relevant to our work, Ilić [38] found that a single factor explains 85% of the performance on the Open LLM Leaderboard [9] and GLUE leaderboard [87], while Burnell et al. [14] extracted three factors for LM capabilities that account for 82% of the variation on the HELM benchmark [49], aligning with our observations. Our work also observes such benchmark correlations and low-rank structures but is unique in utilizing these properties for the purpose of scaling predictions that can be used directly for benchmark and algorithm development.

3 Observational Scaling Laws

In this section, we introduce our observational scaling laws, which generalize the standard compute scaling laws (Sec. 3.1). The key idea is to extract a low-dimensional capability measure for LMs from their observable benchmark performance (Fig. 2), which we find has a log-linear relationship with compute scale measures (Sec. 3.3) and can thus be used as a surrogate “scale” for scaling analysis of complex LM capabilities (Sec. 3.4).

3.1 Generalizing Compute Scaling Laws

Standard compute scaling

In compute scaling laws, there is a hypothesized power-law relationship between models’ compute measures $C_m$ (e.g., training FLOPs) and their errors $E_m$ (e.g., perplexity). Specifically, for a model $m$ within a family $f$ (e.g., Llama-2 7B, 13B, and 70B) we hypothesize

$\log(E_m) \approx \beta_f \log(C_m) + \alpha_f$,   (1)

and if this linear fit is sufficiently accurate, we draw inferences about the performance of a model at future compute scales $C' > C$ by extrapolating this relationship. However, fitting such a scaling law can be tricky, as each model family $f$ and downstream benchmark has its own scaling coefficients $\beta_f$ and $\alpha_f$. This means that scaling experiments, especially for post-training analysis, are often fitted on very few (3-5) models sharing the same model family, and any predictions are valid only for a specific scaling strategy used within a model family.
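Fitting Eq. 1 for a single family reduces to ordinary least squares in log-log space. A minimal sketch with synthetic (compute, error) pairs, not real measurements:

```python
import numpy as np

# Hypothetical (training FLOPs, error) pairs for one model family f.
C = np.array([1e20, 1e21, 1e22, 1e23])
E = np.array([8.0, 5.6, 4.0, 2.8])

# Fit log(E_m) ~ beta_f * log(C_m) + alpha_f (Eq. 1) by least squares.
beta_f, alpha_f = np.polyfit(np.log(C), np.log(E), deg=1)

# Extrapolate to a future compute scale C' > C.
E_pred = np.exp(beta_f * np.log(1e24) + alpha_f)
```

The fitted slope `beta_f` is negative (error decreases with compute), and the extrapolated point continues the power-law trend.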

Several studies [e.g., 24, 63] have generalized this functional form to analyze the scaling of LMs’ downstream performance. Specifically, letting $E_m$ represent the normalized downstream errors of models within the range $[0,1]$, they observed a sigmoidal relationship between $\log(C_m)$ and $E_m$ and thus used a logistic link function instead of a logarithm for the generalized linear model in Eq. 1:

$\sigma^{-1}(E_m) \approx \beta_f \log(C_m) + \alpha_f$,   (2)
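Concretely, Eq. 2 is a generalized linear model with a logistic link, so the fit is again linear after transforming the normalized errors. A sketch with synthetic values:

```python
import numpy as np

def logit(p):
    # sigma^{-1}(p) for the logistic function sigma(x) = 1 / (1 + exp(-x))
    return np.log(p / (1 - p))

# Hypothetical normalized downstream errors E_m in (0, 1) across scales.
C = np.array([1e20, 1e21, 1e22, 1e23])
E = np.array([0.9, 0.75, 0.5, 0.25])

# Fit sigma^{-1}(E_m) ~ beta_f * log(C_m) + alpha_f (Eq. 2).
beta_f, alpha_f = np.polyfit(np.log(C), logit(E), deg=1)

# Map a prediction at a larger scale back through the sigmoid.
E_pred = 1 / (1 + np.exp(-(beta_f * np.log(1e24) + alpha_f)))
```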
Observational scaling

In our work, we hypothesize the existence of a low-dimensional capability measure for LMs that relates compute to more complex LM capabilities and can be extracted from observable standard LM benchmarks, as illustrated in Fig. 1. Specifically, given $T$ simple benchmarks and $B_{i,m}$ the error of a model $m$ on benchmark $i \in [T]$, we hypothesize that there exists some capability vector $S_m \in \mathbb{R}^K$ such that,

$\sigma^{-1}(E_m) \approx \beta^{\top} S_m + \alpha$   (3)
$S_m \approx \theta_f \log(C_m) + \nu_f$   (4)
$B_{i,m} \approx \gamma_i^{\top} S_m$.   (5)

for $\theta_f, \nu_f, \beta \in \mathbb{R}^K$, $\alpha \in \mathbb{R}$, and orthonormal vectors $\gamma_i \in \mathbb{R}^K$.
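Under this model, fitting Eq. 3 amounts to a linear regression of logit-transformed errors on the capability vectors. A self-contained sketch on synthetic data (all values here are fabricated for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic capability vectors S_m for 77 models with K = 3 dimensions.
S = rng.normal(size=(77, 3))

# Generate downstream errors from Eq. 3 with known (assumed) parameters.
beta_true = np.array([-1.0, -0.5, -0.25])
alpha_true = 0.3
E = 1 / (1 + np.exp(-(S @ beta_true + alpha_true)))

# Recover beta and alpha by least squares on sigma^{-1}(E_m).
X = np.hstack([S, np.ones((77, 1))])          # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, np.log(E / (1 - E)), rcond=None)
beta_hat, alpha_hat = coef[:3], coef[3]
```

Because the synthetic errors are noiseless, the regression recovers the generating parameters exactly; on real benchmark data the fit is approximate.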

We can view Eq. 3 and Eq. 4 as a generalization of Eq. 2, since combining them recovers the original scaling relationship for a single model family. However, when there are multiple model families, $S_m$ serves as a shared, low-dimensional space of model capabilities from which all downstream metrics ($E$ and $B$) are derived (as indicated by the absence of $f$ in Eq. 3 and Eq. 5), and model families only vary in their efficiency in converting compute into capabilities (Eq. 4). One useful way of interpreting Eq. 4 is that $\theta_f$ represents the compute efficiency of a model family $f$, and $S_m$ is the capabilities of model $m$ expressed in terms of log-FLOPs for this model family.

Finally, Eq. 5 ensures that these capabilities are not latent variables to be estimated for each model family, but are instead functions of fully observable properties ($B$). Since $\gamma \in \mathbb{R}^{K \times T}$ is orthonormal, we can linearly estimate $\hat{S}_m := \gamma B_m$, which makes our scaling analysis significantly more robust. Importantly, this enables us to apply this approach to a large number of public models from heterogeneous sources, including proprietary ones without any public information on $C$, such as GPT-4.
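Because $\gamma$ is orthonormal, extracting $\hat{S}_m = \gamma B_m$ is exactly a PCA projection of the benchmark-error matrix. A sketch on synthetic low-rank data (shapes loosely mirror the paper's setting of ~77 models; the numbers are fabricated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 77 models, T = 8 benchmarks, K = 3 latent capabilities.
S_true = rng.normal(size=(77, 3))
gamma = np.linalg.qr(rng.normal(size=(8, 3)))[0]        # orthonormal loadings
B = S_true @ gamma.T + 0.01 * rng.normal(size=(77, 8))  # Eq. 5 plus noise

# PCA via SVD of the centered matrix: the top-K right singular vectors play
# the role of gamma, and the projections are the capability estimates.
Bc = B - B.mean(axis=0)
U, svals, Vt = np.linalg.svd(Bc, full_matrices=False)
S_hat = Bc @ Vt[:3].T                    # estimated "principal capabilities"
explained = (svals[:3] ** 2).sum() / (svals ** 2).sum()
```

With near-low-rank data, the top three components explain almost all of the variance, mirroring the empirical finding in Fig. 2.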

At this point, it is not yet clear that Equations 3, 4, and 5 hold in practice. In the next subsections, we validate Eq. 5 (Fig. 2) and Eq. 4 (Sec. 3.3) separately, and then present our estimation algorithm for Eq. 3 in Sec. 3.4. In Sec. 4, we perform a more extensive validation of Eq. 3.

3.2 Identifying a Low-Dimensional Capability Space (Eq. 5)

(a) PCA explained variance
(b) Principal component weights
Figure 2: Just a few capability dimensions explain most variability on a diverse range of standard LM benchmarks. We find that (a) the benchmark-model matrix is low-dimensional, with the top 3 PCs explaining $\sim 97\%$ of the variance, and (b) the PCs are interpretable: PC-1, PC-2, and PC-3 emphasize LMs’ general, reasoning, and programming capabilities, respectively.

We validate the existence of a low-dimensional capability measure $S$ that linearly relates to standard LM benchmarks $B$ by showing that only a few principal components of $B$ capture most of its variation (Eq. 5). We demonstrate that the benchmark-model matrix $B$ for a reasonable, broad set of benchmarks and models is low-rank and that Eq. 5 is a reasonable assumption. As this type of analysis depends heavily on the set of models and benchmarks chosen, we carefully describe our selection process below.

Models

Since the benchmark-model matrix B𝐵Bitalic_B can be directly measured for any LM, we include a large number of publicly accessible models for subsequent analysis. We collected a broad set of open LMs covering 21 model families (a collection of models across scales such as LLaMA-2 7B, 13B, 70B) and a total of 77 models. These encompass models trained from heterogeneous recipes, including standard training recipes like LLaMA [83] and Qwen [8], those trained on synthetic data like Phi [48], and models specifically trained on code data like CodeLlama [71] and StarCoder [46]. For this analysis, we consider only pretrained base models to avoid the complexities introduced by instruction tuning. We also include an analysis for instruction-tuned models that include proprietary ones like GPT-4 [62] and Claude-2 [4] in Sec. C.1, which demonstrates similar results. See table B.1 for a detailed list of collected models.

Benchmarks

We collected a set of diverse benchmarks that assess various LMs’ capabilities. These include popular aggregated benchmarks like MMLU [31] that assess the general knowledge of LMs. For more specialized evaluations, we included ARC-C [19], HellaSwag [99], Winogrande [72] for commonsense reasoning, GSM8K [20] for mathematical reasoning, HumanEval [16] for programming, TruthfulQA [50] for truthfulness, and XWinograd [60] for multilingual capabilities. We carefully collected these metrics from standardized evaluation protocols for comparability across LMs. In particular, we compiled them from standardized leaderboards, like the Open LLM Leaderboard [9] and EvalPlus [52], when available. Otherwise, we used standardized libraries such as the LM Eval Harness [27] to evaluate the LMs. See Sec. B.1 for full details of our data collection pipeline.

PCA analysis

After obtaining the benchmark metrics for the LMs, we addressed potential missing values (less than $1\%$ of all data), which may have occurred due to evaluation failures, by using PCA imputation. Subsequently, we applied PCA to extract the principal components of the evaluation metrics as the "principal capability" (PC) measures $S$ (additional details in Sec. B.3).

PC measures are low-dimensional

We observe that the extracted PC measures are predominantly low-rank, with the top 3 PCs explaining $\sim 97\%$ of the variance, which supports a low-dimensional representation of benchmarks $B$ (Fig. 2(a)). Surprisingly, we find that the first PC alone explains nearly 80% of the variation in LM capabilities. Taking a closer look at these PCs, we find that these capability measures represent interpretable directions in which LM capabilities may naturally vary as a function of scale (Fig. 2(b)). Specifically, PC-1 represents the "general capability" as a weighted average of all metrics; PC-2 corresponds to the "reasoning capability", emphasizing mathematical and coding benchmarks; and PC-3 primarily reflects the "programming capability". These findings suggest that many simple LM capabilities (as covered in our benchmarks) can be expressed as a linear combination of just a few "principal capabilities" $S$.
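As a rough sketch of this analysis, PCA on a synthetic low-rank stand-in for the benchmark-model matrix $B$ reproduces the qualitative picture (the shapes, noise level, and rank here are illustrative assumptions, not the paper's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the benchmark-model matrix B (77 models x 8 benchmarks),
# built to be approximately rank-3 plus noise.
n_models, n_bench, k = 77, 8, 3
B = rng.normal(size=(n_models, k)) @ rng.normal(size=(k, n_bench))
B += 0.05 * rng.normal(size=(n_models, n_bench))

# PCA via SVD of the centered matrix.
U, sing, _ = np.linalg.svd(B - B.mean(axis=0), full_matrices=False)
S = U[:, :k] * sing[:k]                        # PC measures S, one row per model
explained = (sing[:k] ** 2).sum() / (sing ** 2).sum()
print(f"top-{k} PCs explain {explained:.1%} of the variance")
```

On data that is genuinely near-rank-3, the top-3 explained-variance ratio comes out close to 1, mirroring the $\sim 97\%$ observed on the real benchmark matrix.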

Figure 3: The extracted PC measures linearly correlate with log-compute within each model family. The linearity generally holds for various model families, and also for lower-ranked PCs (Fig. C.2).

3.3 Principal Capability Measures as Surrogate Scale Measures (Eq. 4)

We now show that the PC measures $S$ scale log-linearly with training FLOPs within each model family, and can thus be interpreted as a cross-family generalization of compute $C$.

Setup

We collected all available information about training FLOPs for each of our models, analyzing papers and other public information to identify model size $N$ and pretraining data size $D$. For the models where we were able to identify this information, we used the simple estimate $C \approx 6ND$ to obtain model training FLOPs [42]. See Table B.1 for our collected compute measures.
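For instance, plugging Llama-2-7B's roughly $7\times 10^{9}$ parameters and $2\times 10^{12}$ pretraining tokens into this estimate recovers the $8.4\times 10^{22}$ FLOPs figure used as the holdout cutoff in Sec. 4:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate of pretraining compute: C ~ 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

# Llama-2-7B: ~7e9 parameters trained on ~2e12 tokens.
c = train_flops(7e9, 2e12)
print(f"{c:.2e}")  # prints 8.40e+22
```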

PC measures linearly correlate with log-compute measures

Fig. 3 illustrates the correlation between the top PC-1 measure and the corresponding training FLOPs for models within each model family. We find that for each model family with controlled training recipes and comparable compute scale measures, the LMs' PC-1 measure linearly correlates with their log-training FLOPs (with $R^2 > 0.9$). This linear correlation holds across a broad range of model families, including those specifically trained on multilingual data like BLOOM [92] or those trained on code like StarCoder [46]. It also generally holds for lower-ranked PCs such as PC-2 and PC-3, as shown in Fig. C.2. Together with Fig. 2, these results support the validity of Equations 5 and 4, in which we hypothesized that models share the same capability space and a log-linear relationship determines the efficiency by which each model family converts its compute into these principal capabilities.
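The within-family fit amounts to ordinary least squares of PC-1 on log-FLOPs; a minimal sketch with made-up numbers for a hypothetical four-model family:

```python
import numpy as np

# Hypothetical four-model family: PC-1 vs. training FLOPs (values are illustrative).
flops = np.array([1e21, 4e21, 2e22, 8e22])
pc1 = 0.8 * np.log10(flops) - 15.0 + np.array([0.02, -0.03, 0.01, -0.01])

# Within a family, PC-1 is (nearly) linear in log-compute: PC1 = w * log10(C) + b.
w, b = np.polyfit(np.log10(flops), pc1, deg=1)
pred = w * np.log10(flops) + b
r2 = 1.0 - np.sum((pc1 - pred) ** 2) / np.sum((pc1 - pc1.mean()) ** 2)
print(f"w={w:.2f}, b={b:.2f}, R^2={r2:.3f}")
```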

3.4 Fitting Observational Scaling Laws (Eq. 3)

Having validated that a simple PC analysis leads to capability measures $S$ that approximately fulfill Equations 4 and 5, we now define a procedure to estimate the scaling relationship in Eq. 3. The complete algorithm is presented in Algorithm 1.

Fitting regression with PC measures

Given a downstream error metric $E$ normalized to $[0,1]$ that measures certain LM capabilities, we slightly generalize Eq. 3 to

$E_m \approx h\,\sigma(\beta^\top S_m + \alpha)$   (6)

where $\beta \in \mathbb{R}^K$ and $\alpha \in \mathbb{R}$ are the regression weights and bias, and $h \in [0,1]$ is a sigmoidal scale that accounts for potential discrepancies in the floor performance. We fit the regression with ordinary least squares and restrict $h \in [0.8, 1.0]$, which results in $h^* = 1$ in most experiments.
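A one-dimensional sketch of fitting Eq. 6: with $h$ fixed to 1 (the value the fit selects in most experiments), the model linearizes under a logit transform, so ordinary least squares suffices. The synthetic data and parameter values below are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
S = np.linspace(-3, 3, 40)                   # 1-D stand-in for PC measures
E_obs = np.clip(
    sigmoid(-1.5 * S + 0.5) + 0.003 * rng.normal(size=S.size),  # true beta=-1.5, alpha=0.5
    1e-4, 1 - 1e-4,
)

# With h = 1, Eq. 6 becomes logit(E) = beta * S + alpha: a linear model.
logit = np.log(E_obs / (1 - E_obs))
beta, alpha = np.polyfit(S, logit, deg=1)
print(f"beta={beta:.2f}, alpha={alpha:.2f}")
```

When $h < 1$ is allowed, the fit is instead a bounded least-squares over $(h, \beta, \alpha)$; the linearized version above is only the $h = 1$ special case.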

Defining interpretable compute-like measures

Recall that the core component of our scaling law is the fitted linear transformation $P_m := \beta^{*\top} S_m + \alpha^*$, which maps the extracted PCs into a scalar capability measure for a target downstream metric. While this is perfectly acceptable for prediction, our scaling analysis would be more interpretable if we expressed capabilities in units of FLOPs rather than an arbitrary scalar capability measure.

Recall that our observational scaling laws generalize compute scaling laws for a single model family (Eq. 3 & Eq. 4). Thus, for a specific family $f$, our observational scaling laws should correspond to some compute scaling law. Specifically, we note that when Eq. 4 holds exactly, we have that for a model $m$ within a family $f$,

$P_m := \beta^{*\top} S_m + \alpha^* = w_f \log(C_m) + b_f$   (7)

where $w_f = \beta^{*\top} \theta_f$ and $b_f = \beta^{*\top} \nu_f + \alpha^*$. This implies a linear correlation between the scalar capability $P_m$ and the compute $\log(C)$ for models within a specific family on a downstream task (see empirical validation in Fig. C.3). Since $\theta_f$ and $\nu_f$ are unknown a priori, we can fit these coefficients $w_f, b_f$ via linear regression from $\log(C)$ to $P$ using models from the specific family $f$.

In the multi-model-family case, variations in compute efficiency mean that FLOPs and capabilities are no longer log-linearly related across model families. However, we can map all of the models to a shared, FLOPs-based capability measure of a specific family $f$. The core idea is to represent each model's capabilities by the following hypothetical: "how many FLOPs ($\bar{C}_{m,f}$) would it take for a model in family $f$ to match model $m$?" We call $\bar{C}_{m,f}$ the $f$-equivalent FLOPs for model $m$, as it represents the performance of model $m$ relative to models in the reference model family $f$. This measure can be computed fairly easily as

$\log(\bar{C}_{m,f}) := \frac{1}{w_f^*}\left(\beta^{*\top} S_m + \alpha^* - b_f^*\right),$   (8)

obtained by solving for $\log(C_m)$ in Eq. 7. Throughout the remainder of this work, we apply this scalar transformation with Llama-2 [84] as the reference family $f$, so the x-axis of all of our plots can be interpreted as "model capabilities, as measured in units of Llama-2 FLOPs".
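Eq. 8 is simply the inversion of Eq. 7; a sketch with hypothetical $w_f^*, b_f^*$ values (the real ones come from the regression on the reference family):

```python
import numpy as np

# Hypothetical reference-family fit (Eq. 7): P = w_f * log10(C) + b_f.
w_f, b_f = 0.8, -17.0

def f_equivalent_flops(P_m, w_f, b_f):
    """Eq. 8: invert Eq. 7 to express a scalar capability P_m in reference-family FLOPs."""
    return 10.0 ** ((P_m - b_f) / w_f)

# Sanity check: a reference-family model maps back to its own training compute.
P = w_f * np.log10(8.4e22) + b_f
print(f"{f_equivalent_flops(P, w_f, b_f):.2e}")  # prints 8.40e+22
```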

4 Validating Observational Scaling Laws

We evaluate the usefulness of observational scaling laws by showing that they accurately predict the scaling behaviors of LMs over complex, hard-to-predict phenomena (like emergent phenomena and agentic abilities) and help estimate the value of techniques such as Chain-of-Thought.

To ensure that our scaling laws are actually predictive and that we are not simply overfitting through various choices in scaling law construction and hyperparameters, we design our experiments to have systematic holdout sets and robustness checks. We also preregister our predictions for future models after the release of the paper as a test of whether our scaling laws overfit current models. We release our code including the implementation and collected data at https://github.com/ryoungj/ObsScaling.

Details of scaling law fits

For extracting PC measures, we fixed the number of PCs at $K=3$, as this covered $\sim 97\%$ of the variation in benchmark performance and consistently yielded the best performance across most of our experiments; see Sec. C.3 for robustness checks on PC selection. For the capability-equivalent scale transformation, we used Llama-2 [84] as the reference model family, as it is currently the most representative and widely used open model in the community. For better interpretability and visualization, we used the accuracy metric, typically defined as $Y = 1 - E$, for fitting the scaling laws and making the plots.

Holdout validation

To validate our observational scaling laws, our primary objective is to assess how accurately the scaling laws fit the available data and extrapolate from smaller-scale, less capable models to larger-scale, more powerful models. We validate this through systematic holdouts for the test set, where we split available models into weaker and stronger ones based on either scale or capability (e.g., FLOPs or accuracy). We used the weaker models to fit the scaling law and evaluated the extrapolated predictions on the stronger ones. To prevent any train-test leakage, all preprocessing steps (e.g., PCA imputation) were fitted on the train set only and then applied to the test set. Unless otherwise stated, we set the cutoff to include all models with training FLOPs less than or equal to that of Llama-2-7B ($8.4 \times 10^{22}$) as training data, resulting in a training set of 47 models and a test set of 30 models. We included robustness checks for different holdout strategies in Sec. C.3.

As baselines, we compare our scaling predictions to those using existing compute-based scale measures like training FLOPs and model size. We used the mean squared error (MSE) on the holdout set as our main evaluation measure, as the target range is always normalized (0 to 1), and estimating the marginal variance in $R^2$ can add additional noise when the test set sizes are small.
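The holdout protocol can be sketched in a few lines (the model FLOPs values below are made up; the cutoff is the Llama-2-7B figure from above):

```python
import numpy as np

CUTOFF = 8.4e22  # training FLOPs of Llama-2-7B, the default cutoff

def split_by_flops(flops, cutoff=CUTOFF):
    """Weaker models (<= cutoff) form the train set; stronger ones are held out."""
    flops = np.asarray(flops)
    return flops <= cutoff, flops > cutoff

def holdout_mse(y_true, y_pred):
    """Mean squared error on [0, 1]-normalized metrics."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

flops = np.array([1e21, 5e21, 8.4e22, 3e23, 2e24])  # made-up compute measures
train, test = split_by_flops(flops)
print(train.sum(), test.sum())  # prints 3 2
```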

Preregistration of predictions

In Sec. C.7, we include all functional forms for the fitted scaling laws in our experiments as preregistration of our predictions for future models. We will assess the accuracy of these scaling laws (without refitting) using models developed after May 2024 and commit to updating the manuscript on ArXiv with our prediction results after 4 months.

4.1 Predictability of “Emergent” Capabilities

(a) Training-FLOPs-based scaling law
(b) Observational scaling laws
Figure 4: “Emergent” capabilities of LMs can be accurately predicted from weaker models to stronger ones with observational scaling laws, and using PC measures as the predictor provides much more accurate predictions than using compute scale measures like training FLOPs and model size (see Fig. C.10). Two non-arithmetic and two arithmetic tasks from BigBench [75], which are identified as “emergent” in [90], are used for illustration.

Recent works have argued that many LM capabilities are "emergent" and cannot easily be predicted from small-scale models [90, 26]. Discontinuous changes in capabilities would make it difficult to develop algorithms and benchmarks that are effective at scale, and there have been ongoing debates about whether these capabilities are truly discontinuous and whether the discontinuity is an artifact of the metric used [73, 56, 23, 37] or of a lack of high-resolution data points [36].

The debate on emergent phenomena has been complicated by the fact that existing scaling analyses (including the original ones in Wei et al. [90]) have very few points [36]. When there are only 5 models across many orders of magnitude of scale, phenomena can appear to be discontinuous, even if the underlying phenomenon is a smooth but rapidly varying sigmoid.

We show that the higher resolution of observational scaling laws allows us to clearly see smooth sigmoidal curves in phenomena that were identified as emergent in Wei et al. [90], and even more surprisingly, we can often accurately forecast the transition points where models go from near-random to high performance using only models whose performance is barely above random. Our findings validate the observational approach to scaling laws and provide evidence that higher-resolution scaling laws could help us better understand scaling phenomena for LMs.

Setup

We tested on four BigBench [75] tasks that were labeled as "emergent" in Wei et al. [90], including two arithmetic tasks (3-digit subtraction and 2-digit multiplication) and two non-arithmetic tasks (word unscramble and Persian QA). Additional results on more tasks covering Wei et al. [90] are included in Sec. C.4. For the models, we included base pretrained models following the approach of Wei et al. [90]. For non-arithmetic tasks, we used the default FLOPs cutoff. For arithmetic tasks, we found that this cutoff resulted in an excess of training data near perfect performance (see results in Fig. C.11), making the prediction tasks trivial. Consequently, we reduced the cutoff to a quarter of the default value and also excluded GSM8K (which may be a superset of arithmetic tasks) from our base metrics $B$ to make the tasks more challenging.

Prediction results

Fig. 4 shows our prediction results using our PC measures, as well as the baseline of predicting performance based on training FLOPs. We find that these capabilities can be accurately predicted using our PC measures, even when only using models that perform poorly. In contrast, using training FLOPs results in significantly poorer extrapolation on the test set and poorer fits on the train set, as indicated by the much higher MSE values. This discrepancy is likely due to the incomparability of training FLOPs across different model families. Additional results for the model size baseline are included in Sec. C.4.

4.2 Predictability of Agentic Capabilities

(a) AgentBench
(b) AgentBoard


(c) Weight visualization
Figure 5: (a)-(b) The agentic capabilities of instruction-tuned LMs measured by agent benchmarks can be accurately predicted from weaker models (sub GPT-3.5) to stronger ones (e.g., GPT-4) by their PC measures. (c) The fitted weights on both benchmarks demonstrate the importance of programming capabilities (HumanEval) for the agentic capabilities of LMs.

There is significant interest in building autonomous agents using LMs, with notable examples including AutoGPT [70], Devin [44], and SWE-agent [97]. Although the performance of these agents still falls far below human-level on challenging real-world tasks [101, 41, 58], there is a belief that future models at larger scales will significantly enhance these agents’ capabilities. However, there is a significant uncertainty about whether existing models that are trained for language and code capabilities will transfer well to agentic tasks that require taking actions over many rounds. In this section, we utilize our observational scaling laws to analyze the scaling properties of LMs’ agentic capabilities w.r.t. their backbone model capabilities and show that agent performance is highly predictable from simple benchmark metrics.

Setup

We tested on two standardized agent evaluation benchmarks, AgentBench [54] and AgentBoard [57], each a collection of diverse tasks for evaluating LMs' generic agentic capabilities. For both benchmarks, we utilized their provided aggregated metrics across all tasks for prediction. Specifically, we used the "Overall Score" on AgentBench, which is a weighted average of scores across all tasks (denoted as "OA" in the benchmark), and the "Average Success Rate" on AgentBoard. We included models that have been evaluated on each benchmark, encompassing both open instruction-tuned models like LLaMA-2-Chat [84] and Vicuna [17] and proprietary models like GPT-4 [62] and Claude-2 [4]; see Table B.2 for a complete list of included models.

We followed the same procedure to collect standardized benchmark metrics $B$ for instruction-tuned models, including MMLU [31], ARC-C [19], HellaSwag [99], Winogrande [72], TruthfulQA [50], GSM8K [20], and HumanEval [16]; see Sec. B.1.2 for details. The PC measures extracted for these instruction-tuned models followed a similar pattern to those of pretrained base models, as shown in Fig. C.1. Notably, since compute scale measures are not available for proprietary models, only our observational scaling laws apply here, not compute scaling laws. The default FLOPs cutoff does not apply either, so we held out the top 10% performing models on each agent benchmark as the test set to simulate weak-to-strong predictions; these included GPT-4 and Claude-2 on AgentBench and GPT-4 on AgentBoard.

Prediction results

Fig. 5 illustrates the prediction results with our observational scaling laws using PC measures. We find that on both agent benchmarks, the performance of held-out models (GPT-4/Claude-2) can be accurately predicted from models with much weaker performance (> 10% gap). This indicates that the more complex agentic capabilities of LMs are well-correlated with and predictable from their base model capabilities, suggesting the promising scaling properties of LM-based agent capabilities as backbone LMs continue to scale up.

Interpreting the capability dimensions

In Fig. 5(c), we visualize the weights assigned to the base evaluation metrics on both benchmarks, which are derived from the regression weights fitted on PC measures with the learned PCA transformation applied, i.e., $\beta^\top \gamma$. We observe that the fitted weights assign significant importance to programming capabilities (HumanEval) on both benchmarks, underscoring its significance in defining the agentic capabilities of LMs. The weights also emphasize general knowledge (MMLU) on AgentBench and reasoning capabilities (GSM8K) on AgentBoard, suggesting that these capabilities may also be important for LMs' agentic capabilities.
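This back-projection is a single matrix product between the PC-space regression weights and the PCA loadings; a sketch with illustrative shapes and hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(40, 7))                   # stand-in benchmark-model matrix

# PCA loadings gamma via SVD of the centered matrix (K = 3 PCs, 7 benchmarks).
_, _, Vt = np.linalg.svd(B - B.mean(axis=0), full_matrices=False)
gamma = Vt[:3]                                 # shape (3, 7)

beta = np.array([1.0, 0.5, -0.2])              # hypothetical regression weights on PCs
per_benchmark_weight = beta @ gamma            # beta^T gamma: one weight per benchmark
print(per_benchmark_weight.shape)              # prints (7,)
```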

4.3 Predicting the Impact of Post-Training Techniques

(a) Scaling prediction of post-training techniques
(b) Weight visualization
Figure 6: (a) The LM performance with and without techniques like CoT and Self-Consistency can be accurately predicted with observational scaling laws. The fitted scaling curves indicate that CoT has better scaling behavior than SC. See Fig. C.13 for detailed per-method scaling plots and comparison with compute baselines. (b) The fitted weights demonstrate a very different pattern when CoT is applied, emphasizing general knowledge (MMLU) and programming capabilities (HumanEval).

When researchers propose a new prompting or post-training technique to improve a pretrained model, how can we know whether its gains will persist across models and scales? Scaling analysis could enable more quantitative approaches to the design of post-training interventions, but systematic scaling analyses have been rare due to the small number of models within a single model family. Adding to these challenges, some recent works have argued that certain interventions, such as Chain-of-Thought [91], behave in an emergent way and are not predictable from smaller models [90]. Using observational scaling laws, we show that it is possible to make relatively accurate predictions of the effectiveness of techniques such as Chain-of-Thought (CoT) [91] and Self-Consistency (SC) [89] as model scale increases. We focus on these two interventions in particular, as they are sometimes cited as examples of post-training techniques that require scale to be effective [91, 90].

Our approach to quantifying the scaling properties of post-training is straightforward: we fit one observational scaling law using base model performance on a target benchmark (e.g., GSM8K few-shot), and then fit another on the performance of models with the post-training intervention (e.g., GSM8K w/ CoT). Each of these fits produces a sigmoidal scaling curve as a function of log(C̄_f), and the relative gap between the curves as a function of log(C̄_f) indicates the scaling efficiency of the intervention.
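The two-curve procedure can be sketched as follows. This is a hedged illustration: the accuracies are synthetic, and the generic three-parameter logistic stands in for whatever sigmoidal functional form the analysis actually uses; the variable `x` plays the role of the log-capability measure log(C̄_f).

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, a, b, c):
    # Saturating scaling curve: accuracy as a function of a log-capability measure.
    return a / (1.0 + np.exp(-b * (x - c)))

# Synthetic stand-ins: y_base / y_cot are benchmark accuracies
# without / with the post-training intervention.
x = np.linspace(-2.0, 4.0, 25)
rng = np.random.default_rng(0)
y_base = sigmoid(x, 0.7, 1.0, 2.0) + 0.01 * rng.standard_normal(25)
y_cot = sigmoid(x, 0.9, 1.2, 1.0) + 0.01 * rng.standard_normal(25)

# Fit one sigmoidal scaling curve per condition.
p_base, _ = curve_fit(sigmoid, x, y_base, p0=[1.0, 1.0, 0.0])
p_cot, _ = curve_fit(sigmoid, x, y_cot, p0=[1.0, 1.0, 0.0])

# The gap between the fitted curves at larger scales indicates the
# scaling efficiency of the intervention.
gap_at_large_scale = sigmoid(4.0, *p_cot) - sigmoid(4.0, *p_base)
print(f"predicted gap at x=4: {gap_at_large_scale:.3f}")
```

Evaluating the gap over a range of `x` values, rather than a single point, would trace out how the intervention's benefit evolves with scale.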

Setup

We tested on GSM8K with CoT and SC as post-training techniques and included additional results on BigBench-Hard [76] with CoT in Sec. C.5. As with our study of emergent phenomena on arithmetic tasks, we excluded GSM8K from the base metrics B to avoid making the prediction task trivial. We included all the pretrained base models listed in Table B.1, including those specifically trained on code data, and applied the default FLOPs cutoff for holdout validation. For CoT, we followed Wei et al. [91] and compared CoT prompting using eight reasoning examples against naive prompting using only few-shot examples in the greedy decoding setting. For SC, we sampled five CoT reasoning paths at temperature 0.7 to aggregate the final answers, following Wang et al. [89], and compared it with a single sampled CoT answer.
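For concreteness, the SC aggregation step described above is a majority vote over the final answers parsed from the sampled reasoning paths. A minimal sketch (the sampled answers are made up for illustration):

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over final answers parsed from sampled CoT paths,
    following the Self-Consistency aggregation scheme."""
    most_common_answer, _ = Counter(final_answers).most_common(1)[0]
    return most_common_answer

# Hypothetical final answers from five CoT samples drawn at temperature 0.7.
samples = ["42", "42", "41", "42", "40"]
print(self_consistency(samples))  # majority answer: "42"
```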

Prediction results

Fig. 6(a) shows the scaling predictions for CoT and SC using observational scaling laws. We find that the performance of stronger, larger-scale models with (CoT, CoT + SC) and without (Naive) post-training techniques can be accurately predicted from weaker, smaller-scale models. In contrast, predictions based on compute scale measures like model size and training FLOPs are less reliable, as seen in Fig. C.13. Notably, the scaling trends of the two techniques differ: CoT shows a much more pronounced scaling trend than Self-Consistency w/ CoT.

Interpreting the capability dimensions

Another advantage of observational scaling laws over scaling laws constructed on single model families is that we can visualize the capabilities that matter for the post-training intervention. Fig. 6(b) visualizes the fitted regression weights β, mapped to the space of base capability benchmarks B via β⊤γ. We clearly see that when we go from Naive to CoT, significantly higher weights are placed on MMLU and HumanEval, meaning that scaling models in a way that enhances general knowledge (MMLU) and code (HumanEval) leads to greater gaps between CoT and the baseline, while improving along commonsense benchmarks such as Winogrande does not necessarily lead to improvements at scale. These analyses can inform how different post-training interventions interact with different scaling recipes, such as code models vs. general-purpose LLMs.

Figure 7: (a) Selecting model subsets with our V-optimality criterion leads to significantly lower errors than random selection, and quickly converges to the errors of using the full set of models. (b) Using 12 (out of 47) models selected by our method maintains the overall prediction accuracy. See also detailed per-method scaling plots with different numbers of selected models (Fig. C.15) and with randomly selected ones (Fig. C.16).

5 Selecting Low-Cost Model Subsets for Practical Scaling Analyses

We have now demonstrated the effectiveness of observational scaling laws in forecasting the scaling behavior of various LM capabilities. However, the large number of publicly available models is both a strength and a weakness: it enables much higher resolution scaling analyses, but it also requires us to evaluate our benchmarks and post-training methods on a larger number of models.

To make observational scaling analyses more broadly accessible, we identify a small set of models that maintain high prediction accuracy while significantly reducing the evaluation cost. We do this by building upon classic approaches in optimal experimental design, which allow us to define optimality criteria for selecting model subsets without knowing the downstream task.

Method

More specifically, we consider the constrained optimization problem of identifying the optimal set of models for a regression problem, subject to the constraint that we select a model subset ℳ of at most M_max models from the set of all models ℳ_a. To define optimality, we turn to the theory of optimal experimental design, which states that for linear regression with a fixed design X and subset ℳ, the expected prediction error from using the subset X_ℳ is Tr(X⊤X (X_ℳ⊤ X_ℳ)⁻¹). This gives a straightforward objective achieving V-optimality [66]:

min_{ℳ ∈ 𝒫(ℳ_a), |ℳ| ≤ M_max} Tr(S⊤S (S_ℳ⊤ S_ℳ)⁻¹)    (9)

where S ∈ ℝ^{M×K} is the model-capability matrix obtained from our PC analysis. Instead of directly searching over all model subsets, we conduct a structured search over model families, including or excluding entire families, as we believe the selected models are then more interpretable and more likely to be adopted by practitioners. In our case, there are only 21 families, so we simply perform an exhaustive search over all possible combinations to find the optimal subset under the maximum-model budget constraint.
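The family-level exhaustive search can be sketched as below. The capability matrix and family labels here are toy stand-ins (the real analysis uses the 21 actual families and the fitted PC measures); the pseudoinverse is used defensively in case a candidate subset is rank-deficient.

```python
import numpy as np
from itertools import combinations

def v_optimality_cost(S, idx):
    # Expected prediction error Tr(S^T S (S_M^T S_M)^-1) for the subset idx (Eq. 9).
    S_m = S[idx]
    return float(np.trace(S.T @ S @ np.linalg.pinv(S_m.T @ S_m)))

def select_families(S, families, max_models):
    # Exhaustive search over include/exclude decisions for whole families,
    # keeping subsets within the model budget and minimizing the V-cost.
    names = sorted(set(families))
    best_combo, best_cost = None, float("inf")
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            idx = [i for i, fam in enumerate(families) if fam in combo]
            if len(idx) > max_models:
                continue
            cost = v_optimality_cost(S, idx)
            if cost < best_cost:
                best_combo, best_cost = combo, cost
    return best_combo, best_cost

# Toy example: 9 models from 3 families, 2 capability dimensions.
rng = np.random.default_rng(0)
S = rng.standard_normal((9, 2))
families = ["llama"] * 3 + ["falcon"] * 3 + ["phi"] * 3
subset, cost = select_families(S, families, max_models=6)
print(subset, round(cost, 2))
```

With K families the search visits 2^K − 1 combinations, which is cheap for K = 21 when infeasible combinations are skipped early.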

Validation

We followed the setup in Sec. 4.3 to validate our selection method, as this represents the most likely way practitioners would apply our observational scaling laws. Our objective is to replicate the scaling analysis of Fig. 6(a) (which uses the full set of 47 models) with a small subset of models selected by our method. In Fig. 7(a), we compute the geometric average of test MSEs on all prediction tasks (Naive, CoT, CoT + SC) as the evaluation metric for different selection methods. We find that our V-optimality selection method significantly outperforms random selection and quickly converges to the prediction performance of using the full set of models. In Fig. 7(b), we show that with only a small subset of 12 models selected by our method, the fitted scaling curves already capture the scaling trends of the different post-training methods.
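The aggregate metric above is just a geometric mean of the per-task MSEs; a quick sketch (the MSE values are made up):

```python
import numpy as np

def geometric_mean_mse(mses):
    """Geometric average of per-task test MSEs (e.g., Naive, CoT, CoT + SC)."""
    mses = np.asarray(mses, dtype=float)
    return float(np.exp(np.log(mses).mean()))

# Example with hypothetical per-task MSEs.
print(geometric_mean_mse([1e-4, 1e-2, 1e-3]))
```

The geometric mean is the natural choice here because the per-task MSEs span several orders of magnitude, and it prevents the largest one from dominating the average.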

Recommended model series for scaling analysis
Table 1: Selected models for scaling analysis of post-training methods under different budgets.

Budget             | Selected Models
8 models           | Llama-2 {7B, 13B, 70B}, Mixtral {8x7B}, Phi {1.5, 2}, MPT {7B, 30B}
12 models          | Llama-2 {7B, 13B, 70B}, Llama-3 {8B, 70B}, DeepSeek-Coder {1.3B, 6.7B, 33B}, Falcon {1B, 7B, 40B, 180B}
20 models          | Llama-2 {7B, 13B, 70B}, Mixtral {8x7B}, Qwen {7B, 14B, 72B}, DeepSeek-Coder {1.3B, 6.7B, 33B}, CodeLlama {7B, 13B, 34B, 70B}, MPT {7B, 30B}, Falcon {1B, 7B, 40B, 180B}
8 models, sub-7B   | Llama-2 {7B}, Llama {7B}, Qwen {7B}, DeepSeek-Coder {1.3B, 6.7B}, Phi {1.5, 2}, MPT {7B}
12 models, sub-7B  | Llama-2 {7B}, Llama {7B}, Qwen {7B}, DeepSeek-Coder {1.3B, 6.7B}, Phi {1.5, 2}, MPT {7B}, Gemma {2B, 7B}, Falcon {1B, 7B}

To facilitate future scaling analyses of post-training techniques, we provide a reference list of models selected by our method under different budget constraints in Table 1. These models were chosen from all available ones (see Table B.1), with the Llama-2 models always included (as Llama-2 is currently the most representative and widely used model family). Notably, the selected models cover diverse capability ranges and dimensions to capture potential scaling dimensions. For example, under the 12-model budget, the selection covers both stronger models (Llama-3) and weaker ones (Falcon), as well as models with specialized programming capabilities (DeepSeek-Coder). Updating this list with other constraints (e.g., total inference FLOPs) or new model families is straightforward, and we provide both implementations and guidelines in our released code.

6 Discussion and Other Applications of Observational Scaling

Our work validates the hypothesis that there is a low-dimensional space of LM capabilities that captures their scaling behaviors and can be measured via a low-rank decomposition of existing LM benchmarks. While the majority of our work focuses on applications to scaling laws and predictions, we also find that the shared, low-dimensional capabilities could potentially be used as an evaluation metric and optimization target for LMs. We discuss some of these possibilities here.

Figure 8: PC-1 provides a smooth capability measure with a wider dynamic range than specific benchmarks like MMLU (Fig. C.4). In contrast to compute scale measures, it also enables the comparison of models from heterogeneous sources on a unified scale.
Figure 9: By transforming the fitted scaling curves to f-equivalent scales for different model families, we can compare their scaling properties with CoT and analyze the effect of training recipes on the scaling behavior.
PC-1 as a smooth capability measure with high dynamic range

Many existing benchmarks suffer from a limited dynamic range: they either saturate quickly for large models (e.g., HellaSwag, Winogrande) or show completely random performance for small models (e.g., MMLU, GSM8K); see Fig. C.4 for the behavior of each benchmark. In contrast, we find that PC-1 is a smooth capability measure that can be used to compare LMs across many (at least 5) orders of magnitude. This allows us to compare models from heterogeneous sources and of extremely different capabilities on a single, unified scale (Fig. 8). We believe that the high dynamic range of PC-1 may make it suitable as an optimization target for pretraining, where architecture or data interventions can be benchmarked against PC-1 at small scales and validated at large scales.
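As a sketch of how a PC-1-style measure arises, the snippet below applies PCA to a synthetic score matrix driven by a single latent capability; the data, sizes, and preprocessing are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic model-by-benchmark scores driven by one latent capability.
rng = np.random.default_rng(0)
ability = rng.standard_normal(50)      # latent per-model capability
loadings = rng.random(6)               # per-benchmark sensitivity to capability
scores = np.outer(ability, loadings) + 0.1 * rng.standard_normal((50, 6))

# PC-1: the first principal component of the score matrix.
pc1 = PCA(n_components=1).fit_transform(scores)[:, 0]

# PC-1 recovers the shared capability dimension (up to sign).
corr = abs(np.corrcoef(pc1, ability)[0, 1])
print(f"|corr(PC-1, latent ability)| = {corr:.2f}")
```

Because PC-1 pools signal from every benchmark, it stays informative both where individual benchmarks saturate and where they are still at chance.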

Training data efficiency measurements using PC-1

Extending these ideas further, since PC-1 serves as a unified measure of capabilities, it may offer a good way to compare compute efficiencies across many model families. In Fig. 8, we plot PC-1 against log-FLOPs and find that most models fall along a clear pattern in the training-compute-to-capabilities tradeoff curve. The Phi family is a clear outlier in compute efficiency, though this is likely because we are not accounting for the additional inference FLOPs that Phi uses to generate training data, which are not shown in this figure.

Post-training interventions and their interactions with model families

Finally, we can analyze the interactions between post-training techniques and model families by projecting the fitted scaling curves in Fig. 6(a) to f-equivalent FLOPs for different families f using Eq. 8. We can then identify which model families benefit the most from these techniques and the point at which they start to benefit. Fig. 9 shows an example comparing the predicted scaling of CoT across model families. We find that LMs benefit similarly from CoT, but that Phi is once again an outlier in its behavior: it benefits from CoT much earlier than other model families, but scales less rapidly. Similarly, models specifically trained on code (DeepSeek-Coder) also demonstrate an earlier transition but less rapid scaling compared to models trained with standard protocols. The distinct behavior of Phi and DeepSeek-Coder relative to other models indicates the importance of pretraining data in determining model scaling behaviors. While we did not specifically focus on these types of analysis in this work, we hope that our approach enables future work to gain further insight into differences between LM training recipes and their scaling behavior.

7 Conclusion

We have presented observational scaling laws – an approach that generalizes existing compute scaling laws to handle multiple model families using a shared, low-dimensional capability space. Using this approach, we show that we can build low-cost, high-resolution, and broad-coverage scaling laws that allow us to make accurate predictions for many complex scaling phenomena, such as emergent behaviors, agentic capabilities, and the value of post-training interventions. We provide concrete and practical prescriptions for researchers and practitioners to perform similar forms of scaling analyses for their own benchmarks and post-training methods in the hopes of encouraging more quantitative, scaling-law-based approaches to designing benchmarks and post-training methods.

Acknowledgements

We thank Zitong Yang for his assistance with an early experiment of the project. We also thank Jimmy Ba, Yann Dubois, Pavan Kapanipathi, Lisa Li, Karthik Narasimhan, Ethan Perez, Chenglei Si, Tristan Thrush, Zitong Yang, Shunyu Yao, and the Hashimoto Group for their helpful discussions and feedback on the paper draft. This project would not have been possible without open-source contributions including HuggingFace, EleutherAI LM Eval Harness [27], Open LLM Leaderboard [9], EvalPlus [52], vLLM [43], LMSys Chatbot Arena Leaderboard [18], and AlpacaEval Leaderboard [47].

TH and YR were supported in part by gifts from the Tianqiao and Chrissy Chen Institute, Open Philanthropy, Amazon ARA, Meta, and IBM. Resources used in preparing this research were provided in part by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), RGPIN-2021-03445.

Major Changelog

07/02/2024
  • Added clarifications emphasizing that the predictions of observational scaling laws are based on standard benchmark metrics rather than training FLOPs.
  • Updated plots to use a log10 scale on the x-axis instead of an ln scale for better readability.
  • Updated results with minor numerical deviations after fixing a minor issue.

References

  • Abnar et al. [2021] Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi. Exploring the limits of large scale pre-training. arXiv preprint arXiv:2110.02095, 2021.
  • AI [2024] Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/, 2024. Accessed: 2024-05-13.
  • Almazrouei et al. [2023] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The Falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
  • Anthropic [2023] Anthropic. Claude 2, July 2023. URL https://www.anthropic.com/index/claude-2. Accessed: 2023-08-31.
  • Anwar et al. [2024] Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932, 2024.
  • Arora and Goyal [2023] Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.
  • Bahri et al. [2021] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Beeching et al. [2023] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  • Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
  • Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
  • Black et al. [2022] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • Burnell et al. [2023] Ryan Burnell, Han Hao, Andrew RA Conway, and Jose Hernandez Orallo. Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062, 2023.
  • Caballero et al. [2022] Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. arXiv preprint arXiv:2210.14891, 2022.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, March 2023. Accessed: 2024-05-13.
  • Chiang et al. [2024] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024.
  • Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Databricks [2023] Databricks. Dolly: The first open commercially viable instruction-tuned LLM. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm, April 2023. Accessed: 2024-05-13.
  • Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2023.
  • Du et al. [2024] Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
  • Finnveden [2020] Lukas Finnveden. Extrapolating GPT-N performance. https://www.lesswrong.com/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance, 2020. Accessed: 2024-05-07.
  • Gadre et al. [2024] Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540, 2024.
  • Ganguli et al. [2022] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764, 2022.
  • Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
    高磊, 乔纳森·托, 巴伯·阿巴西, 斯特拉·比德曼, 西德·布莱克, 安东尼·迪波菲, 查尔斯·福斯特, 劳伦斯·戈尔丁, 许杰弗里, 阿兰·勒诺阿赫, 李浩南, 凯尔·麦克唐纳, 尼克拉斯·穆尼霍夫, 克里斯·奥西帕, 杰森·方, 拉里亚·雷诺兹, 海莉·斯科尔科普夫, 阿维亚·斯科龙, 林唐·苏塔维卡, 埃里克·唐, 阿尼什·蒂特, 王本, 王凯文, 安迪·邹. 一种少样本语言模型评估框架, 2023 年 12 月. URL https://zenodo.org/records/10256836.
  • Geng et al. [2023] 耿等人 [2023] Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
    耿新阳,阿尔纳夫·古迪班德,刘浩,埃里克·华莱士,皮特·阿贝尔,谢尔盖·莱文,和宋晓冬。Koala:一个用于学术研究的对话模型。博客文章,2023 年 4 月。网址:https://bair.berkeley.edu/blog/2023/04/03/koala/.
  • Ghorbani et al. [2021] Ghorbani 等人 [2021] Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.
    Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, 和 Colin Cherry. 神经机器翻译的扩展法则. arXiv 预印本 arXiv:2109.07740, 2021.
  • Guo et al. [2024] 郭等人 [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
    郭大雅, 朱启豪, 杨德健, 谢振达, 董凯, 张文涛, 陈冠廷, 毕晓, 吴宇, 李永康, 等. Deepseek-coder: 当大型语言模型遇到编程——代码智能的崛起. arXiv 预印本 arXiv:2401.14196, 2024.
  • Hendrycks et al. [2020] Hendrycks 等人 [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, 和 Jacob Steinhardt. 测量大规模多任务语言理解. arXiv 预印本 arXiv:2009.03300, 2020.
  • Henighan et al. [2020] Henighan 等人 [2020] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, 等。自回归生成建模的缩放定律。arXiv 预印本 arXiv:2010.14701, 2020.
  • Hernandez et al. [2021] Hernandez 等人 [2021] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.
    丹尼·埃尔南德斯,贾里德·卡普兰,汤姆·亨尼根,山姆·麦坎德利什。迁移的缩放定律。arXiv 预印本 arXiv:2102.01293, 2021。
  • Hestness et al. [2017] Hestness 等人 [2017] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, 和 Yanqi Zhou. 深度学习的扩展是可预测的,实证研究。arXiv 预印本 arXiv:1712.00409, 2017.
  • Hoffmann et al. [2022] 霍夫曼等人 [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
    乔丹·霍夫曼,塞巴斯蒂安·博尔乔,阿瑟·门施,埃琳娜·布查茨卡娅,特雷弗·蔡,伊丽莎·拉瑟福德,迭戈·德·拉斯·卡萨斯,丽莎·安妮·亨德里克斯,约翰内斯·韦尔布尔,艾丹·克拉克,等。训练计算最优的大型语言模型。arXiv 预印本 arXiv:2203.15556, 2022.
  • Hu et al. [2024] 胡等人 [2024] Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. Predicting emergent abilities with infinite resolution evaluation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=lDbjooxLkD.
    胡盛鼎, 刘鑫, 韩旭, 张新荣, 何超群, 赵伟林, 林彦凯, 丁宁, 欧泽斌, 曾国阳, 刘志远, 孙茂松. 通过无限分辨率评估预测新兴能力. 第十二届国际学习表征会议, 2024. URL https://openreview.net/forum?id=lDbjooxLkD.
  • Huang et al. [2024] 黄等人 [2024] Yuzhen Huang, Jinghan Zhang, Zifei Shan, and Junxian He. Compression represents intelligence linearly. arXiv preprint arXiv:2404.09937, 2024.
    黄玉珍, 张静涵, 单子飞, 和何俊贤. 压缩线性地代表智能. arXiv 预印本 arXiv:2404.09937, 2024.
  • Ilić [2023] 伊利奇 [2023] David Ilić. Unveiling the general intelligence factor in language models: A psychometric approach. arXiv preprint arXiv:2310.11616, 2023.
    David Ilić. 揭示语言模型中的一般智力因素:一种心理测量方法。arXiv 预印本 arXiv:2310.11616, 2023.
  • Jiang et al. [2023] 江等人 [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
    阿尔伯特·Q·姜,亚历山大·萨布莱罗尔,亚瑟·门施,克里斯·班福德,德文德拉·辛格·查普洛特,迭戈·德拉斯·卡萨斯,弗洛里安·布雷桑德,吉安娜·伦吉尔,纪尧姆·兰普尔,露西尔·索尔尼尔,等。Mistral 7b。arXiv 预印本 arXiv:2310.06825,2023。
  • Jiang et al. [2024] 姜等人 [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
    阿尔伯特·Q·姜,亚历山大·萨布莱罗尔,安托万·鲁,阿瑟·门施,布兰奇·萨瓦里,克里斯·班福德,德文德拉·辛格·查普洛特,迭戈·德拉斯·卡萨斯,艾玛·布·汉娜,弗洛里安·布雷桑等。专家混合。arXiv 预印本 arXiv:2401.04088, 2024。
  • Jimenez et al. [2023] Jimenez 等人 [2023] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2023.
    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, 和 Karthik R Narasimhan. Swe-bench: 语言模型能否解决现实世界中的 GitHub 问题?发表于第十二届国际学习表征会议,2023 年。
  • Kaplan et al. [2020] 卡普兰等人 [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
    贾里德·卡普兰,山姆·麦坎德利什,汤姆·亨尼汉,汤姆·B·布朗,本杰明·切斯,雷文·柴尔德,斯科特·格雷,亚历克·拉德福德,杰弗里·吴,达里奥·阿莫代伊。神经语言模型的扩展定律。arXiv 预印本 arXiv:2001.08361, 2020。
  • Kwon et al. [2023] 权等人 [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
    权宇锡, 李卓瀚, 庄思远, 盛颖, 郑连敏, 余浩, 约瑟夫·E·冈萨雷斯, 张浩, 和艾恩·斯托伊卡. 使用分页注意力的大型语言模型服务的高效内存管理. 载于《ACM SIGOPS 第 29 届操作系统原理研讨会论文集》,2023 年.
  • Labs [2024] 实验室 [2024] Cognition Labs. Introducing devin, the first ai software engineer, March 2024. URL https://www.cognition-labs.com/introducing-devin. Accessed: 2023-05-03.
    认知实验室。介绍 devin,第一个 AI 软件工程师,2024 年 3 月。网址:https://www.cognition-labs.com/introducing-devin。访问日期:2023-05-03。
  • LAION [2023] LAION. Open assistant. https://projects.laion.ai/Open-Assistant/, 2023. Accessed: 2024-05-13.
    LAION. Open assistant. https://projects.laion.ai/Open-Assistant/, 2023. 访问日期:2024-05-13.
  • Li et al. [2023a] 李等人 [2023a] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a.
    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, 等. Starcoder: may the source be with you! arXiv 预印本 arXiv:2305.06161, 2023a.
  • Li et al. [2023b] 李等人 [2023b] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
    李雪晨,张天一,Yann Dubois,Rohan Taori,Ishaan Gulrajani,Carlos Guestrin,Percy Liang,Tatsunori B. Hashimoto。Alpacaeval:一个自动评估指令遵循模型的工具。https://github.com/tatsu-lab/alpaca_eval,2023 年。
  • Li et al. [2023c] 李等 [2023c] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023c.
    李元智, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, 和李殷达. 教科书是你所需要的一切 ii: phi-1.5 技术报告. arXiv 预印本 arXiv:2309.05463, 2023c.
  • Liang et al. [2022] 梁等人 [2022] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, 等。语言模型的整体评估。arXiv 预印本 arXiv:2211.09110, 2022.
  • Lin et al. [2021a] 林等人 [2021a] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021a.
    Stephanie Lin, Jacob Hilton, 和 Owain Evans. Truthfulqa: 测量模型如何模仿人类谬误. arXiv 预印本 arXiv:2109.07958, 2021a.
  • Lin et al. [2021b] 林等人 [2021b] Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668, 2021b.
    林曦维多利亚, 托多尔·米哈伊洛夫, 米克尔·阿特克谢, 王天路, 陈硕辉, 丹尼尔·西米格, 迈尔·奥特, 纳曼·戈亚尔, 施鲁蒂·博萨莱, 杜静菲, 等. 使用多语言模型进行少样本学习. arXiv 预印本 arXiv:2112.10668, 2021b.
  • Liu et al. [2023a] 刘等人 [2023a] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
    刘家伟, 夏春秋, 王宇尧, 张灵明. 你的代码真的是由 chatGPT 生成的吗?对代码生成的大型语言模型的严格评估. 第三十七届神经信息处理系统会议, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
  • Liu et al. [2021] 刘等人 [2021] Nelson F Liu, Tony Lee, Robin Jia, and Percy Liang. Do question answering modeling improvements hold across benchmarks? arXiv preprint arXiv:2102.01065, 2021.
    Nelson F Liu, Tony Lee, Robin Jia, 和 Percy Liang. 问答建模改进是否在基准测试中保持一致?arXiv 预印本 arXiv:2102.01065, 2021.
  • Liu et al. [2023b] 刘等人 [2023b] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations, 2023b.
    肖刘,郝宇,韩晨张,亦凡徐,轩宇雷,汉宇赖,宇古,杭亮丁,凯文门,柯娟杨,等。Agentbench: 评估llms作为代理。在第十二届国际学习表征会议,2023 年。
  • Lozhkov et al. [2024] Lozhkov 等人 [2024] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, 等. Starcoder 2 和 stack v2:下一代. arXiv 预印本 arXiv:2402.19173, 2024.
  • Lu et al. [2023] 卢等人 [2023] Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, and Iryna Gurevych. Are emergent abilities in large language models just in-context learning? arXiv preprint arXiv:2309.01809, 2023.
    盛璐, 伊琳娜·比古拉耶娃, 拉赫尼特·萨赫德瓦, 哈里什·塔亚尔·马达布希, 和伊丽娜·古列维奇. 大型语言模型中的新兴能力只是上下文学习吗?arXiv 预印本 arXiv:2309.01809, 2023.
  • Ma et al. [2024] 马等人 [2024] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178, 2024.
    马畅,张俊磊,朱志豪,杨成,杨宇玖,金耀辉,兰振中,孔令鹏,何俊贤。Agentboard:多轮llm代理的分析评估板。arXiv 预印本 arXiv:2401.13178,2024。
  • Mialon et al. [2023] Mialon 等人 [2023] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023.
    格雷瓜尔·米亚隆,克莱芒汀·富里耶,克雷格·斯威夫特,托马斯·沃尔夫,扬·勒昆,托马斯·西亚洛姆。Gaia:通用人工智能助手的基准。arXiv 预印本 arXiv:2311.12983, 2023.
  • Miller et al. [2021] Miller 等人 [2021] John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International conference on machine learning, pages 7721–7735. PMLR, 2021.
    John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, 和 Ludwig Schmidt. 准确性在线:关于分布外和分布内泛化之间的强相关性。在国际机器学习会议上,页码 7721-7735。PMLR, 2021.
  • Muennighoff et al. [2022]
    Muennighoff 等人 [2022]
    Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
    尼克拉斯·穆尼霍夫, 托马斯·王, 林唐·苏塔维卡, 亚当·罗伯茨, 斯特拉·比德曼, 特文·勒斯考, M·赛富尔·巴里, 盛申, 郑新勇, 海莉·斯科尔科普夫, 等. 通过多任务微调实现跨语言泛化. arXiv 预印本 arXiv:2211.01786, 2022.
  • Muennighoff et al. [2024]
    Muennighoff 等人 [2024]
    Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.
    尼克拉斯·穆尼霍夫,亚历山大·拉什,博阿兹·巴拉克,特文·勒斯考,努阿曼·塔齐,亚历山德拉·皮克图斯,桑波·皮萨洛,托马斯·沃尔夫,科林·A·拉菲尔。扩展数据受限的语言模型。神经信息处理系统进展,36,2024。
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
    OpenAI. Gpt-4 技术报告, 2023.
  • Owen [2024] 欧文 [2024] David Owen. How predictable is language model benchmark performance? arXiv preprint arXiv:2401.04757, 2024.
    David Owen. 语言模型基准性能的可预测性如何?arXiv 预印本 arXiv:2401.04757, 2024.
  • Perlitz et al. [2023] Perlitz 等人 [2023] Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, and Leshem Choshen. Efficient benchmarking (of language models). arXiv preprint arXiv:2308.11696, 2023.
    Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, 和 Leshem Choshen. 高效基准测试(语言模型)。arXiv 预印本 arXiv:2308.11696, 2023.
  • Polo et al. [2024] Polo 等人 [2024] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992, 2024.
    费利佩·马亚·波罗,卢卡斯·韦伯,莱谢姆·乔申,孙岳凯,徐公俊,米哈伊尔·尤罗奇金。tinybenchmarks:用更少的例子评估llms。arXiv 预印本 arXiv:2402.14992,2024。
  • Pukelsheim [2006] 普克尔斯海姆 [2006] Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006.
    弗里德里希·普克尔斯海姆。《实验的最优设计》。SIAM,2006 年。
  • Qiu et al. [2018] 邱等人 [2018] Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, and Lijiao Yang. Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data: 17th China National Conference, CCL 2018, and 6th International Symposium, NLP-NABD 2018, Changsha, China, October 19–21, 2018, Proceedings 17, pages 209–221. Springer, 2018.
    邱媛媛, 李宏正, 李申, 姜颖迪, 胡仁芬, 杨丽娇. 重新审视词嵌入的内在和外在评估之间的相关性. 在中国计算语言学和基于自然标注大数据的自然语言处理: 第十七届中国全国会议, CCL 2018, 和第六届国际研讨会, NLP-NABD 2018, 中国长沙, 2018 年 10 月 19-21 日, 会议论文集 17, 第 209-221 页. Springer, 2018.
  • Recht et al. [2018] Recht 等人 [2018] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018.
    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, 和 Vaishaal Shankar. CIFAR-10 分类器能否泛化到 CIFAR-10?arXiv 预印本 arXiv:1806.00451, 2018.
  • Recht et al. [2019] Recht 等人 [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019.
    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, 和 Vaishaal Shankar. ImageNet 分类器能否泛化到 ImageNet?在国际机器学习会议上,页码 5389–5400。PMLR, 2019.
  • Richards [2023] 理查兹 [2023] Toran Bruce Richards. Auto-gpt: Autonomous artificial intelligence software agent. https://github.com/Significant-Gravitas/Auto-GPT, 2023. URL https://github.com/Significant-Gravitas/Auto-GPT. Initial release: March 30, 2023.
    托兰·布鲁斯·理查兹。Auto-gpt:自主人工智能软件代理。https://github.com/Significant-Gravitas/Auto-GPT, 2023。网址:https://github.com/Significant-Gravitas/Auto-GPT。初始发布:2023 年 3 月 30 日。
  • Roziere et al. [2023] Roziere 等人 [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, 等。Code llama: 开放代码基础模型。arXiv 预印本 arXiv:2308.12950, 2023.
  • Sakaguchi et al. [2021] 坂口等人 [2021] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
    坂口圭介,罗南·勒布拉斯,钱德拉·巴伽瓦图拉,崔艺真。Winogrande:大规模对抗性 Winograd 模式挑战。ACM 通讯,64(9):99–106,2021。
  • Schaeffer et al. [2023a] Schaeffer 等人 [2023a] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2023a.
    Rylan Schaeffer, Brando Miranda, 和 Sanmi Koyejo. 大型语言模型的突现能力是海市蜃楼吗?神经信息处理系统进展, 36, 2023a.
  • Schaeffer et al. [2023b] Schaeffer 等人 [2023b] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
    Rylan Schaeffer, Brando Miranda, 和 Sanmi Koyejo. 大型语言模型的突现能力是海市蜃楼吗?在第三十七届神经信息处理系统会议,2023 年 b。
  • Srivastava et al. [2022] 斯里瓦斯塔瓦等人 [2022] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, 等. 超越模仿游戏:量化和外推语言模型的能力. arXiv 预印本 arXiv:2206.04615, 2022.
  • Suzgun et al. [2022] Suzgun 等人 [2022] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, 等。挑战大基准任务以及链式思维是否能解决它们。arXiv 预印本 arXiv:2210.09261, 2022.
  • Taori et al. [2020] 陶里等人 [2020] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
    Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, 和 Ludwig Schmidt. 测量图像分类中对自然分布变化的鲁棒性. 神经信息处理系统进展, 33:18583–18599, 2020.
  • Tay et al. [2023] Tay 等人 [2023] Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
    易泰,莫斯塔法·德赫加尼,萨米拉·阿布纳尔,钟亨元,威廉·费杜斯,饶金锋,沙兰·纳朗,陈荣庆,丹尼·约加塔马,唐纳德·梅茨勒。扩展法则与模型架构:归纳偏差如何影响扩展?在 2023 年自然语言处理实证方法会议上,2023 年。
  • Team et al. [2024] 团队等人 [2024] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
    Gemma 团队,Thomas Mesnard,Cassidy Hardin,Robert Dadashi,Surya Bhupatiraju,Shreya Pathak,Laurent Sifre,Morgane Rivière,Mihir Sanjay Kale,Juliette Love,等。Gemma:基于双子座研究和技术的开放模型。arXiv 预印本 arXiv:2403.08295,2024。
  • Team [2024] 团队 [2024] Qwen Team. Introducing qwen1.5. https://qwenlm.github.io/blog/qwen1.5/, 2024. Accessed: 2024-05-13.
    Qwen 团队。介绍 qwen1.5。https://qwenlm.github.io/blog/qwen1.5/,2024。访问日期:2024-05-13。
  • Team [2023] 团队 [2023] The MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms. https://www.databricks.com/blog/mpt-7b, 2023. Accessed: 2024-05-13.
    MosaicML NLP 团队。介绍 mpt-7b:一个新的开源、可商业使用的llms标准。https://www.databricks.com/blog/mpt-7b, 2023。访问日期:2024-05-13。
  • Torregrossa et al. [2020]
    托雷格罗萨等人 [2020]
    François Torregrossa, Vincent Claveau, Nihel Kooli, Guillaume Gravier, and Robin Allesiardo. On the correlation of word embedding evaluation metrics. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4789–4797, 2020.
    François Torregrossa, Vincent Claveau, Nihel Kooli, Guillaume Gravier, 和 Robin Allesiardo. 关于词嵌入评估指标的相关性研究. 见第十二届语言资源与评估会议 (LREC 2020) 论文集, 第 4789-4797 页, 2020 年.
  • Touvron et al. [2023a] Touvron 等人 [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, 等人. Llama: 开放且高效的基础语言模型. arXiv 预印本 arXiv:2302.13971, 2023a.
  • Touvron et al. [2023b] Touvron 等人 [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, 等。Llama 2: 开放基础和微调聊天模型。arXiv 预印本 arXiv:2307.09288, 2023b.
  • Vaswani et al. [2017] Vaswani 等人 [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
    阿希什·瓦斯瓦尼,诺姆·沙泽尔,尼基·帕尔马尔,雅各布·乌兹科雷特,利昂·琼斯,艾丹·戈麦斯,卢卡斯·凯瑟,伊利亚·波洛苏金。注意力是你所需要的一切。神经信息处理系统进展,30,2017 年。
  • Villalobos [2023] 维拉洛博斯 [2023] Pablo Villalobos. Scaling laws literature review, 2023. URL https://epochai.org/blog/scaling-laws-literature-review. Accessed: 2024-05-12.
    巴勃罗·维拉洛博斯。规模定律文献综述,2023 年。网址:https://epochai.org/blog/scaling-laws-literature-review。访问日期:2024 年 5 月 12 日。
  • Wang et al. [2018] 王等人 [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2018.
    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, 和 Samuel R Bowman. Glue: 一个用于自然语言理解的多任务基准和分析平台. 在国际学习表征会议, 2018.
  • Wang et al. [2023a] 王等人 [2023a] Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data. In The Twelfth International Conference on Learning Representations, 2023a.
    王冠, 程思杰, 詹显元, 李显刚, 宋森, 刘洋. Openchat: 使用混合质量数据推进开源语言模型. 在第十二届国际学习表征会议, 2023a.
  • Wang et al. [2023b] 王等人 [2023b] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023b.
    王学智, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, 和 Denny Zhou. 自我一致性提高了语言模型中的链式思维推理能力. 在第十一届国际学习表征会议, 2023b.
  • Wei et al. [2022a] 魏等人 [2022a] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler 等人. 大型语言模型的突现能力. 机器学习研究汇刊, 2022a.
  • Wei et al. [2022b] 魏等 [2022b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
    Jason Wei, 王雪芝, Dale Schuurmans, Maarten Bosma, 夏飞, Ed Chi, Quoc V Le, 周丹尼, 等。链式思维提示在大型语言模型中引发推理。神经信息处理系统进展, 35:24824–24837, 2022b.
  • Workshop et al. [2022] Workshop 等人 [2022] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
    BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, 等人. Bloom: 一个拥有 1760 亿参数的开放获取多语言模型. arXiv 预印本 arXiv:2211.05100, 2022.
  • Xia et al. [2022] 夏等人 [2022] Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Ves Stoyanov. Training trajectories of language models across scales. arXiv preprint arXiv:2212.09803, 2022.
    孟舟·夏, 米克尔·阿特克谢, 周春廷, 林希·维多利亚, 拉马坎特·帕苏努鲁, 陈丹琦, 卢克·泽特尔莫耶, 和维斯·斯托亚诺夫. 不同规模语言模型的训练轨迹. arXiv 预印本 arXiv:2212.09803, 2022.
  • Xu et al. [2023] 徐等人 [2023] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
    徐灿,孙庆峰,郑凯,耿修博,赵璞,冯家展,陶崇阳,姜大新。Wizardlm:赋能大型语言模型遵循复杂指令。arXiv 预印本 arXiv:2304.12244,2023。
  • Xu et al. [2024] 徐等人 [2024] Yiheng Xu, SU Hongjin, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, et al. Lemur: Harmonizing natural language and code for language agents. In The Twelfth International Conference on Learning Representations, 2024.
    徐一恒, 苏宏进, 陈星, 米博宇, 刘倩, 史伟佳, 惠彬源, 周凡, 刘一涛, 谢天宝, 等. Lemur: 协调自然语言和代码的语言代理. 在第十二届国际学习表征会议, 2024.
  • Yadav and Bottou [2019] Yadav 和 Bottou [2019] Chhavi Yadav and Léon Bottou. Cold case: The lost mnist digits. Advances in neural information processing systems, 32, 2019.
    查维·亚达夫和莱昂·博图。冷案:丢失的 MNIST 数字。神经信息处理系统进展,32,2019 年。
  • Yang et al. [2024] 杨等人 [2024] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent computer interfaces enable software engineering language models, 2024.
    杨约翰, 卡洛斯·E·希门尼斯, 亚历山大·韦蒂格, 基利安·利雷特, 姚顺宇, 卡尔蒂克·纳拉辛汉, 和奥菲尔·普雷斯. Swe-agent: 代理计算机接口使软件工程语言模型成为可能, 2024.
  • Young et al. [2024] Young 等人 [2024] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
  • Zellers et al. [2019] Zellers 等人 [2019] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
    罗文·泽勒斯,阿里·霍尔茨曼,尤纳坦·比斯克,阿里·法哈迪,和叶真·崔。Hellaswag:机器真的能完成你的句子吗?arXiv 预印本 arXiv:1905.07830, 2019.
  • Zhang et al. [2022] 张等人 [2022] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, 等。Opt: 开放预训练的变压器语言模型。arXiv 预印本 arXiv:2205.01068, 2022.
  • Zhou et al. [2023] 周等人 [2023] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2023.
    周书彦, 徐方方, 朱浩, 周旭辉, 罗伯特·罗, 阿比谢克·斯里达尔, 程显毅, 欧天岳, 约纳坦·比斯克, 丹尼尔·弗里德, 等. Webarena: 一个用于构建自主代理的现实网络环境. 在第十二届国际学习表征会议, 2023.

Appendix A Algorithm

In Algorithm 1, we include the detailed algorithm for fitting the observational scaling laws as described in Sec. 3.

Args: number of models M, number of LM benchmarks T, number of principal components K, reference model family f
Input: base LM benchmark error metrics B ∈ ℝ^{T×M}, target downstream error metric E ∈ ℝ^M, LM compute scales C ∈ ℝ^M
Result: functional form of fitted scaling law F

/* Extract principal capability measures with applicable metric preprocessing */
B ← PCAImpute(B)    ▷ Fill in missing values with PCA imputation
E ← Normalize(E)    ▷ Normalize metric to [0, 1] for sigmoid non-linearity
γ, S ← PCA(B, K)    ▷ Fit PCA transformation γ ∈ ℝ^{K×T} and extract top components S = γB

/* Fit a non-linear regression with weights β ∈ ℝ^K, bias α ∈ ℝ, and sigmoidal scale h ∈ ℝ */
β*, α*, h* ← Fit(E = h·σ(β⊤S + α))    ▷ Obtain optimal parameters
P ← β*⊤S + α*    ▷ Obtain aggregated capability measures P ∈ ℝ^M

/* Project to the capability-equivalent scale of a reference model family */
w*, b* ← Fit(P_f = w·log(C_f) + b)    ▷ Fit linear projection with models in the reference family
log(C̄_f) ← (P - b*)/w*    ▷ Compute f-equivalent FLOPs for all models

/* Return the fitted scaling law with capability-equivalent scale transformation */
return F: B → h*·σ(β*⊤γB + α*)  or  C̄_f → h*·σ(w*·log(C̄_f) + b*)
Algorithm 1: Fitting observational scaling laws
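For concreteness, the steps of Algorithm 1 can be sketched end-to-end in a few lines of Python. The sketch below is our own illustrative implementation, not the authors' released code: PCA is done with a plain SVD, the non-linear regression uses `scipy.optimize.curve_fit`, and all function and variable names are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_observational_scaling_law(B, E, ref_flops, K=3):
    """Sketch of Algorithm 1 (names are illustrative, not the paper's code).

    B:         (T, M) benchmark metrics for M models on T benchmarks (no missing values)
    E:         (M,) target downstream metric, already normalized to [0, 1]
    ref_flops: {model index -> training FLOPs} for the reference family f
    """
    # PCA via SVD: mean-center the (models x benchmarks) matrix, keep top-K scores.
    Bc = B.T - B.T.mean(axis=0)
    U, sv, Vt = np.linalg.svd(Bc, full_matrices=False)
    S = (U[:, :K] * sv[:K]).T                         # (K, M) capability measures

    # Non-linear regression E = h * sigmoid(beta^T S + alpha).
    def model(s_flat, h, alpha, *beta):
        z = np.asarray(beta) @ s_flat.reshape(K, -1) + alpha
        return h / (1.0 + np.exp(-z))

    params, _ = curve_fit(model, S.ravel(), E, p0=[1.0, 0.0] + [0.1] * K, maxfev=20000)
    h, alpha, beta = params[0], params[1], params[2:]

    # Aggregated capability measures, then f-equivalent log10-FLOPs via the
    # linear fit P_f = w * log10(C_f) + b on the reference family.
    P = beta @ S + alpha
    idx = sorted(ref_flops)
    w, b = np.polyfit(np.log10([ref_flops[i] for i in idx]), P[idx], deg=1)
    log_flops_equiv = (P - b) / w

    pred_E = h / (1.0 + np.exp(-P))                   # the fitted scaling curve
    return pred_E, P, log_flops_equiv
```

In practice the PCA and imputation are fitted on a training split only (see Sec. B.3), which this sketch omits for brevity.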

Appendix B Experimental Details

B.1 Model Collection & Evaluation

B.1.1 Pretrained Base Models

Model collection

We collected a broad set of representative open LMs covering 21 model families and a total of 77 models. These model families include Llama-2 [84], Llama [83], Llama-3 [2], Qwen1.5 [80], Qwen [8], Mistral [39], Mixtral [40], Yi [98], Gemma [79], Falcon [3], Phi [48], Pythia [11], BLOOM [92], GPT-Neo/J [12], OPT [100], MPT [81], XGLM [51], CodeLlama [71], StarCoder [46], StarCoder2 [55], and DeepSeek-Coder [30]. For each model, we collected the available metadata, including the number of model parameters N and the number of pretraining tokens D, by analyzing papers and other public information. We then estimated the training FLOPs C for each model using the simple estimate C ≈ 6ND [42]. Note that for models that were continually pretrained on additional data, such as CodeLlama, we used the sum of the pretraining tokens and the additional continual pretraining tokens to estimate D. See Table B.1 for the collected metadata of these models.
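The C ≈ 6ND estimate above is a one-line computation; the sketch below checks it against the Llama-2-7B row of Table B.1 (N = 7.0e9 parameters, D = 2.0e12 tokens). The function name is our own.

```python
def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the standard C ~= 6 * N * D estimate."""
    return 6.0 * n_params * n_tokens

# Llama-2-7B: N = 7.0e9 parameters, D = 2.0e12 pretraining tokens.
flops = estimate_training_flops(7.0e9, 2.0e12)
scaled = flops / 1e21  # ~84.0, matching the "FLOPs (1E21)" column in Table B.1
```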

Benchmark collection & evaluation

We collected a set of diverse benchmarks that assess various LM capabilities: MMLU [31], ARC-C [19], HellaSwag [99], Winogrande [72], GSM8K [20], TruthfulQA [50], XWinogrande [60], and HumanEval [16]. For MMLU, ARC-C, HellaSwag, Winogrande, GSM8K, and TruthfulQA, we primarily sourced results from the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) [9], with updates current as of May 6th, 2024. When benchmark results were missing, we followed the standardized evaluation protocols of the Open LLM Leaderboard and used the LM Eval Harness [27] library to evaluate the LMs. For XWinogrande, we used the LM Eval Harness library to evaluate the models with 5-shot examples. For HumanEval, we primarily used the EvalPlus [52] library, followed their standardized evaluation protocols, and sourced results from the EvalPlus leaderboard (https://evalplus.github.io/leaderboard.html) when available. We used the 'Base Tests' results provided by EvalPlus for all models. See Table B.1 for all collected benchmark results.

B.1.2 Instruction-Tuned Models

Model collection

We collected the set of instruction-tuned models that have been evaluated on the AgentBench [54] and AgentBoard [57] benchmarks. These include models such as GPT [62], Claude [4], Llama-2-Chat [84], CodeLlama-Instruct [71], Mistral-Instruct [39], Vicuna [17], Deepseek-LLM-Chat [10], Lemur-Chat [95], OpenChat [88], WizardLM [94], Guanaco [22], Koala [28], Dolly-v2 [21], and OpenAssistant [45]. We followed the same procedure as in Sec. B.1.1 to collect the metadata of open models; for proprietary models, this metadata is not publicly available. Note that we only counted the pretraining tokens (and the continual pretraining tokens, when applicable) for D and excluded the data for instruction-tuning or additional finetuning, as these are typically only a small fraction of the total data and are difficult to estimate due to the complexities of data curation for instruction-tuning. See Table B.2 for the collected metadata of these models.

Benchmark collection & evaluation

For instruction-tuned models, we also included standard LM evaluations such as MMLU [31], ARC-C [19], HellaSwag [99], Winogrande [72], TruthfulQA [50], GSM8K [20], and HumanEval [16], following the same protocols as in Sec. B.1.1 for evaluating open models. For proprietary models like GPT and Claude, evaluating with a unified protocol is more difficult (e.g., due to the lack of access to likelihood scores), so we collected the official results from their respective papers and documentation for all standard benchmarks (except for HumanEval, which we were able to evaluate using the EvalPlus library). Additionally, we collected Elo scores from the Chatbot Arena (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) [18], which assess the instruction-following capabilities of these instruction-tuned models (as of February 2nd, 2024). These Elo scores are included for reference only; we did not utilize this metric for our downstream predictions. See Table B.2 for all collected benchmark results.

Table B.1: Collected metadata and base evaluation metrics for base pretrained models used in Sec. 4.1, Sec. 4.3, and Sec. 5. Model names follow the HuggingFace naming. See data collection details in Sec. B.1.1.
Model Family Model Param (B) Data (T) FLOPs (1E21) MMLU ARC-C HellaSwag Winogrande TruthfulQA XWinogrande HumanEval
Llama-2 Llama-2-7b-hf 7.0 2.0 84.00 0.4380 0.5307 0.7774 0.7403 0.3898 0.7549 0.1280
Llama-2-13b-hf 13.0 2.0 156.00 0.5434 0.5811 0.8097 0.7664 0.3417 0.7868 0.1829
Llama-2-70b-hf 70.0 2.0 840.00 0.6983 0.6732 0.8733 0.8374 0.4492 0.8245 0.2988
Llama llama-7b 6.7 1.0 40.20 0.3569 0.5094 0.7781 0.7143 0.3433 0.6932 0.1280
llama-13b 13.0 1.0 78.00 0.4761 0.5614 0.8092 0.7624 0.3948 0.7304 0.1585
llama-30b 32.5 1.4 273.00 0.5845 0.6143 0.8473 0.8003 0.4227 0.7711 0.2073
llama-65b 65.2 1.4 547.68 0.6393 0.6348 0.8609 0.8256 0.4343 0.7768 0.2317
Llama-3 Meta-Llama-3-8B 8.0 15.0 720.00 0.6649 - 0.8202 0.7711 0.4395 0.8012 0.3841
Meta-Llama-3-70B 70.0 15.0 6300.00 0.7923 - 0.8798 0.8532 0.4556 0.8447 0.5244
Qwen1.5 Qwen1.5-0.5B 0.5 2.4 7.20 0.3935 0.3148 0.4905 0.5722 0.3830 0.5756 0.1159
Qwen1.5-1.8B 1.8 2.4 25.92 0.4671 0.3788 0.6142 0.6030 0.3943 0.6438 0.1829
Qwen1.5-4B 4.0 2.4 57.60 0.5652 0.4846 0.7158 0.6622 0.4727 0.6888 0.2622
Qwen1.5-7B 7.0 4.0 168.00 0.6197 0.5418 0.7851 0.7127 0.5108 0.7524 0.3476
Qwen1.5-14B 14.0 4.0 336.00 0.6936 0.5657 0.8108 0.7348 0.5206 0.7775 0.3963
Qwen1.5-32B 32.0 4.0 768.00 0.7430 0.6357 0.8500 0.8145 0.5739 0.7912 0.4207
Qwen1.5-72B 72.0 3.0 1296.00 0.7720 0.6587 0.8599 0.8303 0.5961 0.8258 0.4512
Qwen Qwen-7B 7.0 2.4 100.80 0.5984 0.5137 0.7847 0.7269 0.4779 0.7346 0.3171
Qwen-14B 14.0 3.0 252.00 0.6770 0.5828 0.8399 0.7680 0.4943 0.7915 0.3537
Qwen-72B 72.0 3.0 1296.00 0.7737 0.6519 0.8594 0.8248 0.6019 0.8287 0.3720
Mistral Mistral-7B-v0.1 7.3 - - 0.6416 0.5998 0.8331 0.7861 0.4215 0.7819 0.2744
Mixtral Mixtral-8x7B-v0.1 45.0 - - 0.7188 0.6638 0.8646 0.8169 0.4681 0.8002 0.3354
Yi  Yi-6B 6.0 3.0 108.00 0.6411 0.5555 0.7657 0.7419 0.4196 0.7239 0.1585
Yi-34B 34.0 3.0 612.00 0.7635 0.6459 0.8569 0.8303 0.5623 0.7956 0.2683
Gemma gemma-2b 2.0 6.0 72.00 0.4177 0.4838 0.7177 0.6630 0.3308 0.7093 0.2317
gemma-7b 7.0 6.0 252.00 0.6603 0.6109 0.8247 0.7845 0.4491 0.7839 0.3354
Falcon falcon-rw-1b 1.0 0.35 2.10 0.2528 0.3507 0.6356 0.6204 0.3596 0.5355 -
falcon-7b 7.0 1.5 63.00 0.2779 0.4787 0.7813 0.7238 0.3426 0.7176 -
falcon-40b 40.0 1.0 240.00 0.5698 0.6195 0.8528 0.8129 0.4172 0.7846 -
falcon-180B 180.0 3.5 3780.00 0.6959 0.6920 0.8889 0.8690 0.4516 0.8446 -
Phi phi-1_5 1.3 0.15 1.17 0.4389 0.5290 0.6379 0.7222 0.4089 0.5111 0.3415
phi-2 2.7 1.4 22.68 0.5792 0.6101 0.7492 0.7348 0.4424 0.5267 0.4939
Pythia pythia-70m-deduped 0.07 0.3 0.13 0.2526 0.2108 0.2717 0.4964 0.4751 0.5101 0.0000
pythia-160m-deduped 0.16 0.3 0.29 0.2486 0.2406 0.3139 0.5138 0.4434 0.5236 0.0000
pythia-410m-deduped 0.41 0.3 0.74 0.2599 0.2483 0.4129 0.5438 0.4095 0.5363 0.0122
pythia-1b-deduped 1.0 0.3 1.80 0.2427 0.2910 0.4965 0.5359 0.3894 0.5610 0.0427
pythia-1.4b-deduped 1.4 0.3 2.52 0.2556 0.3268 0.5496 0.5730 0.3866 0.5941 0.0427
pythia-2.8b-deduped 2.8 0.3 5.04 0.2678 0.3626 0.6066 0.6022 0.3556 0.6400 0.0488
pythia-6.9b-deduped 6.9 0.3 12.42 0.2648 0.4130 0.6705 0.6409 0.3519 0.6525 0.0854
pythia-12b-deduped 12.0 0.3 21.60 0.2563 0.4138 0.7026 0.6646 0.3300 0.6824 0.1159
BLOOM bloom-560m 0.56 0.341 1.15 0.2422 0.2474 0.3715 0.5193 0.4244 0.5786 0.0061
bloom-1b1 1.1 0.341 2.25 0.2670 0.2833 0.4278 0.5501 0.4180 0.6095 0.0000
bloom-3b 3.0 0.341 6.14 0.2659 0.3575 0.5437 0.5762 0.4057 0.6648 0.0183
bloom-7b1 7.1 0.341 14.53 0.2625 0.4113 0.6200 0.6543 0.3890 0.6977 0.0488
bloom 176.0 0.366 386.50 0.3085 0.5043 0.7641 0.7206 0.3976 0.7355 0.1220
GPT-Neo/J gpt-neo-125m 0.125 0.3 0.22 0.2597 0.2295 0.3026 0.5178 0.4558 0.5022 0.0061
gpt-neo-1.3B 1.3 0.38 2.96 0.2482 0.3123 0.4847 0.5691 0.3963 0.5611 0.0366
gpt-neo-2.7B 2.7 0.42 6.80 0.2645 0.3336 0.5624 0.6006 0.3978 0.5740 0.0671
gpt-j-6b 6.05 0.402 14.59 0.2678 0.4138 0.6754 0.6598 0.3596 0.6811 0.1159
gpt-neox-20b 20.0 0.472 56.64 0.2500 0.4573 0.7345 0.6890 0.3161 0.7163 0.1280
OPT opt-125m 0.125 0.18 0.14 0.2602 0.2287 0.3147 0.5162 0.4287 0.4987 0.0000
opt-350m 0.35 0.18 0.38 0.2602 0.2355 0.3673 0.5264 0.4083 0.5181 0.0000
opt-1.3b 1.3 0.18 1.40 0.2496 0.2952 0.5453 0.5975 0.3871 0.5440 0.0000
opt-2.7b 2.7 0.18 2.92 0.2543 0.3396 0.6143 0.6196 0.3743 0.5685 0.0000
opt-6.7b 6.7 0.18 7.24 0.2457 0.3916 0.6866 0.6598 0.3512 0.5943 0.0061
opt-13b 13.0 0.18 14.04 0.2490 0.3993 0.7120 0.6851 0.3410 0.6088 0.0061
opt-30b 30.0 0.18 32.40 0.2666 0.4326 0.7407 0.7064 0.3516 0.6264 0.0122
opt-66b 66.0 0.18 71.28 0.2699 0.4633 0.7625 0.7001 0.3543 0.6426 0.0122
MPT mpt-7b 7.0 1.0 42.00 0.2807 0.4770 0.7753 0.7214 0.3355 0.7144 0.1646
mpt-30b 30.0 1.0 180.00 0.4800 0.5597 0.8242 0.7490 0.3842 0.7453 0.2134
XGLM xglm-564M 0.564 0.5 1.69 0.2518 0.2457 0.3464 0.5225 0.4043 0.5855 0.0000
xglm-1.7B 1.7 0.5 5.10 0.2510 0.2585 0.4568 0.5391 0.3721 0.6307 0.0000
xglm-4.5B 4.5 0.5 13.50 0.2543 0.3148 0.5795 0.5493 0.3584 0.6585 0.0000
xglm-7.5B 7.5 0.5 22.50 0.2779 0.3413 0.6077 0.5872 0.3666 0.6956 0.0000
CodeLlama CodeLlama-7b-hf 7.0 2.52 105.84 0.3112 0.3993 0.6080 0.6401 0.3782 0.7297 0.3354
CodeLlama-13b-hf 13.0 2.52 196.56 0.3281 0.4087 0.6335 0.6717 0.4379 0.7349 0.3841
CodeLlama-34b-hf 34.0 2.52 514.08 0.5502 0.5410 0.7582 0.7356 0.3911 0.7861 0.4756
CodeLlama-70b-hf 70.0 3.02 1268.40 0.5967 0.5674 0.7821 0.7522 0.3979 0.7756 0.5488
StarCoder starcoderbase-1b 1.0 1.0 6.00 0.2667 0.2270 0.3431 0.4996 0.4579 0.5617 0.1460
starcoderbase-3b 3.0 1.0 18.00 0.2735 0.2585 0.3911 0.5114 0.4305 0.5976 0.1770
starcoderbase-7b 7.0 1.0 42.00 0.2845 0.2986 0.4387 0.5438 0.4046 0.5978 0.2440
starcoderbase 15.5 1.0 93.00 0.3212 0.3029 0.4721 0.5580 0.4002 0.5952 0.3410
StarCoder2 starcoder2-3b 3.0 3.3 59.40 0.3865 0.3456 0.4762 0.5454 0.4049 0.6037 0.3170
starcoder2-7b 7.0 3.7 155.40 0.4121 0.3831 0.5191 0.5919 0.4199 0.6201 0.3540
starcoder2-15b 15.0 4.3 387.00 0.5135 0.4735 0.6409 0.6385 0.3787 0.7383 0.4630
DeepSeek-Coder deepseek-coder-1.3b-base 1.3 2.0 15.60 0.2602 0.2577 0.3928 0.5272 0.4261 0.6063 0.2870
deepseek-coder-6.7b-base 6.7 2.0 80.40 0.3839 0.3703 0.5346 0.5809 0.4028 0.6789 0.4760
deepseek-coder-33b-base 33.0 2.0 396.00 0.4091 0.4249 0.5999 0.6243 0.3997 0.6961 0.5120
Table B.2: Collected metadata and base evaluation metrics for instruction-tuned models used in Sec. 4.2. Model names follow the HuggingFace naming for open models. See data collection details in Sec. B.1.2.
Model Family Model Param (B) Data (T) FLOPs (1E21) Arena-Elo MMLU ARC-C HellaSwag Winogrande TruthfulQA HumanEval
GPT gpt-4-0613 - - - 1161.6608 0.8640 0.9630 0.9530 0.8750 0.5900 0.8720
gpt-4-0314 - - - 1189.5486 0.8640 0.9630 0.9530 0.8750 0.5900 0.9024
gpt-3.5-turbo-0613 - - - 1118.1123 0.7000 0.8520 0.8550 0.8160 0.4700 0.7744
Claude claude-2.0 - - - 1132.3173 0.7850 0.9100 - - 0.6900 0.6707
claude-1.3 - - - 1149.3443 0.7700 0.9000 - - 0.6200 0.6159
claude-instant-1.1 - - - 1109.4714 0.7340 0.8570 - - 0.6600 0.5915
Llama-2-Chat llama-2-7b-chat 7.0 2.0 84.00 1024.1411 0.4706 0.5290 0.7855 0.7174 0.4557 0.1220
llama-2-13b-chat 13.0 2.0 156.00 1041.8442 0.5412 0.5904 0.8194 0.7451 0.4412 0.1829
llama-2-70b-chat 70.0 2.0 840.00 1082.0000 0.6345 0.6459 0.8588 0.8051 0.5280 0.3171
Codellama-Instruct codellama-7b-instruct 7.0 2.52 105.84 - 0.3454 0.3652 0.5544 0.6456 0.4125 0.3963
codellama-13b-instruct 13.0 2.52 196.56 - 0.3889 0.4454 0.6493 0.6803 0.4588 0.4451
codellama-34b-instruct 34.0 2.52 514.08 1043.4381 0.5462 0.5427 0.7692 0.7451 0.4444 0.4878
Mistral-Instruct mistral-7b-instruct-v0.1 7.0 - - 1006.4716 0.5539 0.5452 0.7563 0.7372 0.5628 0.3537
Vicuna vicuna-7b-v1.5 7.0 2.0 84.00 1004.9595 0.5031 0.5324 0.7739 0.7214 0.5033 0.1341
vicuna-13b-v1.5 13.0 2.0 156.00 1040.3549 0.5624 0.5657 0.8109 0.7466 0.5107 0.2134
vicuna-13b-16k 13.0 2.0 156.00 - 0.5489 0.5674 0.8037 0.7285 0.5196 0.2500
vicuna-33b-v1.3 33.0 2.0 396.00 1093.4174 0.5921 0.6160 0.8306 0.7703 0.5609 0.2134
Deepseek-LLM-Chat deepseek-llm-67b-chat 67.0 2.0 804.00 1081.7334 0.7174 0.6775 0.8680 0.8421 0.5583 0.7012
Lemur-Chat lemur-70b-chat-v1 70.0 2.09 877.80 - 0.6599 0.6698 0.8573 0.8169 0.5658 0.5915
OpenChat openchat-13b-v3.2 13.0 2.0 156.00 - 0.5668 0.5964 0.8268 0.7695 0.4449 0.2073
WizardLM wizardlm-13b-v1.2 13.0 2.0 156.00 1058.0881 0.5367 0.5904 0.8221 0.7190 0.4727 0.3902
wizardlm-30b-v1.0 30.0 3.0 540.00 - 0.5888 0.6254 0.8327 0.7751 0.5249 -
Guanaco guanaco-33b 33.0 1.4 277.20 1031.9123 0.5569 0.6246 0.8448 - 0.5122 0.2622
guanaco-65b 65.0 1.4 546.00 - 0.6251 0.6544 0.8647 0.8240 0.5281 0.2744
Koala koala-13b 13.0 1.0 78.00 965.7386 0.4501 0.5299 0.7759 0.7403 0.5023 0.1220
Dolly-v2 dolly-v2-12b 12.0 0.3 21.60 822.6771 0.2581 0.4241 0.7253 0.6085 0.3383 0.0000
OpenAssistant oasst-sft-4-pythia-12b-epoch-3.5 12.0 0.3 21.60 - 0.2682 0.4573 0.6859 0.6590 0.3781 0.0793

B.2 Downstream Evaluation
For all downstream tasks of pretrained base models included in Sec. 4.1 and Sec. 4.3, we used the LM Eval Harness [27] library to evaluate all the models. For the "emergent" capability tasks in Sec. 4.1, we applied likelihood-based evaluation [13] with 2-shot examples. For the post-training intervention tasks in Sec. 4.3, we used the same evaluation protocol as the original papers, as described in the main paper. For the agentic capability tasks of instruction-tuned models in Sec. 4.2, we directly sourced the results from the AgentBench [54] and AgentBoard [57] leaderboards and scaled the metrics to [0, 1].

B.3 PCA Analysis
PCA imputation

The PCA imputation starts with a simple mean imputation for missing values in the data matrix, and then PCA is applied to transform the data into a lower-dimensional space where the missing values are imputed by the PCA reconstruction. The above procedure is repeated until the imputed values converge or reach a maximum of 1000 iterations. By default, we used the first principal component (PC-1) to impute the missing values, as we found it to be the most robust in our preliminary experiments. Notably, when there are train and test splits, we first applied the PCA imputation procedure on the training set and then applied the same transformation to the test set to prevent any train-test leakage.
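The iterative procedure above can be sketched as follows. This is an illustrative implementation under our own naming and an SVD-based PCA, not the authors' code; the convergence tolerance is an assumption (the paper only specifies the 1000-iteration cap and the PC-1 default).

```python
import numpy as np

def pca_impute(X, n_components=1, max_iter=1000, tol=1e-6):
    """Iterative PCA imputation sketch.

    NaN entries are first mean-imputed, then repeatedly replaced by their
    low-rank PCA reconstruction until the imputed values converge (or the
    iteration cap is reached).
    """
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = col_means[np.where(mask)[1]]          # initial mean imputation

    for _ in range(max_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        delta = np.max(np.abs(X[mask] - recon[mask])) if mask.any() else 0.0
        X[mask] = recon[mask]                       # update only the missing entries
        if delta < tol:
            break
    return X
```

On exactly low-rank data this recovers missing entries almost exactly; in general it converges to a low-rank-consistent completion.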

PC extraction
When applying PCA to extract the capability measures, we extracted the top K = 3 principal components from the model-capability matrix. By default, we mean-centered the data before applying PCA without additional scaling, since most evaluation metrics are already normalized to [0, 1]. As with PCA imputation, we fitted the PCA only on the training set and applied the same transformation to the test set to prevent any train-test leakage.
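The fit-on-train, transform-on-test pattern described above can be sketched as follows, using mean-centering and an SVD in place of a library PCA; the matrices here are randomly generated placeholders, not the paper's data.

```python
import numpy as np

# Hypothetical model-by-benchmark matrices (rows = models, metrics in [0, 1]).
rng = np.random.default_rng(0)
train = rng.uniform(size=(60, 8))
test = rng.uniform(size=(17, 8))

# Fit the top K = 3 PCs on the training split only (mean-centering, no scaling)...
K = 3
mu = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
components = Vt[:K]                       # (K, 8) principal directions

# ...then apply the exact same transformation to the test split (no refitting).
S_train = (train - mu) @ components.T     # (60, 3) capability measures
S_test = (test - mu) @ components.T       # (17, 3), uses the train mean and directions
```

Refitting the mean or the directions on the test split would leak test information into the capability measures, which is what this split discipline avoids.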

Appendix C Additional Results

C.1 PC Analysis of Instruction-Tuned LMs

In Fig. C.1, we conducted a PC analysis for instruction-tuned models (see the model list in table B.2) following exactly the same procedure as Fig. 2. We find that the extracted PC measures for instruction-tuned LMs follow similar patterns as pretrained models and exhibit an even more significant low-rank structure, with the top 3 PCs explaining about 98.6% of the variance in the benchmark performance.

(a) PCA explained variance  (b) Principal component weights
Figure C.1: The extracted PC measures for instruction-tuned LMs follow similar low-rank structures and interpretable patterns as pretrained base LMs (see Fig. 2).

C.2 Properties of PC Measures

Lower-ranked PCs linearly correlate with log-compute measures

In Fig. 3, we showed that the top PC-1 linearly correlates with log-compute scale measures (log-training FLOPs) within each comparable model family. In Fig. C.2, we show that this linear correlation generally holds for lower-ranked PCs, specifically PC-2 and PC-3, though the correlation tends to decrease with lower-rank PCs compared to the top PC-1.

Aggregated PCs linearly correlate with log-compute measures

When fitting our observational scaling laws, we utilized the (hypothetical) linear relation between the aggregated PC measures P_m := β*⊤S_m and the log-compute measures log(C_m) within each model family to transform P_m into compute-equivalent scales (Eq. 8). This linear correlation has been partially validated through the linear correlation of the top PCs (Fig. 3 & Fig. C.2). Here we more directly validate this linearity by analyzing the aggregated PC measures P_m fitted on specific tasks. Specifically, in Fig. C.3, we visualize the fitted P_m on the "emergent" capability tasks (i.e., Fig. 4(b)) versus the compute measures log(C_m) within each comparable model family. We find that the aggregated PC measures generally exhibit a linear correlation with the log-compute measures within each family. Notably, the linear correlation is consistently significant for the Llama-2 family, which we have used as the default reference family for computing the equivalent scales in our experiments.
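The compute-equivalent transformation described above amounts to one linear fit and one inversion. The sketch below uses made-up illustrative values (the capability measures and FLOPs are not real data) and base-10 logs as an assumption.

```python
import numpy as np

# Illustrative values only: aggregated capability measures P for five models,
# and training FLOPs for three of them that form the reference family f.
P_all = np.array([-0.3, 0.4, 1.1, 1.8, 2.5])
P_ref = np.array([0.4, 1.1, 1.8])
flops_ref = np.array([1e22, 1e23, 1e24])

# Fit P_f = w * log10(C_f) + b within the reference family...
w, b = np.polyfit(np.log10(flops_ref), P_ref, deg=1)

# ...then invert to express every model on the f-equivalent log-FLOPs scale.
log_flops_equiv = (P_all - b) / w   # ~ [21, 22, 23, 24, 25]
```

The inversion puts heterogeneous model families on one axis: a model's capability is reported as the training compute the reference family would need to match it.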

Single benchmark metrics suffer from a limited dynamic range

In Fig. 9, we showed that PC-1 can serve as a smooth capability measure for LMs, providing meaningful readouts across many orders of scale (about 5 orders of magnitude). In Fig. C.4, we show that using a single benchmark metric as an LM capability measure may suffer from a limited dynamic range. In particular, single metrics may either saturate quickly for large models (e.g., HellaSwag, Winogrande) or provide random readouts for weak models (e.g., MMLU, GSM8K).

(a) PC-2  (b) PC-3
Figure C.2: The lower-ranked PC measures also linearly correlate with log-compute measures within each comparable model family, though the correlation decreases with lower-rank PCs.
Figure C.3: The aggregated PC measures exhibit a strong linear correlation with the log-compute measures within each comparable model family, especially for Llama-2, which we have used as the default reference family for computing the f-equivalent FLOPs in our experiments.
(a) MMLU  (b) HellaSwag  (c) Winogrande  (d) ARC-C  (e) TruthfulQA  (f) XWinogrande  (g) GSM8K  (h) HumanEval
Figure C.4: Using a single benchmark metric to measure LM capabilities may suffer from a limited dynamic range. Metrics may either saturate quickly for large models (e.g., HellaSwag, Winogrande) or provide random readouts for weak models (e.g., MMLU, GSM8K).

C.3 Robustness Checks

Number of PCs

Recall that we defaulted to use 3 PC measures for all of our prediction tasks. Here we provide additional analysis on the impact of using different numbers of PCs on the prediction performance and validate the robustness of our choice. In particular, we compare the fitted curves and prediction performance of using 1-4 PCs on all our tasks. The results are in Fig. C.5, Fig. C.6, and Fig. C.7 for post-training analysis, “emergent” capability, and agentic capability tasks, respectively. Our results indicate that using more than 2 PCs leads to better prediction performance than using compute measures like FLOPs, and using 3 PCs consistently leads to the most robust predictions across all the tasks. These validate our choice of using 3 PCs as the default number of PCs and indicate the robustness of our results to the choice of the number of PCs.
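As a concrete illustration of the PC measures used throughout these robustness checks, the sketch below extracts the top-k principal components from a model-by-benchmark score matrix and regresses a logit-transformed task metric on them. All data here are synthetic placeholders rather than the paper's actual benchmark matrix, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the benchmark matrix: rows are models, columns
# are standardized benchmark scores (MMLU, ARC-C, HellaSwag, ...).
B = rng.normal(size=(80, 8))

# PC capability measures via SVD of the centered matrix: the top-k right
# singular vectors give the loadings, and projecting onto them gives each
# model's PC scores.
B_centered = B - B.mean(axis=0)
U, S, Vt = np.linalg.svd(B_centered, full_matrices=False)
k = 3  # the default number of PCs in the experiments
pc_scores = B_centered @ Vt[:k].T  # shape: (n_models, k)

# A scaling fit is then a linear model on the PC scores, e.g. least
# squares against a logit-transformed task accuracy (placeholder here).
y = rng.normal(size=80)
X = np.column_stack([pc_scores, np.ones(80)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(pc_scores.shape, w.shape)
```

Varying `k` from 1 to 4 reproduces the kind of sweep compared in Figs. C.5-C.7.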

(a) Naive + Greedy
(b) CoT + Greedy
(c) CoT + SC + 5 Samples
Figure C.5: Comparing the prediction performance of using different numbers of PCs for observational scaling laws on the post-training analysis tasks included in Sec. 4.3. Using PC measures consistently leads to better prediction performance than using compute measures like FLOPs, with 3 PCs being the best across the different tasks.
(a) Word Unscramble
(b) Persian QA
(c) 3-Digit Subtraction
(d) 2-Digit Multiplication
Figure C.6: Comparing the prediction performance of using different numbers of PCs for observational scaling laws on different “emergent” capability tasks included in Sec. 4.1. Using 3 PCs consistently leads to the best prediction performance across different tasks.
(a) AgentBench
(b) AgentBoard
Figure C.7: Comparing the prediction performance of using different numbers of PCs for observational scaling laws on the agentic capability tasks included in Sec. 4.2. Using 2 or 3 PCs leads to the best prediction performance across different tasks.
Holdout cutoff selection

The cutoff for selecting the holdout set can have a significant impact on the prediction performance of observational scaling laws, as it determines the size of the training set, which is crucial when the entire dataset is not large (as in our case). Here we analyze how prediction performance changes with different holdout cutoffs for various predictive measures (PCs vs. compute measures) and provide a quantitative comparison that characterizes their overall prediction performance under varying cutoffs.

Specifically, we conducted the analysis on the post-training analysis tasks in Sec. 4.3 and the “emergent” capability tasks in Sec. 4.1, which have more data points (compared to the agentic capability tasks in Sec. 4.2) and thus support a more robust analysis. For each task, we vary the FLOPs cutoff to control the test-set ratio from 60% to 5% (linearly spaced), which changes the difficulty of the prediction task from harder (less training data with weaker performance) to easier (more training data with stronger performance). We then compare the test MSE of different predictive measures under different cutoffs and quantify overall prediction performance with the area under the error curve (AUE). For “emergent” capability tasks, we additionally include a variant of the cutoff strategy that holds out test data based on task accuracy, which simulates a more challenging weak-to-strong prediction scenario and provides an additional robustness check.
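The AUE summary can be computed by integrating the test-error curve over the swept cutoffs; the sketch below does this with the trapezoid rule on placeholder error curves (the numbers are illustrative, not measurements from the paper).

```python
import numpy as np

# Sweep of holdout cutoffs: fraction of data held out for testing,
# from 60% down to 5% as in the setup above.
test_ratios = np.linspace(0.60, 0.05, 12)

# Placeholder test-MSE curves for two predictive measures; in the paper
# these would be measured per cutoff.
mse_pc = 0.02 + 0.10 * test_ratios**2
mse_flops = 0.05 + 0.25 * test_ratios**2

def aue(ratios, errors):
    # Area under the error curve: integrate test error over the cutoff
    # axis (sorted so the ratios are increasing).
    order = np.argsort(ratios)
    return np.trapz(errors[order], ratios[order])

# A lower AUE summarizes lower prediction error across all cutoffs.
print(aue(test_ratios, mse_pc), aue(test_ratios, mse_flops))
```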

The results are depicted in Fig. C.8 and Fig. C.9. We observe that in most of our evaluated setups, using our PC measures (especially with 3 PCs) generally leads to an earlier transition to the low prediction error region and much lower AUE compared to using compute scales like training FLOPs and model size. This indicates that PC measures are more robust under different cutoffs and more sample-efficient for scaling analysis.

Figure C.8: Comparing different scale measures under different holdout cutoffs on post-training analysis tasks in Sec. 4.3. The training/test data size is varied by changing the FLOPs cutoff and the area under the test error curves (AUE) is used to measure the overall prediction errors. PC measures (with # = 2 or 3) consistently lead to an earlier transition to low prediction error region and much lower AUE compared to compute measures like training FLOPs and model size.
(a) Varying FLOPs cutoff
(b) Varying accuracy cutoff
Figure C.9: Comparing different scale measures under different holdout cutoffs on “emergent” capability tasks in Sec. 4.1. The training/test data size is varied by changing the FLOPs (a) or accuracy (b) cutoff and the area under the test error curves (AUE) is used to measure the overall prediction errors. In 7 out of 8 setups, PC measures (with # = 3) lead to much lower AUE compared to compute measures like training FLOPs and model size.

C.4 Emergent Capabilities

Predicting with model sizes

In Fig. C.10, we show the prediction performance when using model size as the scale measure for the “emergent” capabilities of LMs. We find that it leads to significantly worse forecasts than using training FLOPs or PC measures and poorly captures the “emergence” trend, likely because models from different families were trained on data of very different sizes and quality and may use different architectures.

Using default cutoff for arithmetic tasks

In Fig. 4, we applied a different FLOPs cutoff than the default one on arithmetic tasks to make the prediction tasks more challenging. Here, we present the results of using the default FLOPs cutoff on arithmetic tasks in Fig. C.11. We find that the default cutoff makes the prediction tasks trivial, with too many data points already close to perfect performance. Notably, using PC measures still outperforms using compute measures like model size and training FLOPs, indicating its robustness to the choice of cutoff.

Additional tasks

In Fig. C.12, we present results on additional “emergent” capability tasks included in Wei et al. [90]. As with the main tasks (Fig. 4), we used the default FLOPs cutoff for the non-arithmetic task (IPA Transliterate) and a quarter of the default cutoff for the arithmetic tasks (3-Digit Addition, 2-Digit Addition). We find that using PC measures consistently leads to the best prediction performance compared to using model size or training FLOPs. While the extrapolation does not exactly match the ground-truth trend on the IPA Transliterate task, possibly because the specific task capabilities are not well covered by our collected benchmark metrics, it still provides a reasonable forecast of the “emergence” behavior.

Figure C.10: Using model sizes gives poor predictions for the “emergent” capabilities of LMs.
(a) Model size based scaling laws
(b) Training FLOP based scaling laws
(c) Observational scaling laws
Figure C.11: Using the default FLOPs cutoff on arithmetic tasks makes the prediction tasks trivial with too many data points close to perfect performance. Observational scaling laws using PC measures (with # = 3) still outperform compute scaling laws using model size and training FLOPs.
(a) Model size based scaling laws
(b) Training FLOP based scaling laws
(c) Observational scaling laws
Figure C.12: Results on additional “emergent” capability tasks included in Wei et al. [90]. Observational scaling laws using PC measures (with # = 3) consistently lead to the best prediction performance compared to compute scaling laws using model size and training FLOPs. Although the extrapolation does not exactly match the trend of the ground truth on the IPA Transliterate task, it still provides a reasonable forecast of the “emergence” behavior.

C.5 Post-Training Method Analysis

Prediction results with different scale measures

In Fig. C.13, we show the prediction performance of different scale measures on the various prediction tasks in the post-training method analysis on GSM8K. As before, using PC measures captures the scaling trend well and consistently leads to the best prediction performance across all tasks.

(a) Model size based scaling laws
(b) Training FLOP based scaling laws
(c) Observational scaling laws
Figure C.13: Predicting the impact of post-training techniques on GSM8K with different scale measures. Observational scaling laws using PC measures (with # = 3) consistently lead to the best prediction performance across all tasks.
Results on BBH

We further validated our observational scaling laws for predicting the impact of CoT on the BigBench-Hard tasks [76], following the same setup as in Sec. 4.3. In particular, we used the default FLOPs cutoff and the same PC measures (# = 3). We normalized the prediction accuracy on each BBH task by its respective random prediction accuracy and aggregated the normalized accuracies across all tasks for prediction. The results are depicted in Fig. C.14. Surprisingly, we observe that using training FLOPs leads to reasonable predictions of LM performance with and without CoT on BBH tasks, possibly due to the denoising effect of aggregating over all tasks. Furthermore, using PC measures accurately captures the scaling trends in both setups, even where training FLOPs gives a looser fit in the “Naive” setup or fails to capture the behavior of models trained on synthetic data (Phi).
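The per-task normalization and aggregation described above can be sketched as follows. The exact normalization form (mapping random-chance accuracy to 0 and perfect accuracy to 1) is our assumption of a common choice, and the task numbers are illustrative placeholders.

```python
def normalize(acc, rand_acc):
    # Rescale accuracy so random-chance performance maps to 0 and
    # perfect performance maps to 1 (one common normalization; the
    # exact form used in the paper is an assumption here).
    return (acc - rand_acc) / (1.0 - rand_acc)

# Placeholder (accuracy, random-chance baseline) pairs for three tasks.
tasks = [(0.55, 0.50), (0.30, 0.25), (0.80, 0.25)]

# Aggregate the normalized accuracies across tasks by simple averaging.
aggregate = sum(normalize(a, r) for a, r in tasks) / len(tasks)
print(round(aggregate, 3))
```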

(a) Model size based scaling laws
(b) Training FLOP based scaling laws
(c) Observational scaling laws
Figure C.14: Predicting the impact of CoT on BBH tasks. Both using training FLOPs and PC measures leads to reasonable predictions, while PC measures accurately capture the scaling trends in both setups, even when using training FLOPs leads to less tight captures in the “Naive” setup or fails to capture the Phi model (which was trained on synthetic data) as an outlier.

C.6 Model Subset Selection

Prediction results with different numbers of models selected by V-optimality

In Fig. 7(a), we demonstrated how the prediction errors change with the number of models selected by our method. Here we present a qualitative analysis of the prediction results for different numbers of selected models in Fig. C.15. We find that with more than 8 models, the fitted scaling curves have already converged and accurately capture the scaling trend, indicating the efficiency of our method.
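For reference, a greedy variant of V-optimal model selection can be sketched as follows. This is a hypothetical illustration under the standard linear-regression reading of V-optimality (minimize the average prediction variance over all candidate points, i.e. tr(X (X_S^T X_S)^{-1} X^T)); the feature matrix is synthetic, and the small ridge term `eps` is added only for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic candidate pool: PC features plus a bias column for 40 models.
X = np.column_stack([rng.normal(size=(40, 3)), np.ones(40)])
eps = 1e-6  # ridge term so the design matrix is always invertible

def v_criterion(idx):
    # Average prediction variance of a least-squares fit on the subset
    # idx, evaluated over all candidate points (up to a constant factor).
    A = X[idx].T @ X[idx] + eps * np.eye(X.shape[1])
    return np.trace(X @ np.linalg.inv(A) @ X.T)

# Greedily add the model that most reduces the V-optimality criterion.
selected = []
for _ in range(8):
    rest = [i for i in range(len(X)) if i not in selected]
    best = min(rest, key=lambda i: v_criterion(selected + [i]))
    selected.append(best)
print(sorted(selected))
```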

(a) 4 models
(b) 8 models
(c) 12 models
(d) 20 models
Figure C.15: Prediction results with different numbers of models selected with our V-optimality criterion. The predictions have accurately captured the scaling trend with more than 8 models.
Prediction results with randomly selected models

We present the prediction results with models randomly selected from all available models in Fig. C.16, in comparison to the results with models selected by our V-optimality criterion (Fig. C.15). All of these results are produced with a fixed random seed. We find that using randomly selected models leads to much worse prediction performance, even with 20 models, demonstrating the critical need to carefully select models for effective scaling analyses.

(a) 4 models
(b) 8 models
(c) 12 models
(d) 20 models
Figure C.16: Prediction results with different numbers of randomly selected models. The prediction performance is much worse than our selection method, even when 20 models are being selected.

C.7 Fitted Functional Forms for Preregistration of Predictions

In Table C.1, we include the functional forms of the scaling laws fitted in our experiments. These functional forms serve as a preregistration of our predictions for future models, which will be used to test the generalizability of our scaling analysis to unseen models.

Table C.1: The functional forms of the fitted scaling laws included in our paper are preregistered as predictions for future models. Each functional form is presented as the logit of the normalized accuracy metric, $\phi^{-1}(Y,h)=\sigma^{-1}\left(\left(Y-(1-h)\right)/h\right)=X$, which is equivalent to Eq. 6. Each benchmark metric is scaled to lie within the range $[0,1]$.
“Emergent” capabilities (Sec. 4.1)

Word Unscramble:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=2.00\log_{10}(\bar{C}_{\text{Llama-2}})-6.11\\
&=6.74\,\text{PC1}-3.22\,\text{PC2}-1.37\,\text{PC3}-4.93\\
&=1.02\,\text{MMLU}+3.02\,\text{ARC-C}+5.73\,\text{HellaSwag}+2.44\,\text{Winograd}-1.06\,\text{TruthfulQA}\\
&\quad+1.21\,\text{GSM8K}+2.48\,\text{XWinograd}-0.08\,\text{HumanEval}-12.28
\end{aligned}$

Persian QA:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=2.32\log_{10}(\bar{C}_{\text{Llama-2}})-8.43\\
&=2.86\,\text{PC1}+3.18\,\text{PC2}-0.19\,\text{PC3}-5.26\\
&=2.08\,\text{MMLU}+1.06\,\text{ARC-C}+1.13\,\text{HellaSwag}+0.53\,\text{Winograd}+0.36\,\text{TruthfulQA}\\
&\quad+2.89\,\text{GSM8K}+0.66\,\text{XWinograd}+1.55\,\text{HumanEval}-7.98
\end{aligned}$

3-Digit Subtraction:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=5.50\log_{10}(\bar{C}_{\text{Llama-2}})-8.92\\
&=5.98\,\text{PC1}+8.74\,\text{PC2}+39.55\,\text{PC3}-4.68\\
&=2.17\,\text{MMLU}+2.32\,\text{ARC-C}-3.44\,\text{HellaSwag}-7.96\,\text{Winograd}+0.65\,\text{TruthfulQA}\\
&\quad+34.27\,\text{XWinograd}+20.39\,\text{HumanEval}-20.99
\end{aligned}$

2-Digit Multiplication:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=2.22\log_{10}(\bar{C}_{\text{Llama-2}})-4.45\\
&=3.60\,\text{PC1}+4.24\,\text{PC2}+8.05\,\text{PC3}-2.68\\
&=1.62\,\text{MMLU}+1.95\,\text{ARC-C}+0.55\,\text{HellaSwag}-0.63\,\text{Winograd}+0.14\,\text{TruthfulQA}\\
&\quad+6.80\,\text{XWinograd}+6.52\,\text{HumanEval}-8.00
\end{aligned}$

Agentic capabilities (Sec. 4.2)

AgentBench:
$\begin{aligned}
\phi^{-1}(Y,0.99)&=\log_{10}(\bar{C}_{\text{Llama-2-Chat}})-5.52\\
&=2.32\,\text{PC1}+0.79\,\text{PC2}-2.82\,\text{PC3}-2.96\\
&=2.34\,\text{MMLU}+0.82\,\text{ARC-C}+0.32\,\text{HellaSwag}+0.54\,\text{Winogrande}+0.60\,\text{TruthfulQA}\\
&\quad-0.42\,\text{GSM8K}+2.63\,\text{HumanEval}-6.37
\end{aligned}$

AgentBoard:
$\begin{aligned}
\phi^{-1}(Y,0.97)&=0.98\log_{10}(\bar{C}_{\text{Llama-2-Chat}})-6.60\\
&=3.02\,\text{PC1}+2.60\,\text{PC2}+1.17\,\text{PC3}-2.98\\
&=-0.10\,\text{MMLU}-0.31\,\text{ARC-C}-0.55\,\text{HellaSwag}+0.14\,\text{Winogrande}+0.56\,\text{TruthfulQA}\\
&\quad+2.28\,\text{GSM8K}+3.36\,\text{HumanEval}-5.06
\end{aligned}$

Post-training analysis (Sec. 4.3)

GSM Naive + Greedy:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=1.09\log_{10}(\bar{C}_{\text{Llama-2}})-4.77\\
&=2.69\,\text{PC1}+1.55\,\text{PC2}-0.36\,\text{PC3}-3.57\\
&=1.53\,\text{MMLU}+1.30\,\text{ARC-C}+1.22\,\text{HellaSwag}+0.75\,\text{Winograd}+0.16\,\text{TruthfulQA}\\
&\quad+0.13\,\text{XWinograd}+1.92\,\text{HumanEval}-5.97
\end{aligned}$

GSM CoT + Greedy:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=2.04\log_{10}(\bar{C}_{\text{Llama-2}})-5.53\\
&=2.56\,\text{PC1}+4.64\,\text{PC2}+4.21\,\text{PC3}-2.50\\
&=5.03\,\text{MMLU}+2.04\,\text{ARC-C}-0.10\,\text{HellaSwag}+0.96\,\text{Winograd}+1.75\,\text{TruthfulQA}\\
&\quad-2.39\,\text{XWinograd}+2.58\,\text{HumanEval}-4.77
\end{aligned}$

GSM CoT + SC:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=2.19\log_{10}(\bar{C}_{\text{Llama-2}})-5.74\\
&=2.73\,\text{PC1}+4.82\,\text{PC2}+4.95\,\text{PC3}-2.49\\
&=5.58\,\text{MMLU}+2.27\,\text{ARC-C}-0.08\,\text{HellaSwag}+1.11\,\text{Winograd}+1.97\,\text{TruthfulQA}\\
&\quad-2.78\,\text{XWinograd}+2.45\,\text{HumanEval}-4.95
\end{aligned}$

BBH Naive + Greedy:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=1.26\log_{10}(\bar{C}_{\text{Llama-2}})-4.87\\
&=2.70\,\text{PC1}+3.06\,\text{PC2}-0.84\,\text{PC3}-3.23\\
&=1.41\,\text{MMLU}+1.05\,\text{ARC-C}+0.75\,\text{HellaSwag}+0.36\,\text{Winograd}+0.11\,\text{TruthfulQA}\\
&\quad+0.61\,\text{XWinograd}+3.63\,\text{HumanEval}-5.47
\end{aligned}$

BBH CoT + Greedy:
$\begin{aligned}
\phi^{-1}(Y,1.00)&=1.62\log_{10}(\bar{C}_{\text{Llama-2}})-5.03\\
&=4.20\,\text{PC1}+3.81\,\text{PC2}-2.92\,\text{PC3}-3.12\\
&=0.84\,\text{MMLU}+1.30\,\text{ARC-C}+1.57\,\text{HellaSwag}+0.42\,\text{Winograd}-0.44\,\text{TruthfulQA}\\
&\quad+1.96\,\text{XWinograd}+5.62\,\text{HumanEval}-6.61
\end{aligned}$
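To make the preregistered forms in Table C.1 executable, the sketch below inverts the logit link: given the fitted linear predictor $X$ and ceiling $h$, the predicted normalized accuracy is $Y=(1-h)+h\,\sigma(X)$. The example plugs in the compute-based Word Unscramble row; the log-compute input value is an arbitrary illustration, not a measured model.

```python
import math

def predict_accuracy(X, h):
    # Invert the logit link phi^{-1}(Y, h) = X,
    # i.e. Y = (1 - h) + h * sigmoid(X).
    return (1.0 - h) + h / (1.0 + math.exp(-X))

# Word Unscramble row (compute-based form): X = 2.00 * log10(C) - 6.11,
# where C is the Llama-2-equivalent compute measure. The value of logC
# below is a made-up illustration.
logC = 3.5
X = 2.00 * logC - 6.11
print(round(predict_accuracy(X, 1.00), 3))
```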