Apple Intelligence Foundation Language Models
Abstract
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute [Apple, 2024b]. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the models, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how its principles are applied throughout the model development.
1 Introduction
At the 2024 Worldwide Developers Conference, we introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia.
Apple Intelligence consists of multiple highly-capable generative models that are fast, efficient, specialized for our users' everyday tasks, and can adapt on the fly for their current activity. The foundation models built into Apple Intelligence have been fine-tuned for user experiences such as writing and refining text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.
Figure 1: Modeling overview for the Apple foundation models.
In this report we will detail how two of these models, AFM-on-device (AFM stands for Apple Foundation Model), a ~3 billion parameter language model, and AFM-server, a larger server-based language model, have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly (Figure 1). These two foundation models are part of a larger family of generative models created by Apple to support users and developers; this includes a coding model (based on an AFM language model) to build intelligence into Xcode, as well as a diffusion model to help users express themselves visually, for example, in the Messages app.
Apple Intelligence is designed with Apple's core values at every step and built on a foundation of industry-leading privacy protection. Additionally, we have created Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them:
Empower users with intelligent tools: We identify areas where AI can be used responsibly to create tools for addressing specific user needs. We respect how our users choose to use these tools to accomplish their goals.
Represent our users: We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating stereotypes and systemic biases across our AI tools and models.
Design with care: We take precautions at every stage of our process, including design, model training, feature development, and quality evaluation to identify how our AI tools may be misused or lead to potential harm. We will continuously and proactively improve our AI tools with the help of user feedback.
Protect privacy: We protect our users' privacy with powerful on-device processing and groundbreaking infrastructure like Private Cloud Compute. We do not use our users' private personal data or user interactions when training our foundation models.
These principles are reflected at every stage of the architecture that enables Apple Intelligence and connects features and tools with specialized models.
In the remainder of this report, we provide details on decisions such as: how we develop models that are highly capable, fast, and power-efficient; how we approach training these models; how our adapters are fine-tuned for specific user needs; and how we evaluate model performance for both helpfulness and unintended harm.
2 Architecture
The AFM base models are dense decoder-only models that build on the Transformer architecture [Vaswani et al., 2017], with the following design choices (a minimal sketch of some of these primitives follows the list):
A shared input/output embedding matrix [Press and Wolf, 2016] to reduce memory usage for parameters.
Pre-Normalization [Nguyen and Salazar, 2019] with RMSNorm [Zhang and Sennrich, 2019] for training stability.
Query/key normalization [Wortsman et al., 2023] to improve training stability.
Grouped-query attention (GQA) [Ainslie et al., 2023] with 8 key-value heads to reduce the KV-cache memory footprint.
The SwiGLU activation [Shazeer, 2020] for higher efficiency.
RoPE [Su et al., 2024] positional embeddings with the base frequency set to 500k for long-context support.
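To make these choices concrete, below is a minimal NumPy sketch of three of the listed primitives: RMSNorm pre-normalization, a SwiGLU feed-forward block, and RoPE inverse frequencies with a 500k base. The dimensions are illustrative placeholders, not the actual AFM sizes.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-normalization with RMSNorm: scale activations by their root-mean-square.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return weight * x / rms

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU(x W_gate) * (x W_up), projected back down.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU activation
    return (silu * (x @ w_up)) @ w_down

def rope_frequencies(head_dim, base=500_000.0):
    # RoPE inverse frequencies; a 500k base supports long-context positions.
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

# Illustrative (non-AFM) dimensions.
d_model, d_ff, head_dim = 512, 1536, 64
x = rms_norm(np.random.randn(4, d_model), np.ones(d_model))
y = swiglu_ffn(x,
               np.random.randn(d_model, d_ff),
               np.random.randn(d_model, d_ff),
               np.random.randn(d_ff, d_model))
freqs = rope_frequencies(head_dim)
```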
Table 1 provides some details about AFM-on-device dimensions.
3 Pre-training
Our AFM pre-training process plays a critical role in developing highly capable language models to power a host of Apple Intelligence features that can help and support users. We focus on efficiency and data quality at every step in order to pre-train for a high-quality end-to-end user experience with efficient and low-latency models.
3.1 Data
The AFM pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot [Apple, 2024a]. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.
Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mixture. Additionally, extensive efforts have been made to exclude profanity, unsafe material, and personally identifiable information from publicly available data (see Section 7 for more details). Rigorous decontamination is also performed against many common evaluation benchmarks.
We find that data quality, much more so than quantity, is the key determining factor of downstream model performance. In the following, we provide more details about key components of the data mixture.
3.1.1 Web pages
We crawl publicly available information using our web crawler, Applebot [Apple, 2024a], and respect the rights of web publishers to opt out of Applebot using standard robots.txt directives. In addition, we take steps to exclude pages containing profanity and apply filters to remove certain categories of personally identifiable information (PII). The remaining documents are then processed by a pipeline that performs quality filtering and plain-text extraction, more specifically:
Body extraction is performed using a combination of Safari's reader mode and the Boilerpipe [Kohlschütter et al., 2010] algorithm.
Safety and profanity filtering, using heuristics and model-based classifiers.
Global fuzzy de-duplication using locality-sensitive n-gram hashing.
Extensive quality filtering using heuristics and model-based classifiers [Kong et al., 2024; Li et al., 2024a].
Decontamination against 811 common pre-training benchmarks, filtering entire documents upon 4-13 gram collisions with any of the benchmark datasets, unless the collision count for a given n-gram reaches a "common-usage" threshold of 1000 (a sketch of this rule follows the list).
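As a rough illustration of the decontamination rule in the last item, the sketch below flags a document if it shares any 4-13 gram with a benchmark document, unless that n-gram occurs often enough to be treated as common usage. The helper names and the exact counting strategy are assumptions; the production pipeline is more involved.

```python
from collections import Counter

COMMON_USAGE_THRESHOLD = 1000  # n-grams seen this often are treated as generic text

def ngrams(tokens, n_min=4, n_max=13):
    # Yield all n-grams of the document for n in [n_min, n_max].
    for n in range(n_min, min(n_max, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_benchmark_index(benchmark_docs):
    # Count, per 4-13 gram, how many benchmark documents contain it.
    counts = Counter()
    for doc in benchmark_docs:
        counts.update(set(ngrams(doc.split())))
    return counts

def is_contaminated(document, benchmark_index):
    # A collision with a rare benchmark n-gram flags the whole document;
    # very frequent n-grams are assumed to be common usage and ignored.
    for gram in ngrams(document.split()):
        hits = benchmark_index.get(gram, 0)
        if 0 < hits < COMMON_USAGE_THRESHOLD:
            return True
    return False
```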
3.1.2 Licensed datasets
We go to great lengths to identify and license a limited amount of high-quality data from publishers. These licensed datasets provide a natural source of diverse and high quality long-context data, so we include them as part of the data mixture for the continued and context-lengthening stages of pre-training (see Sections 3.2.2 and 3.2.3 for more details). We decontaminate publisher licensed data the same way we decontaminate web pages (Section 3.1.1).
3.1.3 Code
Code data is obtained from license-filtered open source repositories on GitHub. The bulk of the code data covers 14 common programming languages, including Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go. The data is de-duplicated, further filtered for PII and quality, and decontaminated in the same fashion as in Section 3.1.1.
Footnote: Using MIT, Apache, BSD, CC0, CC-BY, Unlicensed, ISC, and Artistic licenses.
3.1.4 Math
We integrate two categories of high-quality data sourced from the web. The first category is a Math Q&A dataset, comprising 3 billion tokens from 20 web domains rich in math content. We extract the questions and answers by identifying relevant tags from HTML pages. The second category is a collection of 14 billion tokens from web pages such as math forums, blogs, tutorials, and seminars. To filter these web pages, we used a specialized pipeline that includes a math tag filter with a collection of 40 strings to identify mathematical templates, a math symbol filter with a collection of 350 Unicode and LaTeX symbols to identify math content, a quality filter powered by a language model classifier specifically designed for math [Kong et al., 2024], and a domain filter that processes all web pages from domains manually labeled by humans. We applied these filters, followed by deduplication, decontamination, and PII removal, to produce the final dataset.
3.1.5 Public datasets
We evaluated and selected a number of high-quality publicly-available datasets with licenses that permit use for training language models. Then, we filtered the datasets to remove personally identifiable information before including them in the pre-training mixture.
3.1.6 Tokenizer
We use a byte-pair encoding (BPE) tokenizer, following the implementation from SentencePiece. All numbers are split into individual digits and we use byte-fallback to decompose unknown UTF-8 characters into byte tokens. We do not enable Unicode normalization. The total vocabulary size is 100k and 49k tokens for AFM-server and AFM-on-device, respectively.
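A hypothetical SentencePiece training invocation reflecting these choices (BPE, digit splitting, byte fallback, no Unicode normalization) might look like the following; the file paths and the exact flag set used for AFM are assumptions.

```python
import sentencepiece as spm

# Illustrative configuration only; input path and auxiliary flags are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",           # hypothetical training text
    model_prefix="afm_like_bpe",
    model_type="bpe",                    # byte-pair encoding
    vocab_size=49_000,                   # on-device-sized vocabulary (server uses ~100k)
    split_digits=True,                   # numbers split into individual digits
    byte_fallback=True,                  # unknown UTF-8 characters decompose into byte tokens
    normalization_rule_name="identity",  # no Unicode normalization
)

sp = spm.SentencePieceProcessor(model_file="afm_like_bpe.model")
print(sp.encode("Price: 12345", out_type=str))
```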
3.2 Recipe
We break AFM pre-training into three distinct stages: (1) core, which consumes most of the compute budget; (2) continued, where we down-weight the lower-quality bulk web-crawl data in favor of a higher code and math weight, combined with inclusion of the licensed data described in Section 3.1.2; and (3) context-lengthening, which is similar to another continued pre-training stage, but conducted at a longer sequence length and with synthetic long-context data included in the mixture.
Details about model quality after each of the three pre-training stages (alongside additional metrics for AFM derived from our internal benchmark implementations) are in Appendix C, and Appendix D examines AFM-server's long-context capabilities.
All three stages use decoupled weight decay [Loshchilov and Hutter, 2019] for regularization, as well as a simplified version of µParam [Yang et al., 2022], similar to what is described as µParam (simple) in [Wortsman et al., 2023]. Thus far we have not found more sophisticated parameter-norm controls to be necessary at these scales. All stages maintain sharded model and optimizer states in float32, casting to bfloat16 for the forward and backward passes for efficiency.
3.2.1 Core pre-training
AFM-server core training is conducted from scratch, while AFM-on-device is distilled and pruned from a larger model.
AFM-server: We train AFM-server from scratch for 6.3T tokens on 8192 TPUv4 chips, using a sequence length of 4096 and a batch size of 4096 sequences. The batch size was determined using a scaling-law fit to model size and compute budget; however, we find that downstream results are relatively insensitive to a fairly wide range of batch sizes, and expect that values near the predicted batch size would have yielded similar results (the predicted optimum differed, but 4096 allowed for better chip utilization). We perform a learning rate sweep using a proxy model with a model dimension of 768, and from the range of near-optimal learning rates we choose 0.01 to be conservative. Linear layers have their effective learning rate scaled according to µParam (simple).
We use a tuned decoupled weight decay value, finding that it works well across all tested model sizes and compute budgets. The learning rate schedule includes a linear warm-up for 5000 steps, followed by cosine decay to 0.005 of the peak over the remainder of training. For further details on the optimizer, see Section 3.2.4. Appendix A compares the AFM core pre-training recipe to a more typical configuration.
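A minimal sketch of the schedule described above, assuming linear warm-up followed by cosine decay to a fixed fraction of the peak; the total step count is a placeholder.

```python
import math

def afm_like_lr(step, total_steps, peak_lr=0.01, warmup_steps=5000, final_ratio=0.005):
    # Linear warm-up to the peak, then cosine decay to final_ratio * peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# Schedule values at a few points of a hypothetical 1M-step run.
for s in [0, 2500, 5000, 500_000, 1_000_000]:
    print(s, round(afm_like_lr(s, 1_000_000), 5))
```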
AFM-on-device: For the on-device model, we found that knowledge distillation [Hinton et al., 2015] and structural pruning are effective ways to improve model performance and training efficiency. These two methods are complementary to each other and work in different ways. More specifically, before training AFM-on-device, we initialize it from a pruned 6.4B model (trained from scratch using the same recipe as AFM-server), using pruning masks that are learned through a method similar to what is described in [Wang et al., 2020; Xia et al., 2023]. The key differences are: (1) we only prune the hidden dimension in the feed-forward layers; (2) we use Soft-Top-K masking [Lei et al., 2023] instead of HardConcrete masking [Louizos et al., 2018]; (3) we employ the same pre-training data mixture as the core phase to learn the mask, training for 188B tokens. Then, during the core pre-training of AFM-on-device, a distillation loss is used by replacing the target labels with a convex combination of the true labels and the teacher model's top-1 predictions (with 0.9 weight assigned to the teacher's labels), training for a full 6.3T tokens. We observe that initializing from a pruned model improves both data efficiency and the final benchmark results, whilst adding distillation further boosts MMLU and GSM8K. More detailed ablation results can be found in Appendix B. All training hyper-parameters except for the batch size are kept the same as for AFM-server.
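The distillation target described above can be sketched as follows: the one-hot ground-truth distribution is mixed with a one-hot distribution over the teacher's top-1 prediction, with 0.9 weight on the teacher. Vocabulary size and inputs are illustrative.

```python
import numpy as np

def distillation_targets(true_ids, teacher_logits, teacher_weight=0.9):
    # Convex combination of one-hot true labels and one-hot teacher top-1 predictions.
    vocab = teacher_logits.shape[-1]
    true_onehot = np.eye(vocab)[true_ids]
    teacher_onehot = np.eye(vocab)[teacher_logits.argmax(axis=-1)]
    return teacher_weight * teacher_onehot + (1.0 - teacher_weight) * true_onehot

def cross_entropy(student_logits, targets):
    # Soft-label cross-entropy of the student against the mixed targets.
    log_probs = student_logits - np.log(np.exp(student_logits).sum(-1, keepdims=True))
    return -(targets * log_probs).sum(-1).mean()

# Tiny illustrative example with an 8-token vocabulary and 3 positions.
rng = np.random.default_rng(0)
targets = distillation_targets(np.array([1, 4, 2]), rng.normal(size=(3, 8)))
loss = cross_entropy(rng.normal(size=(3, 8)), targets)
```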
3.2.2 Continued pre-training
For both models we perform continued pre-training at a sequence length of 8192, with another 1T tokens from a mixture that up-weights math and code and down-weights the bulk web-crawl. We also include the licensed data described in Section 3.1.2. Differently from core pre-training, we use a different peak learning rate and decoupled weight decay, with 1000 warm-up steps and a final learning rate decay to 0.001 of peak. Other settings (batch size, etc.) are carried over. Unlike in core pre-training, we did not find a distillation loss to be helpful here for AFM-on-device, so the recipe is identical to that used for AFM-server.
3.2.3 Context lengthening
Finally, we conduct a further 100B tokens of continued pre-training at a sequence length of 32768 tokens, using the data mixture from the continued pre-training stage augmented with synthetic long-context Q&A data. We also increase the RoPE base frequency from 500k to 6315089, following the scaling laws described in [Liu et al., 2024], with the expectation that this will allow for better short-to-long generalization, which is desirable given that the majority of our pre-training data is comprised of documents that are significantly shorter than 32k tokens long. The recipe is similar to that used for continued pre-training. We examine the long-context performance of AFM-server in Appendix D.
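For concreteness, the base-frequency change amounts to recomputing the RoPE inverse frequencies with a larger base, which slows the rotation of every dimension so that distant positions remain distinguishable. The head dimension below is illustrative.

```python
import numpy as np

def rope_angles(position, head_dim, base):
    # Rotation angles of a single position under a given RoPE base frequency.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return position * inv_freq

# Slowest-rotating dimension at position 32768 under both bases (illustrative head_dim).
short_base, long_base, head_dim = 500_000, 6_315_089, 128
print(rope_angles(32_768, head_dim, short_base)[-1])
print(rope_angles(32_768, head_dim, long_base)[-1])   # smaller angle with the larger base
```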
3.2.4 Optimizer
We choose to use a variant of RMSProp [Hinton, 2012] with momentum for AFM pre-training. In particular, we divide the raw gradient by the square root of a bias-corrected exponential moving average of the squared gradient to produce an instantaneous update, which is clipped to a maximum norm of 1.0 per parameter block, before further smoothing this estimate with an exponential moving average (without bias correction) to produce the net update. Unless otherwise noted, the smoothing constants for both the squared gradient and the update are set to 0.95. A small constant is added to the instantaneous squared gradient prior to smoothing, for numerical stability.
The smoothed updates are scaled by the learning rate, weight-decay is added, and then scheduled decay is applied to form the final weight delta. As an additional guard for stability, prior to the optimizer we clip the global gradient norm to 1.0. For a recipe ablation against a more typical configuration, see Appendix A.
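A single-parameter-block sketch of the update rule described above, assuming an EMA-style momentum and placeholder values for the epsilon and weight-decay constants; the real implementation operates on sharded float32 optimizer state and applies the global gradient-norm clip beforehand.

```python
import numpy as np

class RMSPropMomentumLike:
    """Illustrative sketch of the described RMSProp-with-momentum variant (one parameter block)."""

    def __init__(self, shape, beta2=0.95, beta1=0.95, eps=1e-30, clip_norm=1.0):
        self.v = np.zeros(shape)   # EMA of squared gradient (bias-corrected at use time)
        self.m = np.zeros(shape)   # EMA of the clipped instantaneous update (no bias correction)
        self.beta2, self.beta1, self.eps, self.clip_norm = beta2, beta1, eps, clip_norm
        self.t = 0

    def step(self, param, grad, lr, weight_decay, schedule=1.0):
        self.t += 1
        # Small constant added to the instantaneous squared gradient before smoothing.
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grad * grad + self.eps)
        v_hat = self.v / (1 - self.beta2 ** self.t)                # bias correction
        update = grad / np.sqrt(v_hat)                             # instantaneous update
        norm = np.linalg.norm(update)
        if norm > self.clip_norm:                                  # clip per parameter block
            update *= self.clip_norm / norm
        self.m = self.beta1 * self.m + (1 - self.beta1) * update   # smoothing, no bias correction
        delta = schedule * (lr * self.m + weight_decay * param)    # decoupled weight decay, scheduled decay
        return param - delta

# Minimal usage example.
opt = RMSPropMomentumLike(shape=(10,))
w = np.zeros(10)
w = opt.step(w, np.random.randn(10), lr=0.01, weight_decay=1e-4)
```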
3.3 Training infrastructure
The AFM models are pre-trained on v4 and v5p Cloud TPU clusters with the AXLearn framework [Apple, 2023], a JAX [Bradbury et al., 2018] based deep learning library designed for the public cloud. Training is conducted using a combination of tensor, fully-sharded-data-parallel, and sequence parallelism, allowing training to scale to a large number of model parameters and sequence lengths at high utilization. This system allows us to train the AFM models efficiently and scalably, including AFM-on-device, AFM-server, and larger models.
AFM-server was trained on 8192 TPUv4 chips provisioned as multiple chip slices, where slices are connected together by the data-center network (DCN) [Chowdhery et al., 2022]. Only data-parallelism crosses the slice boundary; other types of state sharding are within-slice only, as the within-slice interconnect bandwidth is orders of magnitude higher than the DCN. This training run sustained high model-flop-utilization (MFU). AFM-on-device was trained on one slice of 2048 TPUv5p chips.
4 Post-Training
While Apple Intelligence features are powered through adapters on top of the base model (see Section 5 for a deep-dive on the adapter architecture), empirically we found that improving the general-purpose post-training lifts the performance of all features, as the models have stronger capabilities on instruction following, reasoning, and writing.
We conduct extensive research in post-training methods to instill general-purpose instruction following and conversation capabilities into the pre-trained AFM models. Our goal is to ensure these model capabilities are aligned with Apple's core values and principles, including our commitment to protecting user privacy, and our Responsible AI principles. Our post-training efforts include a series of data collection and generation, instruction tuning, and alignment innovations. Our post-training process contains two stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). We present two new post-training algorithms: (1) a rejection sampling fine-tuning algorithm with teacher committee (iTeC), and (2) an RLHF algorithm with mirror descent policy optimization and a leave-one-out advantage estimator (MDLOO), which we use in our reinforcement learning iterations and which lead to significant model quality improvements.
4.1 Data
We use a hybrid data strategy in our post-training pipeline, which consists of both human annotated and synthetic data. Throughout our data collection and experiment process, we have found data quality to be the key to model success and thus have conducted extensive data curation and filtering procedures.
4.1.1 Human annotations
Demonstration data  To fuel the instruction fine-tuning of AFM, we collect high-quality human annotated demonstration datasets from various sources. This dialogue-style data consists of both system-level and task-level instructions (a.k.a. prompts), as well as their corresponding responses. Similar to [Zhou et al., 2024], we observe quality to weigh more heavily than quantity in our experiments. As a result, we focus on key data quality criteria including helpfulness, harmlessness, presentation, and response accuracy, in addition to targeting a diverse task distribution covering Apple Intelligence features. To protect user privacy, we take steps to verify no personally identifiable information is present in our data, and we do not include any personal data stored by users with Apple.
Human preference feedback  To iteratively improve AFM's capabilities, we further collect human feedback for reinforcement learning. In particular, we instruct human annotators to compare and rank two model responses for the same prompt to collect side-by-side preference labels. In addition, we also use single-sided questions to guide this process. These questions ask raters to grade the model response quality along various aspects including instruction following, safety, factuality, and presentation, and we retain these labels for model training. We emphasize Apple values and standards in the process. Similar to demonstration data, we find data quality to be crucial for feedback data, and thus we iterate data and model qualities jointly to improve them in a unified flywheel.
4.1.2 Synthetic data
In addition to human annotations, we delve into enhancing data quality and diversity through synthetic data generation. Our findings suggest that when guided by our robust reward models, AFMs are capable of generating high quality responses, and for some specific domains these responses are found to be on par with, or even superior to, human annotations. Therefore, we extend our prompt set to increase diversity and find that the generated responses can benefit AFMs themselves. In the following, we discuss three domains where we generate synthetic data for AFM post-training: mathematics, tool use, and coding.
Mathematics  In the field of mathematics, the wide range of subjects and difficulty levels makes collecting human demonstrations exceptionally resource-intensive, since it requires expert knowledge from the human writers. It also becomes impractical to rely solely on human-written content as the model continuously improves. As a consequence, exploring the potential of synthetic data becomes essential to effectively address these challenges.
The creation of synthetic data for mathematics involves two primary stages: generating synthetic math problems and producing their corresponding solutions. For math problem synthesis, we employ several "evolution" strategies where a seed set of prompts is transformed into a much larger set of diverse prompts:
Problem rephrase and reversion. Following the approach in [Yu et al., 2023], we prompt AFM to rephrase seed math questions, and curate reverse questions to derive a specific number in a raw problem statement when provided with the final answer.
Problem evolution. Inspired by the instruction evolving technique [Xu et al., 2023], given a seed problem set we prompt AFM to generate two distinct sets of math problems: an in-depth set and an in-breadth set. The in-depth evolution enhances instructions by adding complexity, while the in-breadth evolution improves topic coverage. For both sets, we first perform de-duplication with an embedding model, and subsequently prompt LLMs to ensure the coherence and solvability of the math problems. In addition, a difficulty level is assigned and we only select math problems that score above a specified threshold.
With an augmented set of math questions, we then prompt AFM to synthesize responses with chain-of-thought for each question. If the initial seed data has ground truth, it can be used as an "outcome reward signal" to filter synthesized answers. For problems that require fewer reasoning steps, we observe that a correct final answer often gets associated with correct intermediate steps. If direct answer checking is unsuccessful or ground truth is unavailable, we instead assess the response correctness by querying an LLM judge. We find that the filtered answers, when fed into the training data, boost our models' math capabilities by a large margin.
Tool use  We develop tool-use capabilities such as function calling, code interpretation, and browsing through a mixture of synthetic and human data. The model capabilities are first bootstrapped with synthetic data, which focuses on single-tool use cases. We then collect human annotations to improve model capabilities on multi-tool and multi-step scenarios. We further augment the human-curated function call data by mixing the oracle tool with other similar tools to increase the difficulty of tool selection. In addition, we synthesize parallel function call data from the human-curated function call data to enable this new capability, and tool intent detection data based on human-curated function call and general SFT data to mitigate tool-call over-triggering issues.
Coding  The generation of a synthetic coding dataset involves a self-instruct method with rejection sampling. This approach enables the model to learn and generate data autonomously. Starting with 71 different programming topics as the seeds, the model is prompted to generate an initial pool of coding interview-like questions. For each question, the model generates a set of unit tests and a number of potential solutions. We then use an execution-based rejection sampling method to select the best solution. This involves compiling each potential solution with every unit test and executing them. The solution with the highest number of successful executions is chosen. This results in a collection of (question, test cases, solution) triplets. Finally, we validate the quality of the dataset by filtering the triplets using the number of passed unit tests, resulting in 12K high-quality triplets used in SFT.
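A toy sketch of the execution-based selection step: each candidate solution is executed against the generated unit tests, and the candidate passing the most tests is kept. Running exec on untrusted model output here is purely illustrative; a real pipeline would sandbox execution.

```python
def run_tests(solution_code, test_snippets):
    # Count how many unit-test snippets execute without raising, against this solution.
    passed = 0
    for test in test_snippets:
        namespace = {}
        try:
            exec(solution_code, namespace)   # illustrative only: sandbox this in practice
            exec(test, namespace)
            passed += 1
        except Exception:
            pass
    return passed

def select_best_solution(candidates, test_snippets, min_passed=1):
    # Execution-based rejection sampling: keep the candidate passing the most tests.
    scored = [(run_tests(code, test_snippets), code) for code in candidates]
    best_score, best_code = max(scored, key=lambda pair: pair[0])
    return (best_code, best_score) if best_score >= min_passed else (None, 0)

# Tiny example: two candidate implementations of add(), one of them wrong.
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
candidates = ["def add(a, b):\n    return a + b", "def add(a, b):\n    return a - b"]
print(select_best_solution(candidates, tests))
```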
4.2 Supervised fine-tuning (SFT)
It has been shown [Chung et al., 2024] that scaling multi-task instruction tuning dramatically enhances model performance on a wide variety of tasks. Similarly, we scale supervised fine-tuning data to achieve a strong base model for subsequent alignment. During SFT, we collect demonstration data consisting of prompt-response pairs and train the model on it. We carefully select and combine both human data and synthetic data to form a high quality mixture that covers various natural language use cases.
Data selection  We establish a series of quality guards on the data before onboarding it for model training, including ratings from in-house human labelers, automatic model-based filtering techniques, and de-duplication with text embeddings. We also scale up the mixture size with a variety of synthetic data generation methods, as described in Section 4.1.2, and with rejection sampling as described in Section 4.3.2.
Tuning the mixture ratio  In order to tune the mixture weights, we treat it as an optimization problem. Specifically, given a set of weights (w1, ..., wn), where wi represents the ratio of a specific component in the mixture, we train a model with those weights and evaluate the quality change on a set of benchmarks. We find that extensively running such experiments can effectively identify the best mixture and remove the least impactful data components.
Training hyperparameters  The model is trained with a constant learning rate (chosen separately for AFM-server and AFM-on-device) and a dropout rate of 0.1. Since the evaluation metrics fluctuate across different checkpoints, we run checkpoint selection based on automatic evaluation benchmarks and best-of-N selection with reward models to test the headroom for RL.
4.3 Reinforcement learning from human feedback (RLHF)
We further use reinforcement learning with collected human preference data to improve model performance and quality. This involves training a robust reward model and applying it in the two algorithms, iTeC and MDLOO, that we discuss below. We describe more details of our RLHF pipeline in Appendix E.
4.3.1 Reward modeling
We train reward models using the human preference data collected with the method in Section 4.1.1. Each human preference data item contains one prompt and two responses along with human labels including:
The preferred response between the two, and the preference level, i.e., whether the preferred response is significantly better, better, slightly better, or negligibly better than the rejected response.
The single-sided grading of each response, measuring the instruction following, conciseness, truthfulness, and harmlessness of each response.
Our reward model training follows the standard practice of reward modeling in RLHF with two main innovations:
We design a soft label loss function that takes the level of human preference into account.
We incorporate single-sided gradings as regularization terms in reward modeling.
We employ the commonly used Bradley-Terry-Luce (BTL) model [Bradley and Terry, 1952] for reward modeling in RLHF. In this model, the probability that a human annotator prefers one response over another is modeled as the sigmoid function of the difference of the rewards. Our soft label loss function encourages this probability to be high when the preference level is high, e.g., when one response is significantly better than the other, and vice versa. We note that this is different from the margin-based loss function in Llama 2 [Touvron et al., 2023], which also leverages the preference level. Empirically, we find that our method works better than the margin-based loss function. Moreover, we also find that using the single-sided gradings as regularization terms can effectively improve the accuracy of the reward model. More details of our reward modeling techniques can be found in Section E.1.
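One plausible instantiation of the soft-label loss and the single-sided regularization is sketched below; the specific target probabilities, the regularization form, and its weight are assumptions rather than the production recipe.

```python
import numpy as np

# Hypothetical mapping from annotated preference level to a soft target probability.
SOFT_TARGETS = {"significantly_better": 0.95, "better": 0.85,
                "slightly_better": 0.70, "negligibly_better": 0.55}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_label_btl_loss(r_chosen, r_rejected, level):
    # Bradley-Terry-Luce: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    # Instead of a hard 0/1 label, train toward a soft target tied to the preference level.
    p = sigmoid(r_chosen - r_rejected)
    t = SOFT_TARGETS[level]
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def single_sided_regularizer(predicted_grades, annotated_grades):
    # Illustrative regularization: squared error between predicted per-response gradings
    # (instruction following, conciseness, truthfulness, harmlessness) and the labels.
    return float(np.mean((np.asarray(predicted_grades) - np.asarray(annotated_grades)) ** 2))

loss = soft_label_btl_loss(1.8, 0.9, "slightly_better") \
       + 0.1 * single_sided_regularizer([0.8, 0.6, 0.9, 1.0], [1.0, 0.5, 1.0, 1.0])
```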
4.3.2 Iterative teaching committee (iTeC)
To fully unlock the ability of our model with multiple rounds of RLHF, we propose a novel iterative RLHF framework which effectively combines various preference optimization algorithms, including rejection sampling (RS), Direct Preference Optimization (DPO) [Rafailov et al., 2024] and its variants such as IPO [Azar et al., 2024], and online reinforcement learning (RL). This enables us to bring the benefit of RLHF to AFM models across all sizes and improve their alignment at the same time.
Iterative committee  One of the most important lessons we learned from developing AFM RLHF is to refresh online human preference data collection using a diverse set of the best performing models. Specifically, for each batch of human preference data collection, we set up a collection of the latest promising models trained with SFT, RS, DPO/IPO, and RL, as well as the best models from previous iterations, which we refer to as the "model committee". We collect pairwise human preference labels on responses sampled from the latest model committee.
After acquiring each batch of human preference data, we refresh our reward model, and further train a new set of models using the collection of preference optimization algorithms. We then continue the next round of iterative RLHF data collection with a new model committee.
Committee distillation  We further run rejection sampling (distillation) from the model committee with the latest reward model as a reranker. Instead of reranking at the global level, i.e., picking a single best performing model from the committee and using it as a teacher model, we rerank model responses at the prompt level. Specifically, for each prompt, we sample multiple responses from each model in the committee, and use the latest reward model to select the best response for each prompt. This allows us to combine the advantages of models trained by different preference optimization algorithms. For instance, we find that algorithms that leverage negative examples, e.g., online RLHF, DPO, IPO, are better at improving reasoning skills such as math, while rejection sampling fine-tuning learns instruction following and writing skills more effectively.
Scaling up distillation  In order to bring the RLHF improvements to AFM models across all sizes, we scale up distillation from the model committee. Different from larger models, where carefully iterating data and model quality matters much more than data quantity, we find that smaller models can achieve tremendous improvements when we scale up the number of prompts used for distillation. Our final AFM-on-device model is trained on more than 1M high-quality responses generated from the model committee.
4.3.3 Online RLHF algorithm: MDLOO
In this section, we introduce our online reinforcement learning algorithm MDLOO, where we decode responses during model training and apply RL algorithms to maximize the reward.
We use the commonly adopted RLHF objective that maximizes the KL-penalized reward function [Ouyang et al., 2022]:

$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\!\left[r_{\varphi}(x,y)\right]\;-\;\beta\,\mathbb{E}_{x\sim\mathcal{D}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid x)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)\right]\tag{1}$$

where $\mathcal{D}$ is the prompt distribution, $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence between two distributions, and $\beta$ is the coefficient that controls the divergence between the behavior policy $\pi_{\theta}$ and a reference policy $\pi_{\mathrm{ref}}$, which is typically a model trained by SFT. In our RL training, we use the reward function

$$R(x,y)\;=\;r_{\varphi}(x,y)\;-\;\beta\,\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},\tag{2}$$

whose expectation is equivalent to Equation 1. We consider the bandit setting where the generation of the entire response is considered as one action, and we do not use a value network (a.k.a. the critic) to obtain the per-token reward or advantage.
Similar to commonly used RLHF algorithms such as PPO [Schulman et al., 2017], we use a trust-region based policy iteration algorithm. We made two main design choices in our online RL algorithm:
We use the Leave-One-Out (LOO) estimator to estimate the advantage of a prompt-response pair, similar to a recent work [Ahmadian et al., 2024].
We use Mirror Descent Policy Optimization (MDPO) [Tomar et al., 2020] to optimize the policy, differently from the more commonly used clipping-based PPO method.
Thus, we name our online RL algorithm Mirror Descent with Leave-One-Out estimation (MDLOO). More specifically, during the decoding stage of the algorithm, we decode multiple responses for each prompt, and assign the advantage of each response to be the difference between the reward of the (prompt, response) pair and the mean reward of the other responses generated for the same prompt. Intuitively, this estimator aims to measure how much better a response is compared to a typical response. Empirically, we find this advantage estimator crucial for stabilizing the RL algorithm and achieving strong results. Moreover, we use a KL-regularization-based trust region method, i.e. MDPO, to control the policy change in each iteration. We find that this algorithm is more effective than PPO in our setting. More details of our online RLHF algorithm can be found in Section E.2.
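A sketch of the leave-one-out advantage computation on top of the KL-penalized reward of Equation 2; the reward-model scores, log-probabilities, and β below are illustrative numbers.

```python
import numpy as np

def kl_penalized_rewards(rm_scores, logp_policy, logp_ref, beta):
    # R(x, y) = r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)), per Equation 2.
    return np.asarray(rm_scores) - beta * (np.asarray(logp_policy) - np.asarray(logp_ref))

def leave_one_out_advantages(rewards):
    # Advantage of each response = its reward minus the mean reward of the *other*
    # responses sampled for the same prompt.
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    total = rewards.sum()
    return rewards - (total - rewards) / (k - 1)

# Four sampled responses for one prompt (illustrative numbers).
rewards = kl_penalized_rewards(rm_scores=[1.2, 0.4, 0.9, -0.1],
                               logp_policy=[-35.0, -40.0, -33.0, -50.0],
                               logp_ref=[-36.0, -39.0, -34.0, -48.0],
                               beta=0.05)
print(leave_one_out_advantages(rewards))
```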
5 Powering Apple Intelligence features
Our foundation models are designed for Apple Intelligence, the personal intelligence system integrated into supported models of iPhone, iPad, and Mac. We have built these models to be fast and efficient. And while we have achieved impressive levels of broad capability in our base model, the actual relevant measure of its quality is how it performs on specific tasks across our operating systems.
Here we have found that we can elevate the performance of even small models to best-in-class performance through task-specific fine-tuning, and have developed an architecture, based on runtime-swappable adapters, to enable the single foundation model to be specialized for dozens of such tasks. A high-level overview is presented in Figure 2.
Figure 2: Architecture of Apple Intelligence with adapters for the language on-device and server models and the image models. In this report we are only describing the text models.
5.1 Adapter architecture
Our foundation models are fine-tuned for users' everyday activities, and can dynamically specialize themselves on-the-fly for the task at hand. We use LoRA [Hu et al., 2021] adapters, small neural network modules that can be plugged into various layers of the base model, to fine-tune our models for specific tasks. For each task, we adapt all of AFM's linear projection matrices in the self-attention layers and the fully connected layers in the pointwise feed-forward networks. By fine-tuning only the adapters, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge of the model while tailoring the adapters to support specific tasks.
We represent the values of the adapter parameters using 16 bits, and for the ~3 billion parameter on-device model, the parameters for a rank 16 adapter typically require tens of megabytes. The adapter models can be dynamically loaded, temporarily cached in memory, and swapped, giving our foundation model the ability to specialize itself on the fly for the task at hand while efficiently managing memory and guaranteeing the operating system's responsiveness.
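A minimal sketch of a rank-16 LoRA adapter applied to a frozen linear projection, together with a back-of-the-envelope size estimate in 16-bit precision; the layer dimensions are placeholders rather than AFM's.

```python
import numpy as np

class LoRALinear:
    """Frozen base projection W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.weight = weight                        # frozen base parameters
        self.A = rng.normal(0, 0.02, (rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))            # trainable, zero-init so the adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# Usage on a toy projection.
layer = LoRALinear(np.random.default_rng(1).normal(size=(64, 64)))
out = layer(np.ones((2, 64)))

# Rough adapter-size estimate in 16-bit for one hypothetical 4096x4096 projection.
d, rank = 4096, 16
bytes_per_layer = (d * rank + rank * d) * 2   # A and B, 2 bytes per parameter
print(f"{bytes_per_layer / 1e6:.2f} MB per adapted projection (illustrative)")
```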
To facilitate the training of the adapters, we created an efficient infrastructure that allows us to rapidly add, retrain, test, and deploy adapters when the base model or the training data gets updated or new capabilities are required. It is worth noting that the adapter parameters are initialized using the accuracy-recovery adapters introduced in Section 5.2.
5.2 Optimizations
The AFM models are designed to support our users throughout their daily activities, and both inference latency and power efficiency are important for the overall user experience. We apply various optimization techniques to allow AFM to be efficiently deployed on-device and in Private Cloud Compute. These techniques significantly reduce memory, latency, and power usage while maintaining the overall model quality.
In order to fit AFM into the constrained memory budget of edge devices and reduce inference cost, it is critical to apply model quantization techniques to reduce the effective bits per weight while maintaining the model quality. Previous works have found that 4-bit quantized models only have marginal loss of quality (typically measured in pre-training metrics) compared to the original full-precision floating-point versions. Since AFM is expected to support a diverse set of product features, it is essential that the quantized model retains capabilities in specific domains critical to these use cases. To achieve an optimal trade-off between model capacity and inference performance, we have developed state-of-the-art quantization methods and a framework that utilizes accuracy-recovery adapters. This allows us to achieve near-lossless quantization that is on average less than 4 bits per weight, and provides flexible quantization scheme choices.
Methods  The model is compressed and quantized to, on average, under 4 bits per weight after the post-training stages (details of the quantization scheme will be discussed later). The quantized model often shows a moderate level of quality loss. Therefore, instead of directly passing the quantized model to application teams for feature development, we attach a set of parameter-efficient LoRA adapters for quality recovery. We make sure that the training recipes for these LoRA adapters are consistent with the pre-training and post-training processes. Then, products fine-tune their own feature-specific LoRA adapters by initializing the adapter weights from the accuracy-recovery adapters, while keeping the quantized base model frozen.
It is noteworthy that training accuracy-recovery adapters is sample-efficient and can be considered as a mini-version of training the base model. During the pre-training stage of the adapters, we only require approximately 10 billion tokens (a small fraction of base model training) to fully recover the capacity of the quantized model. Since application adapters fine-tune from these accuracy-recovery adapters, they do not incur any additional memory usage or inference costs. Regarding adapter size, we found that an adapter rank of 16 offers the optimal tradeoff between model capacity and inference performance. However, to provide flexibility for various use cases, we provide a suite of accuracy-recovery adapters in different ranks for application teams to select from. In Appendix F, we provide detailed evaluation results across unquantized, quantized, and accuracy-recovered models and show that the recovered models perform much closer to the unquantized version.
Quantization schemes  Another benefit brought by accuracy-recovery adapters is that they allow more flexible choices of quantization schemes. Previously, when quantizing LLMs, one typically groups the weights into small blocks, normalizes each block by the corresponding maximal absolute value to filter out outliers, then applies quantization algorithms on a per-block basis. While a larger block size yields lower effective bits per weight and higher throughput, the quantization loss increases. In order to balance this tradeoff, it is common to set the block size to a small value, like 64 or 32. In our experiments, we found that accuracy-recovery adapters can greatly improve the Pareto frontier of this tradeoff: more errors are recovered for more aggressive quantization schemes. As a result, we are able to use a highly-efficient quantization scheme for AFM without worrying about losing model capacity. Specifically, our AFM-on-device model running on the Apple Neural Engine (ANE) uses palettization: for projection weights, every 16 columns/rows share the same quantization constants (i.e., lookup tables) and are quantized using K-means with 16 unique values (4-bit). The quantization block size can be up to 100k. Besides, since AFM's embedding layer is shared between the input and output, it is implemented differently from the projection layers on ANE. Hence, we quantize the embedding using per-channel quantization with 8-bit integers for better efficiency.
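A toy sketch of 4-bit palettization for one group of weights: a 16-entry lookup table is fit with K-means over the scalar weights, and each weight is replaced by a 4-bit index into that table. The group shape and the simple Lloyd iteration are illustrative; production block sizes are far larger.

```python
import numpy as np

def kmeans_1d(values, k=16, iters=25):
    # Simple Lloyd's algorithm over scalar weights: returns a k-entry lookup table.
    centroids = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def palettize_group(weights, bits=4):
    # Palettization: each weight becomes an index into a 2**bits lookup table
    # shared by the whole group (e.g., 16 columns/rows of a projection matrix).
    flat = weights.ravel()
    table = kmeans_1d(flat, k=2 ** bits)
    indices = np.abs(flat[:, None] - table[None, :]).argmin(axis=1).astype(np.uint8)
    dequantized = table[indices].reshape(weights.shape)
    return table, indices, dequantized

rng = np.random.default_rng(0)
group = rng.normal(scale=0.02, size=(16, 256))    # illustrative group of projection weights
table, idx, deq = palettize_group(group)
print("max abs error:", float(np.abs(group - deq).max()))
```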
Mixed-precision quantization  Residual connections exist in every transformer block and every layer in AFM, so it is unlikely that all layers have equal importance. Following this intuition, we further reduce the memory usage by pushing some layers to use 2-bit quantization (the default is 4-bit). On average, AFM-on-device can be compressed to only about 3.5 bits per weight (bpw) without significant quality loss. We choose to use 3.7 bpw in production as it already meets the memory requirements.
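As a back-of-the-envelope check on the quoted averages, assuming only 2-bit and 4-bit layers and ignoring lookup-table and embedding overheads: if a fraction f of the weights is stored at 2 bits and the rest at 4 bits, the average is 4 - 2f bits per weight.

```python
def average_bpw(fraction_2bit, low_bits=2, high_bits=4):
    # Average bits per weight for a 2-bit / 4-bit mix (table overheads ignored).
    return fraction_2bit * low_bits + (1 - fraction_2bit) * high_bits

print(average_bpw(0.25))  # 3.5 bpw: a quarter of the weights at 2-bit
print(average_bpw(0.15))  # 3.7 bpw: roughly 15% of the weights at 2-bit
```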
Interactive model analysis  We use an interactive model latency and power analysis tool, Talaria [Hohman et al., 2024], to better guide the bit rate selection for each operation.
More discussions  The usage of a quantized model with LoRA adapters looks conceptually similar to QLoRA [Dettmers et al., 2024]. While QLoRA was designed to save computational resources during fine-tuning, our focus is on the ability to switch between different LoRA adapters to efficiently support high performance across various specific use cases. Before feature-specific fine-tuning, we first train accuracy-recovery adapters on the same pre-training and post-training data, which is critical to preserve the model quality. The accuracy-recovery framework can be combined with different quantization techniques, like GPTQ [Frantar et al., 2022] and AWQ [Lin et al., 2024], since it does not depend directly on the quantization method itself. The feature adapters described in Section 5 are initialized from these accuracy-recovery adapters.
5.3 Case study: summarization
We use the AFM-on-device model to power summarization features. We worked with our design teams to create specifications for summaries of Emails, Messages, and Notifications.
While AFM-on-device is good at general summarization, we find it difficult to elicit summaries that strictly conform to the specification. Therefore, we fine tune a LoRA adapter on top of the quantized AFM-on-device for summarization. The adapter is initialized from the accuracy-recovery adapter as described in Section 5.2. We use a data mixture consisting of input payloads covering Emails, Messages, and Notifications. These payloads include public dataset, vendor data, and internally generated and submitted examples. All the data have been approved to use for production. Vendor data and internally generated data have been anonymized to remove the user information. Given these payloads, we generated synthetic summaries using AFM-server according to product's requirements. These payloads and summaries are used for training. 尽管 AFM-on-device 擅长于一般性总结,但我们发现很难得到严格符合规范的总结。因此,我们在量化的 AFM-on-device 上微调了一个 LoRA 适配器来进行总结。该适配器是从第 5.2 节中描述的精度恢复适配器初始化的。我们使用了一个由电子邮件、消息和通知的输入有效载荷组成的数据混合。这些有效载荷包括公共数据集、供应商数据以及内部生成和提交的示例。所有数据都已获得批准用于生产。供应商数据和内部生成的数据已经过匿名化处理,以去除用户信息。给定这些有效载荷,我们根据产品要求使用 AFM-server 生成了合成摘要。这些有效载荷和摘要用于训练。
Synthetic summaries We use AFM-server to generate synthetic summaries. We apply a series of rule-based filters followed by model-based filters. Rule-based filters rely on heuristics such as length constraints, formatting constraints, point of view, voice, etc. Model-based filters are used to screen more challenging problems such as entailment. Our synthetic data pipeline allows us to efficiently generate a large amount of training data and filter it down by an order of magnitude to retain high-quality examples for fine-tuning.
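A minimal sketch of such a two-stage filter is shown below; the specific rules, the `entailment_score` stand-in, and the threshold are hypothetical placeholders for the heuristics and the model-based entailment check described above.

```python
def rule_based_ok(payload, summary, max_words=40):
    words = summary.split()
    if len(words) > max_words:                 # length constraint
        return False
    if summary.strip().endswith("?"):          # formatting: a summary should not be a question
        return False
    if " I " in f" {summary} ":                # point of view: avoid first person
        return False
    return True

def entailment_score(payload, summary):
    # Placeholder for the model-based entailment filter; a real pipeline would call a model.
    return 1.0 if summary.lower().rstrip(".") in payload.lower() else 0.5

def filter_examples(examples, threshold=0.9):
    return [(p, s) for p, s in examples
            if rule_based_ok(p, s) and entailment_score(p, s) >= threshold]

examples = [
    ("Reminder: the team meeting moved to 3pm on Friday.",
     "The team meeting moved to 3pm on Friday."),
    ("Lunch?", "Are you free for lunch?"),      # dropped: summary is phrased as a question
]
print(len(filter_examples(examples)))           # -> 1
```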
Prompt injection We find that AFM-on-device is prone to following instructions or answering questions that are present in the input content instead of summarizing it. To mitigate this issue, we use heuristics to identify a large set of examples containing such content, use AFM-server (which does not exhibit this behavior) to generate their summaries, and add this synthetic dataset to the fine-tuning data mixture.
6 Evaluation
We evaluate the AFM models on pre-training (Section 6.1), post-training (Section 6.2), and, most importantly, feature-specific (Section 6.3) benchmarks.
6.1 Pre-training evaluation
In this section we present common few-shot pre-training evaluation metrics. While these benchmarks are useful for tracking our progress on pre-training, we found that human evaluations of the post-trained models (Section 6.2) and feature adapters (Section 6.3) correlate more closely with end-to-end user experience.
We evaluate AFM pre-trained models with common open-source evaluation harnesses and benchmarks. Table 2 presents the results of AFM-on-device and AFM-server on HELM MMLU v1.5.0 [Liang et al., 2023], which tests 5-shot multiple-choice question answering across 57 subjects. Also see Table 3 and Table 4 for the results of AFM-server on a subset of the HuggingFace OpenLLM leaderboard V1 [Huggingface, 2024] and the HELM-Lite v1.5.0 benchmark suite [Stanford, 2024], respectively. These benchmarks show that the AFM pre-trained models possess strong language and reasoning capabilities and provide a solid foundation for post-training and feature fine-tuning.
Table 3: A subset of Open LLM Leaderboard [Huggingface, 2024] V1 evaluation results.
6.2 Post-training evaluation
We evaluate post-trained models on comprehensive benchmarks and compare AFM models with various open-source models, as well as GPT-3.5 and GPT-4. All results reported in this section are obtained using the AFM-on-device and AFM-server base models without any adapter, in bfloat16 precision. In this section, we first present human evaluation results that measure the AFM models' general capabilities, and then present results for several specific capabilities and domains.
Benchmark                     AFM-server
Narrative QA                  77.5
Natural Questions (open)      73.8
Natural Questions (closed)    43.1
Openbook QA                   89.6
MMLU                          67.2
MATH-CoT                      55.4
GSM8K                         72.3
LegalBench                    67.9
MedQA                         64.4
WMT 2014                      18.6

Table 4: HELM-Lite v1.5.0 [Stanford, 2024] pre-training evaluation results. N.B. Many benchmarks (e.g., MMLU) differ significantly from commonly used settings.
6.2.1 Human evaluation
Human evaluation simulates practical use cases and user feedback, and so often serves as the gold standard for language model evaluation. Consequently, we conduct extensive human evaluations both while developing the model and to evaluate its final form. We collect sets of evaluation prompts to test the model on different aspects, including both general capabilities and safety. For each prompt, two model responses are presented to human raters anonymously for side-by-side comparisons. Depending on the nature of the evaluation, a detailed guideline containing grading principles and examples of single-response ratings and side-by-side preference ratings is provided to human raters to ensure consistent grading standards and evaluation quality. Each pair of model responses is graded by multiple graders and their ratings are aggregated for final results. Overall, we find human evaluation to align better with user experience and provide a better evaluation signal than some academic benchmarks that use LLMs as graders. In this section, we present the results for human evaluation on general capabilities, and the safety evaluation results are provided in Section 7.6.
We collect a comprehensive set of 1393 prompts to evaluate the general model capabilities. These prompts are diverse across different difficulty levels and cover major categories including: analytical reasoning, brainstorming, chatbot, classification, closed question answering, coding, extraction, mathematical reasoning, open question answering, rewriting, safety, summarization, and writing. To prevent overfitting, when preparing training data, we conduct decontamination against our evaluation prompts.
In Figure 3, we compare AFM with both open-source models (Phi-3, Gemma-1.1, Llama-3, Mistral, DBRX-Instruct) and commercial models (GPT-3.5 and GPT-4). AFM models are preferred by human graders over competitor models. In particular, AFM-on-device obtains a win rate of when compared to Phi-3-mini despite being smaller in model size, and even outperforms the strong open-source baselines Gemma-7B and Mistral-7B, which have more than twice as many parameters. When compared to closed-source models, AFM-server achieves competitive performance, scoring a win rate of more than and a tie rate of against GPT-3.5.
Figure 3: Side-by-side evaluation of AFM-on-device and AFM-server against comparable models. We find that our models are often preferred over competitor models by human graders.
6.2.2 Instruction following
Instruction following (IF) is the core capability we desire of language models, as real-world prompts are often sophisticated and contain complex instructions. We emphasize the importance of instruction following in both our RLHF data collection and human evaluation. In this subsection, we evaluate our models' IF skills using automated benchmarks.
In Figure 4 we evaluate AFM-on-device and AFM-server on the public IFEval benchmark [Zhou et al., 2023]. This benchmark measures a language model's capability to generate responses that precisely follow instructions in the prompt. The instructions in this benchmark typically include requirements on the response length, format, content, etc. We find that AFM-on-device and AFM-server achieve superior performance on both instruction-level and prompt-level accuracy. In addition, we also benchmark AFM models on the AlpacaEval 2.0 LC benchmark [Dubois et al., 2024] to measure general instruction-following capability, and the results suggest that our models are highly competitive.
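To illustrate the difference between instruction-level and prompt-level accuracy, here is a small sketch in the spirit of IFEval's verifiable instructions; the two checker functions and the example responses are simplified stand-ins for the benchmark's real checkers.

```python
# Each prompt carries one or more verifiable instructions; a prompt only counts as
# correct at the prompt level if every instruction attached to it is satisfied.
def check_max_words(response, limit):
    return len(response.split()) <= limit

def check_contains_keyword(response, keyword):
    return keyword.lower() in response.lower()

examples = [  # (response, list of (checker, argument)) -- illustrative only
    ("Here are three tips: hydrate, sleep, exercise.",
     [(check_max_words, 20), (check_contains_keyword, "tips")]),
    ("A very long answer " * 20,
     [(check_max_words, 30), (check_contains_keyword, "answer")]),
]

inst_total = inst_pass = prompt_pass = 0
for response, instructions in examples:
    results = [fn(response, arg) for fn, arg in instructions]
    inst_total += len(results)
    inst_pass += sum(results)
    prompt_pass += all(results)

print(f"instruction-level accuracy: {inst_pass / inst_total:.2f}")   # 0.75
print(f"prompt-level accuracy: {prompt_pass / len(examples):.2f}")   # 0.50
```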
Figure 4: Instruction-following capability (measured with IFEval) for AFM models and relevant comparison models (higher is better). The AlpacaEval 2.0 LC results for Mistral 7B, Llama3 8B, Llama3 70B, DBRX-Instruct, and Mixtral 8x22B are obtained from the AlpacaEval leaderboard [Taori et al., 2023]. The Arena Hard results for comparison models are from the Arena-Hard-Auto leaderboard [Li et al., 2024b]. All other results are from our own evaluations.
6.2.3 Tool use
In tool use applications, given a user request and a list of potential tools with descriptions, the model can choose to issue tool calls by providing a structured output specifying the name and parameter values of the tools to call. We expect the tool descriptions to follow the OpenAPI specification.
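The sketch below illustrates this interface: a tool described with a JSON-schema-style parameter block (in the spirit of the OpenAPI specification) and a structured call that names the tool and its parameter values. The tool, field names, and dispatch helper are hypothetical and not AFM's actual format.

```python
import json

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"},
                       "unit": {"type": "string", "enum": ["C", "F"]}},
        "required": ["city"],
    },
}]

# A structured tool call a model might emit for "What's the weather in Cupertino?"
model_output = '{"tool": "get_weather", "arguments": {"city": "Cupertino", "unit": "C"}}'

def dispatch(output, registry):
    call = json.loads(output)                  # parse the structured tool call
    return registry[call["tool"]](**call["arguments"])

registry = {"get_weather": lambda city, unit="C": f"21 degrees {unit} and sunny in {city}"}
print(dispatch(model_output, registry))
```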
We evaluate on the public Berkeley Function Calling Leaderboard benchmarks [Patil et al., 2023] via native support of function calling, using the AST metrics.
As shown in Figure 5, AFM-server achieves the best overall accuracy, outperforming Gemini-1.5-Pro-Preview-0514 and GPT-4.
Figure 5: Berkeley Function Calling Leaderboard benchmark evaluation results on the Function Calling API, alongside relevant sampled comparisons. Numbers were collected from the Gorilla leaderboard [Patil et al., 2023].
6.2.4 Writing
Writing is one of the most critical abilities for large language models, as it empowers various downstream use cases such as changing tone, rewriting, and summarization. However, assessing writing quality is a non-trivial task, and it is not well covered by the public benchmarks above.
We evaluate AFM's writing ability on our internal summarization and composition benchmarks, consisting of a variety of writing instructions. Following LLM-as-a-judge [Zheng et al., 2024], we design a grading instruction for each summarization and composition task, and prompt GPT-4 Turbo to assign a score from 1 to 10 to model responses. We note that there are certain limitations and biases associated with using an LLM as a grader, such as length bias.
\footnotetext{https://github.com/OAI/OpenAPI-Specification}
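A minimal sketch of this grading setup is shown below; the prompt wording, the "Score:" reply format, and the `call_judge` stub are assumptions standing in for the actual grading instructions and the GPT-4 Turbo call.

```python
import re

def build_grading_prompt(grading_instruction, writing_instruction, response):
    return (
        f"{grading_instruction}\n\n"
        f"Writing instruction:\n{writing_instruction}\n\n"
        f"Model response:\n{response}\n\n"
        "Rate the response on a scale of 1 to 10. Reply as 'Score: <n>'."
    )

def parse_score(judge_reply):
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(match.group(1)) if match else None

def call_judge(prompt):
    # Placeholder for an API call to the judge model.
    return "Score: 8. The summary is concise and covers the key points."

prompt = build_grading_prompt(
    "Grade the summary for coverage, faithfulness, and brevity.",
    "Summarize the email in one sentence.",
    "The team meeting moved to Friday at 3pm.",
)
print(parse_score(call_judge(prompt)))   # -> 8
```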
We compare AFM with some of the strongest models, along with smaller-scale open-source models. As shown in Figure 6, AFM-on-device achieves comparable or superior performance compared to Gemma-7B and Mistral-7B. AFM-server significantly outperforms DBRX-Instruct and GPT-3.5 and is comparable to GPT-4.
Figure 6: Writing ability on internal summarization and composition benchmarks (higher is better) for AFM-on-device and AFM-server alongside relevant sampled comparisons. We find that our models perform better than or similarly to related models.
6.2.5 Math
In Figure 7, we compare the post-trained AFM models' performance on math benchmarks including GSM8K [Cobbe et al., 2021] and MATH [Hendrycks et al., 2021]. We use an 8-shot chain-of-thought (CoT) [Wei et al., 2022] prompt for GSM8K and a 4-shot CoT prompt [Lewkowycz et al., 2022] for MATH. We conduct all evaluations using an internal automated evaluation pipeline. We find that AFM-on-device significantly outperforms Mistral-7B and Gemma-7B, even though it is less than half their size.
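For concreteness, the sketch below shows how an n-shot chain-of-thought prompt of this kind can be assembled; the exemplars are invented, and an 8-shot GSM8K prompt would simply concatenate eight such worked examples before the test question.

```python
EXEMPLARS = [
    {"question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
     "reasoning": "He starts with 3 and adds 2, so 3 + 2 = 5.",
     "answer": "5"},
    {"question": "A book costs $12 and a pen costs $3. What is the total cost?",
     "reasoning": "12 + 3 = 15.",
     "answer": "15"},
]

def build_cot_prompt(test_question, exemplars, n_shots=2):
    parts = []
    for ex in exemplars[:n_shots]:
        # each exemplar shows the reasoning chain followed by the final answer
        parts.append(f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}.\n")
    parts.append(f"Q: {test_question}\nA:")
    return "\n".join(parts)

print(build_cot_prompt("Sara reads 4 pages a day for 6 days. How many pages does she read?",
                       EXEMPLARS))
```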
Figure 7: Math benchmarks for AFM-on-device and AFM-server alongside relevant sampled comparisons. GSM8K is 8-shot and MATH is 4-shot. All results are collected with an internal automated evaluation pipeline.
6.3 Summarization feature evaluation
The product team specifications for summarizing Emails, Messages, and Notifications necessitated a tailor-made set of guidelines, metrics, and specialized graders to evaluate summarization quality against various open-source, licensed, and proprietary datasets.
Datasets. We carefully sampled a rich set of payloads for each use case. These evaluation datasets emphasize the diverse set of inputs that our product features are likely to face in production, and include a stratified mixture of single and stacked documents of varying content types and lengths. We developed a pipeline to build evaluation datasets that simulate real user inputs.
Graders. We enlisted a pool of highly trained, full-time, Apple-employed human graders with specialized writing and comprehension skills to evaluate summarization quality. To qualify for grading projects, each grader must pass a series of eligibility and training steps, which include a required bachelor's degree in a writing-related discipline, customized training sessions, and consistently high performance against internal grading quality benchmarks.
Grading guidelines. During the evaluation task, graders are presented with a specification for the summary, the original input content, and the output summary. Graders assess the summary on each of the following sub-dimensions of quality using a 3-point scale ("good", "neutral", or "poor"):
Composition: Evaluates the overall readability of the summary considering grammar, punctuation, spelling, and brevity.
Comprehensiveness: Evaluates how comprehensive the summary is in capturing the essential points or calling out any actions/conclusions for the user.
Groundedness: Evaluates how grounded the summary is with respect to the original payload. Summaries that are not completely grounded may contain details that are exaggerated, inferred, inaccurate, or hallucinated.
Following instructions: Evaluates whether the summary meets specific style and formatting requirements. Requirements are tailored to each feature and reflect specific product and design expectations.
Harmfulness: Evaluates whether the summary contains content that is harmful or unsafe according to Apple's safety taxonomy.
A summary is classified as "poor" if any of the sub-dimensions is "poor" according to predefined product specifications. Likewise, a summary is "good" only if all sub-dimensions are good. These classifications are used to compute "Good/Poor Result Ratio" metrics, defined as the percentage of good/poor summaries out of all summaries.
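A minimal sketch of this classification and ratio computation follows; the sample ratings are invented for illustration.

```python
DIMENSIONS = ["composition", "comprehensiveness", "groundedness",
              "following_instructions", "harmfulness"]

def classify(ratings):
    # "poor" if any sub-dimension is poor; "good" only if every sub-dimension is good
    if any(ratings[d] == "poor" for d in DIMENSIONS):
        return "poor"
    if all(ratings[d] == "good" for d in DIMENSIONS):
        return "good"
    return "neutral"

def result_ratios(graded_summaries):
    labels = [classify(r) for r in graded_summaries]
    n = len(labels)
    return {"good_ratio": labels.count("good") / n,
            "poor_ratio": labels.count("poor") / n}

graded = [
    {d: "good" for d in DIMENSIONS},
    {**{d: "good" for d in DIMENSIONS}, "groundedness": "poor"},
    {**{d: "good" for d in DIMENSIONS}, "composition": "neutral"},
]
print(result_ratios(graded))   # one good, one poor, one neutral summary
```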
Results. We ask human graders to evaluate the summarization quality of the AFM-on-device adapter, Phi-3-mini, Llama-3-8B, and Gemma-7B. Figure 8 shows that the AFM-on-device adapter overall outperforms the other models.
7 Responsible AI
7.1 Overview
Apple Intelligence is developed responsibly and designed with care to empower our users, represent them authentically, and protect their privacy. Of primary importance to our Responsible AI approach is that we are ultimately delivering intelligent, well-defined tools that address specific user needs. Having a clear definition of what a feature is intended to do allows us to better identify any potential safety gaps.
We have developed a safety taxonomy in order to be comprehensive and consistent in the design and evaluation of our generative AI-powered features. This taxonomy builds on and extends Apple's extensive experience in using artificial intelligence and machine learning to deliver helpful features to users around the world, and is updated regularly as we develop and test features. Currently, it consists of 12 primary categories comprised of 51 subcategories, including "Hate Speech, Stereotypes, and Slurs", "Discrimination, Marginalization, and Exclusion", "Illegal Activities", "Adult Sexual Material", and "Graphic Violence".
The taxonomy serves as a structured way to consider potential issues and risks relative to each specific feature. As new or additional risks are identified, we develop and revise the associated policies, which are contextualized to each individual feature, taking into account the specific needs it serves, the content it produces, and the appropriate mitigations. The policies are developed with extensive internal and external input from academics, AI ethicists, trust and safety, and legal experts to better identify and understand the relevant risks, the potential severity of such risks, and the potential disparate impact these risks may have on certain groups. These policies guide our work in data collection, human annotation, model training, guardrails development, evaluation, and red teaming.
Figure 8: Ratio of "good" and "poor" responses for three summarization use cases relative to all responses. Summaries are classified as "good", "neutral", or "poor" along five dimensions. A result is classified as "good" if all of the dimensions are good (higher is better). A result is classified as "poor" if any of the dimensions are poor (lower is better). Overall, our AFM-on-device adapter generates better summaries than comparable models.
In particular, the taxonomy is not itself the sole determinant of our policy. For example, content that may fall within the safety taxonomy is not necessarily always blocked, as doing so unilaterally may conflict with other aspects of Apple's Responsible AI development principles, such as "respecting how our users choose to use these tools to accomplish their goals." Thus, features that operate as tools may be more permissive in the kinds of content they operate over and produce in order to effectively address the user's intent. On the other hand, features that may generate content beyond a user's specified intent may need to be more constrained. Regardless, we strive for some categories of harm to always be treated with special care (such as any content that relates to self-harm), while other categories are always blocked (such as illegal content).
In addition, our Responsible AI principles are built into every stage of Apple Foundation Models and Apple Intelligence as well as the safety taxonomy, which helps us evaluate risks and formulate policies feature by feature. We include safety-oriented data as part of our fine-tuning of specific adapters tailored by use case. Furthermore, at the time of inference, we also run guardrail models [Inan et al., 2023] as pre- and post-processing steps to evaluate potential harm at both the input and output level. Finally, we have mechanisms in place to continuously and proactively improve our AI tools with the help of ongoing user feedback.
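As a rough illustration of the pre- and post-processing guardrail step, the sketch below wraps a model call with an input check and an output check. The keyword screen, blocked-term list, and `generate` stub are placeholders; the actual system uses dedicated guardrail models rather than keyword matching.

```python
BLOCKED_TERMS = {"how to build a weapon"}          # illustrative only

def guardrail_check(text):
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def generate(prompt):
    # Placeholder for the foundation model call.
    return f"Here is a helpful response to: {prompt}"

def safe_generate(prompt):
    if guardrail_check(prompt):                     # pre-processing: screen the input
        return "Sorry, I can't help with that."
    response = generate(prompt)
    if guardrail_check(response):                   # post-processing: screen the output
        return "Sorry, I can't help with that."
    return response

print(safe_generate("Summarize my unread emails."))
```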
7.2 Pre-Training
At the pre-training stage, we take several steps to ensure that the values outlined above are upheld. We follow a strict data policy ensuring that no Apple user data is included, and we conduct rigorous legal review for each component in the training corpus. Further, we perform safety filtering to reduce potentially harmful content, including NSFW content, profanity, spam, and PII or financial data.
Because pre-training is a step shared among various downstream features, our safety mitigations aim to retain general capabilities that allow us to iterate on the taxonomy and policy at a per-feature level, without hurting the helpfulness of these downstream models. We take learnings from prior work to avoid overly aggressive filtering at the pre-training stage, which has potential benefits for safety alignment [Touvron et al., 2023]. Intuitively, the pre-trained model should be aware of content that downstream features and policies may require it to handle, in some cases with care, and in other cases by operating over such content directly.
7.3 Post-Training
In the post-training phase, we aim to instill a baseline level of alignment with our Responsible AI principles to avoid necessitating the full complexities of post-training (such as RLHF) in each downstream model that builds on top of the foundation model. In doing so, there are two key considerations:
We must ensure our models produce output that is helpful to users, while minimizing potential harm.
We must contextualize our safety taxonomy and policies on a feature-by-feature basis to deliver the best possible user experience.
To balance the trade-off between helpfulness and harmlessness, our solution is to treat safety alignment as one of the many core post-training tasks that are evaluated and iterated on in tandem, rather than as a separate stage of training. Specifically, we include adversarial data in our SFT and RLHF training corpora that is curated according to our policy and values by partnering closely with trusted vendors. We also incorporate safety tasks and benchmarks into the automatic and human evaluations used during model development.
In total, over of the training data are adversarial or related to safety or sensitive topics, including single and multi-turn safety category annotations, pairwise and overall preference ratings, and annotator rewrites. This data is either used directly or as seed data for synthetic data generation, as described in Section 4.1.2.
We do additional work to achieve appropriate safety behavior for each feature beyond baseline alignment. A primary way that we do this is by collecting safety-specific training data and including it when fine-tuning adapters. For instance, when fine-tuning our summarization adapter, we sought to improve robustness against malicious questions included within the content to be summarized, and to reduce the likelihood that summaries would inadvertently amplify harmful or sensitive content.
7.4 Guarding against malicious code
Code generation requires special care. Our code benchmarks involve actually executing the generated code to determine both syntactic and semantic correctness. Thus, responsible training of code models involves treating all generated code as unsafe by default: all code is always executed in a fully locked-down environment with no access to the internet or any internal or external services. Specifically, the locked-down environment is managed with Firecracker [Agache et al., 2020], with a Firecracker jailer at the cluster level.
7.5 Red teaming
Red teaming attempts to elicit safety policy violating responses from models, or harmful responses for which no policy yet exists. These results inform both policy development as well as the focus and content of safety evaluation datasets. These in turn can influence design, engineering, and shipping readiness decisions.
Red teaming is a fundamentally creative endeavor that requires red teamers to employ combinations of attack vectors to probe known model vulnerabilities, and try to discover new ones. Attack vectors used when engaging with language models include jailbreaks/prompt injections, persuasive techniques [Zeng et al., 2024], and linguistic features known to cause model misbehavior (e.g. slang, code-switching, emojis, typos).
We employ both manual and automatic red-teaming [Ganguli et al., 2022] to elicit potentially unknown failure modes of the aligned models. More recent works [Touvron et al., 2023] suggest that automated processes can potentially generate even more diverse prompts than humans, previously seen as the "gold" standard for data collection. These automated processes can include using the language models themselves to identify gaps, some of which may be unintuitive or even surprising. Such examples can be used directly as synthetic training or evaluation data and to inform future data collection efforts.
A basic human red teaming task schema is as follows: a red teamer is assigned a safety taxonomy category and attack vector(s). They author an input to the model, using that attack vector, that is intended to elicit a response containing content from that category. If the response does not contain the target content, the red teamer can engage in a fixed number of conversational turns, after which they provide a final harmfulness rating of the model output and list the taxonomy categor(ies) in it, if any. To ensure annotation quality, red teamers also provide an overall confidence score for their ratings.
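This task schema can be captured in a small record type. A minimal sketch follows; the field names, rating scale, and confidence scale are assumptions for illustration rather than the actual annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RedTeamTask:
    assigned_category: str                              # safety taxonomy category
    attack_vectors: List[str]                           # e.g. persuasion, code-switching
    turns: List[str] = field(default_factory=list)      # alternating red-teamer / model messages
    harmfulness_rating: Optional[int] = None            # assumed scale, e.g. 0 (harmless) to 4
    elicited_categories: List[str] = field(default_factory=list)
    confidence: Optional[int] = None                    # grader's confidence in the rating

task = RedTeamTask(
    assigned_category="Illegal Activities",
    attack_vectors=["persuasion", "code-switching"],
)
task.turns += ["<adversarial user message>", "<model refusal>"]
task.harmfulness_rating, task.confidence = 0, 5
print(task.assigned_category, task.harmfulness_rating)
```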
In addition to red teaming at the base model level, we also red team specific features. Red teaming projects at the feature level use feature-specific guidelines with attack vectors informed by the feature's safety policy and engineering concerns. These projects can provide in-depth probing of known risks for that particular feature and also adversarially probe for unknown vulnerabilities.
Our red teaming projects are run using internal and external crowds. To ensure responsible data collection, due to the sensitive nature of red teaming we: 1) make red teaming completely voluntary; 2) impose a strict time limit on how much time each red teamer spends on the tasks per week; 3) provide health and well-being resources available around the clock; and 4) maintain an open line of communication with internal red teamers via weekly office hours and a Slack channel for them to communicate any concerns that arise.
7.6 Evaluation
As mentioned in previous sections, safety is one of the many axes iterated on during foundation model development, and it therefore undergoes the same automatic and human evaluation cycles during post-training.
Safety evaluation set To reduce noise, cost, and turn-around time during human evaluations, we must ensure that our safety evaluation sets are clean, yet challenging and comprehensive. To that end, we filter out "easy" prompts which consistently yield low-harmfulness responses across different versions of the model, and employ an embedding-based analysis to improve the coverage of our evaluation prompt set. Overall, we curate a set of over a thousand adversarial prompts to test AFM's performance on harmful content, sensitive topics, and factuality according to our safety policy.
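A minimal sketch of these two curation steps (dropping consistently low-harmfulness prompts, then selecting an embedding-diverse subset) is shown below; the harmfulness scores, threshold, random embeddings, and the farthest-point selection heuristic are all illustrative assumptions rather than the actual pipeline.

```python
import numpy as np

def drop_easy(prompts, harm_scores_by_version, threshold=1):
    # harm_scores_by_version[i] holds one harmfulness score per model version for prompts[i];
    # keep a prompt only if at least one version produced a non-trivial score.
    return [p for p, scores in zip(prompts, harm_scores_by_version)
            if max(scores) >= threshold]

def diverse_subset(prompts, embeddings, k):
    chosen = [0]                                    # start from the first prompt
    while len(chosen) < k:
        dists = np.min(
            np.linalg.norm(embeddings[:, None, :] - embeddings[chosen][None, :, :], axis=-1),
            axis=1,
        )
        dists[chosen] = -1.0                        # never re-pick a chosen prompt
        chosen.append(int(np.argmax(dists)))        # farthest-point heuristic for coverage
    return [prompts[i] for i in chosen]

prompts = [f"adversarial prompt {i}" for i in range(6)]
scores = [[0, 0], [2, 1], [3, 3], [0, 1], [2, 2], [1, 0]]
kept = drop_easy(prompts, scores)
emb = np.random.randn(len(kept), 8)                 # stand-in for prompt embeddings
print(diverse_subset(kept, emb, k=3))
```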
Safety evaluation results Figure 9 summarizes the violation rates of different models evaluated by human graders on this safety evaluation set (lower is better). Both AFM-on-device and AFM-server are robust to adversarial prompts, achieving violation rates significantly lower than open-source and commercial models. In addition, we report side-by-side human preference on our safety evaluation prompts in Figure 10. AFM models are preferred by human graders as safe and helpful responses over competitor models.
Figure 9: Fraction of violating responses for harmful content, sensitive topics, and factuality (lower is better). Our models are robust when faced with adversarial prompts.
Figure 10: Fraction of preferred responses in side-by-side evaluation of Apple's foundation model against comparable models on safety prompts. Human graders found our responses safer and more helpful.
8 Conclusion
In this report we introduced the foundation language models that power Apple Intelligence features, AFM-on-device and AFM-server. The models are designed to be fast and run efficiently on iPhone, iPad, and Mac as well as on Apple silicon servers via Private Cloud Compute. They are trained to be highly capable in tasks like language understanding, instruction following, reasoning, writing, and tool use. We have developed an innovative model architecture to specialize these models for our users' most common tasks. On top of the foundation models, feature-specific adapters are fine-tuned to provide high-quality user experiences such as summarization of emails, messages, and notifications. Our models have been created with the purpose of helping users do everyday activities across their Apple products, grounded in Apple's core values, and rooted in our Responsible AI principles at every stage. These foundation models are at the heart of Apple Intelligence, the personal intelligence system built by Apple to continue empowering our users and enriching their lives.
References 参考文献
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692-3702. PMLR, 2019. Yasin Abbasi-Yadkori、Peter Bartlett、Kush Bhatia、Nevena Lazic、Csaba Szepesvari 和 Gellért Weisz。Politex:使用专家预测的策略迭代的后悔界限。在机器学习国际会议上,第 3692-3702 页。PMLR,2019 年。
Alexandru Agache, Marc Brooker, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, and Diana-Maria Popa. Firecracker: Lightweight virtualization for serverless applications. In 17th USENIX symposium on networked systems design and implementation (NSDI 20), pages 419-434, 2020 . 阿历山德鲁·阿加谢、马克·布鲁克、亚历山德拉·约尔达切、安东尼·里古里、罗尔夫·诺伊格劳贝尔、菲尔·皮温卡和戴安娜-玛丽亚·波帕。Firecracker: 无服务器应用程序的轻量级虚拟化。在 2020 年第 17 届 USENIX 网络系统设计与实施研讨会(NSDI 20)上,第 419-434 页。
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in LLMs. 2024, arXiv:2402.14740. 阿拉什·艾哈迈迪安、克里斯·克里默、马蒂亚斯·加勒、玛尔济叶·法代、朱莉娅·克罗伊特泽、艾哈迈特·乌斯图恩和萨拉·胡克。回归基础:重新审视在LLMs中从人类反馈中学习的强化方式优化。2024 年,arXiv:2402.14740。
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895-4901, 2023. doi: . emnlp-main. 298 . 乔舒亚·艾因斯利、詹姆斯·李-索普、米希尔·德容、尤里·泽姆斯基、费德里科·莱布朗和 Sumit Sanghai。 GQA: 从多头检查点训练通用多查询变换器模型。 在 2023 年经验自然语言处理会议论文集中,第 4895-4901 页,2023 年。 doi: . emnlp-main. 298 .
Apple. The AXLearn library for deep learning. https://github.com/apple/ axlearn, 2023. 苹果公司。用于深度学习的 AXLearn 库。https://github.com/apple/axlearn,2023。
Apple. Private cloud compute: A new frontier for ai privacy in the cloud. https: //security.apple.com/blog/private-cloud-compute/, 2024b. Accessed: 2024-07-11. 苹果。私有云计算:云端人工智能隐私的新前景。https://security.apple.com/blog/private-cloud-compute/,2024 年 b。访问时间:2024 年 7 月 11 日。
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International 摩哈默德·格什拉吉·阿扎尔,赵汉·丹尼尔·郭,比拉尔·皮奥特,雷米·穆诺斯,马克·罗兰,米哈尔·瓦尔科,and Daniele Calandriello.一种通用的理论范式来理解从人类偏好中学习。在 International
Conference on Artificial Intelligence and Statistics, pages 4447-4455. PMLR, 2024, arXiv:2310.12036. 人工智能和统计会议,第 4447-4455 页。PMLR, 2024, arXiv:2310.12036.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax. 詹姆斯·布拉德伯里、罗伊·弗罗斯蒂格、彼得·霍金斯、马修·詹姆斯·约翰逊、克里斯·利尔、道格拉斯·麦克劳林、乔治·内卡拉、亚当·帕斯基、杰克·范德普拉斯、斯凯·万德曼-米尔及乔张。 JAX:Python+NumPy 程序的可组合化转换,2018。URL http://github.com/google/jax。
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324-345, 1952 . 拉尔夫·艾伦·布拉德利和米尔顿·E·特里。不完全区组设计的秩分析:I. 配对比较法。《生物统计学》,39(3/4):324-345,1952.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. 2022, arXiv:2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1-53, 2024, arXiv:2210.11416. 丛翔云,侯乐,朗珀,巴瑞特·祖普,郑宜泰,威廉·费达斯,李云萱,王学智,穆斯塔法·德哈尼,西达塔·布拉马等.缩放指令精调语言模型.机器学习研究杂志,25(70):1-53, 2024, arXiv:2210.11416.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. 2021, arXiv:2110.14168. 卡尔·柯布、温尼特·科萨拉朱、穆罕默德·巴瓦里安、马克·陈、希卢、卢卡斯·凯撒、马蒂亚斯·普拉彭特、杰瑞·托韦克、雅各布·希尔顿、莱希里罗·中野等。训练验证器解决数学应用题。2021 年, arXiv:2110.14168.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024, arXiv:2305.14314. 蒂姆·德特默斯、阿蒂多罗·帕尼、阿里·霍尔茨曼和卢克·泽特勒迈尔。Qlora 高效的量化LLMs微调。神经信息处理系统进展,36,2024,arXiv:2305.14314。
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. 2024, arXiv:2404.04475. 尚·杜布瓦、巴拉兹·加朗博西、珀西·梁和塔宗诺里·B·哈希莫托。一种对抗自动评估器偏差的简单方法:长度控制 alpacaeval。2024 年,arXiv:2404.04475。
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. 2022, arXiv:2210.17323. 伊利亚斯·弗兰塔尔、萨利·阿什科布斯、托尔斯滕·豪夫勒和丹·阿里斯特。GPTQ:针对生成预训练转换器的精确的训练后量化。2022,arXiv:2210.17323。
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. 2022, arXiv:2209.07858. 甘古利、利弗特、柯尼恩、阿斯克尔、白云涛、卡达瓦斯、曼恩、佩雷兹、希弗、恩都塞等人。红色团队评估语言模型以减少危害:方法、扩展行为和经验教训。2022 年,arXiv:2209.07858。
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. 2021, arXiv:2103.03874. 丹·亨德里克斯、柯林·伯恩斯、索拉夫·卡达瓦特、阿库·阿罗拉、斯蒂文·巴萨特、埃里克·唐、唐安歌、雅各布·斯坦哈特。使用 math 数据集测量数学问题解决能力。2021 年, arXiv:2103.03874.
Geoffrey Hinton. Lecture 6.5 -rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26, 2012. 杰弗里·欣顿。课程 6.5 -rmsprop:用最近幅度的移动平均值来除以梯度。COURSERA:机器学习神经网络, 4(2):26, 2012。
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2015, arXiv:1503.02531. 杰弗里·辛顿、奥里奥尔·维尼亚尔斯和杰夫·迪恩。蒸馏神经网络中的知识。2015 年,arXiv:1503.02531。
Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. Talaria: Interactively optimizing machine learning models for efficient inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1-19, 2024, arXiv:2404.03085. 弗雷德·霍曼、超群王、金默·李、约翰·戈特勒、多米尼克·莫里茨、杰弗里·P·比格姆、郑力、塞西尔·福雷特、单启和张晓毅。Talaria:交互式优化机器学习模型以实现高效推理。《2024 年人机交互系统学术会议论文集》,第 1-19 页,arXiv:2404.03085。
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? 2024, arXiv:2404.06654. 解晨平、孙思梦、萨缪尔·克瑞曼、山塔努·阿查尔亚、迪马·雷克什、贾飞、张洋和鲍里斯·金斯堡。Ruler: 你的长上下文语言模型的真实上下文大小是多少?2024 年, arXiv:2404.06654.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021, arXiv: 2106.09685. 胡爱德, 华力士, 朱正源, 李原智, 王省, 王露, 陈惟珠等. LoRA: 大型语言模型的低秩适应性. 在 2021 年国际学习表征会议上, arXiv: 2106.09685.
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: LLM-based input-output safeguard for human-ai conversations. 2023, arXiv:2312.06674.
Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, page 441-450, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605588896. doi: . 克里斯蒂安·科尔施特尔、彼得·范克豪泽和沃尔夫冈·内德尔。使用浅层文本特征进行样板检测。在第三届 ACM 国际网络搜索和数据挖掘会议论文集, WSDM '10, 第 441-450 页, 纽约, NY, USA, 2010。计算机协会。ISBN 9781605588896。doi:
Xiang Kong, Tom Gunter, and Ruoming Pang. Large language model-guided document selection. 2024, arXiv:2406.04638. 向孔, Tom Gunter 和 Ruoming Pang。大语言模型引导的文档选择。2024, arXiv:2406.04638。
Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! 2019. URL https://openreview.net/forum?id= r11gTGL5DE. 沃特尔·库尔、赫克·范·霍夫和马克斯·维林。购买 4 个 REINFORCE 样本,免费获得基线!2019 年。URL https://openreview.net/forum?id=r11gTGL5DE。
Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, and Csaba Szepesvari. Improved regret bound and experience replay in regularized policy iteration. In International Conference on Machine Learning, pages 6032-6042. PMLR, 2021, arXiv:2102.12611. 内维娜·拉兹奇、董寅、亚辛·阿巴西-亚德科里和茨巴·斯泽佩斯瓦里。改进的后悔边界和规则化策略迭代中的经验重放。在机器学习国际会议上,第 6032-6042 页。PMLR,2021,arXiv:2102.12611.
Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi Zhou, Nan Du, Vincent Zhao, Yuexin Wu, Bo Li, et al. Conditional adapters: Parameter-efficient transfer learning with fast inference. Advances in Neural Information Processing Systems, 36:8152-8172, 2023, arXiv: 2304.04947. 陶磊、柏俊文、西达塔·布拉玛、约书亚·恩斯利、肯顿·李、杨皮周、南渡、云场、吴越心、李博等。有条件适配器:快速推理的参数高效迁移学习。神经信息处理系统进展, 36:8152-8172, 2023, arXiv: 2304.04947.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 3843-3857, 2022. 艾托尔·列夫科维茨、安德斯·安德烈森、大卫·多汉、伊桑·戴尔、亨里克·米卡莱夫斯基、维内·拉马瑟赫、安布罗斯·斯隆、切姆·阿尼尔、伊曼纽尔·辛格、西奥·古特曼-索罗等. 使用语言模型解决定量推理问题. 神经信息处理系统进展, 35: 3843-3857, 2022.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: In search of the next generation of training sets for language models. 2024a, arXiv:2406.11794.
Tianle* Li, Wei-Lin* Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline. April 2024b. URL https://lmsys.org/blog/ 2024-04-19-arena-hard/. 李天乐*、蒋伟林*、Evan Frick、Lisa Dunlap、朱邦华、Joseph E. Gonzalez 和 Ion Stoica。从实时数据到高质量基准:arena-hard 管道。2024 年 4 月。<URL>https://lmsys.org/blog/2024-04-19-arena-hard/</URL>。
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto,
Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. 2023, arXiv:2211.09110. 托马斯·伊卡德、张天翼、陈涛、王威、李学忱、麦逸凡、张宇辉和是仓真太。全面评估语言模型。2023 年,arXiv:2211.09110。
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87-100, 2024, arXiv:2306.00978. 纪林、汤佳明、汤昊天、杨尚、陈伟明、王维辰、肖广轩、党星宇、甘创、韩松。Awq:面向设备的激活感知权重量化压缩与加速。机器学习与系统论文集,6:87-100,2024,arXiv:2306.00978.
Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of rope-based extrapolation. 2024, arXiv:2310.05209. 刘小冉、颜航、张硕、安陈鑫、邱喜朋、林大华。基于绳索的外推缩放定律。2024 年,arXiv:2310.05209。
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 2019, arXiv:1711.05101. 一列亚·洛什乔夫和弗兰克·哈特。分离权重衰减正则化。2019 年,arXiv:1711.05101.
Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations, 2018, arXiv:1712.01312. URL https:// openreview.net/forum?id=H1Y8hhgOb. 克里斯托斯·劳伊佐斯、麦克斯·维灵和迪德里克·P·金格马。通过 正则化学习稀疏神经网络。载于《国际学习表征大会》, 2018 年, arXiv:1712.01312。URL 为 https:// openreview.net/forum?id=H1Y8hhgOb。
Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In Jan Niehues, Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, and Marcello Federico, editors, Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong, November 2-3 2019. Association for Computational Linguistics. URL https://aclanthology.org/2019.iwslt-1.17. 阮文宽和朱利安·萨拉查尔。没有眼泪的变形金刚:改善自我关注的标准化。在 1 月尼埃尔斯、罗兰多·卡托尼、塞巴斯蒂安·斯图克、马特奥·内格里、马可·塔奇、谢·哈、伊丽莎白·塞尔斯基、拉蒙·萨纳布里亚、洛伊克·巴罗和卢西亚·斯佩西亚、马切洛·费德里科编辑的《2019 年 11 月 2-3 日在香港举行的第 16 届国际口语语言翻译大会论文集》中。计算语言学协会。URL https://aclanthology.org/2019.iwslt-1.17.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:2773027744, 2022, arXiv:2203.02155. 欧洋龙、吴俊杰、蒋旭、迪奥戈·阿尔梅达、卡罗尔·温赖特、帕米拉·米斯金、张冲、桑蒂尼·阿加尔瓦尔、卡塔丽娜·斯拉马、亚历克斯·雷等. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:2773027744, 2022, arXiv:2203.02155.
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. 2023, arXiv:2305.15334. 帕蒂尔·G·谢希里、张恬君、王鑫、何塞·E·冈萨雷斯。Gorilla:与大量 API 连接的大型语言模型。2023 年,arXiv:2305.15334。
Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Conference of the European Chapter of the Association for Computational Linguistics, 2016, arXiv:1608. 05859. 奥菲尔·普雷斯和利奥·沃尔夫。使用输出嵌入来改善语言模型。在《计算语言学欧洲分会议》上, 2016 年, arXiv:1608.05859。
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024, arXiv:2305.18290. 拉斐尔·拉斐洛夫、阿尔西特·夏尔马、艾瑞克·米切尔、克里斯托弗·D·曼宁、斯特凡诺·埃尔蒙和切尔西·芬恩。直接偏好优化:您的语言模型秘密是一个奖励模型。神经信息处理系统进展,36,2024 年,arXiv:2305.18290。
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897. PMLR, 2015, arXiv:1502.05477. 约翰·舒尔曼、谢尔盖·列维、皮特·阿贝尔、迈克尔·乔丹和菲利普·莫里茨。信任区域策略优化。在机器学习国际会议上,第 1889-1897 页。PMLR,2015, arXiv:1502.05477.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. 2017, arXiv:1707.06347. 约翰·舒尔曼、菲利普·沃尔斯基、普拉夫拉·达里瓦尔、亚历克·拉德福德和奥列格·克里莫夫。近端策略优化算法。2017 年,arXiv:1707.06347。
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 苏健麟,穆塔达·艾哈迈德,芦宇,潘圣峰,文波,和刘云峰。Roformer:增强版具有旋转位置嵌入的 Transformer。Neurocomputing, 568:127063, 2024。
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/ stanford_alpaca, 2023. 罗汉·塔奥里、伊山·古拉拉尼、天艺·张、阳·杜布瓦、雪琛·李、卡洛斯·格斯特里恩、珀西·梁和达特诺里·B·哈西摩托。斯坦福阿尔帕卡:一个遵循指令的驼羊模型。https://github.com/tatsu-lab/stanford_alpaca, 2023.
Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. 2020, arXiv:2005.09814.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. 2023, arXiv:2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017, arXiv:1706.03762.
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6151-6162, 2020, arXiv:1910.04732. doi: 10.18653/v1/2020.emnlp-main.496.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer training instabilities. 2023, arXiv:2309.14322.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. 2023, arXiv:2310.06694.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. 2023, arXiv:2304.12244.
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. 2022, arXiv:2203.03466.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. 2023, arXiv:2309.12284.
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. 2024, arXiv:2401.06373. URL https://arxiv.org/abs/2401.06373.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems, 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. 2023, arXiv:2311.07911.
Contributors
Within each section, contributors are listed in alphabetical order by first name.
Foundation Models
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang (inference efficiency lead), Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Ruoming Pang (overall lead), Sam Wiseman, Syd Evans, Tao Lei, Tom Gunter (pre-train lead), Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Zirui Wang (post-train lead)
Data, Evaluation, and Responsible AI
Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Ke Ye, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Mark Lee, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xiang Kong, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Zhiyun Lu
Adapters, Optimizations, and Summarization 适配器、优化和总结
Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Chong Wang, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Guoli Yin, Irina Belousova, Jianyu Wang, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qibin Chen, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Vivek Rathod, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xiang Kong, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren
Appendix
A Core pre-training recipe ablation
We compare our chosen settings for 'core' pre-training from Section 3.2.1 (optimizer, scaling-law-predicted batch size, weight decay, etc.) to a baseline based on [Wortsman et al., 2023]. In particular, the baseline uses AdamW with a standard hyperparameter configuration and a decoupled weight decay, decaying the learning rate to 0.0001 of its peak value, with a batch size of 1024 sequences. Otherwise both recipes are identical. Training covers 3.1T tokens using the AFM-on-device architecture, but with a different data mixture from that used by the official AFM training runs.
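For concreteness, the baseline configuration described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the actual training code: the peak learning rate, betas, weight-decay value, schedule shape, and step count are placeholder assumptions, while the decay to 0.0001 of the peak learning rate and the 1024-sequence batch size follow the description above.

```python
import torch

# Minimal sketch of the baseline recipe (AdamW with decoupled weight decay,
# learning rate decayed to 0.0001x of its peak). All values marked as
# placeholders are assumptions for illustration only.
PEAK_LR = 1e-2            # placeholder peak learning rate
FINAL_LR_FRACTION = 1e-4  # decay target: 0.0001x of peak, per the text
TOTAL_STEPS = 100_000     # placeholder; the real run covers 3.1T tokens
BATCH_SIZE = 1024         # sequences per batch, per the text

model = torch.nn.Linear(4096, 4096)  # stand-in for the AFM-on-device transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.95),    # placeholder "standard" AdamW betas
    weight_decay=1e-4,    # placeholder decoupled weight decay
)
# Assumed cosine decay down to FINAL_LR_FRACTION * PEAK_LR.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=TOTAL_STEPS, eta_min=PEAK_LR * FINAL_LR_FRACTION
)
```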
| Task | Baseline (acc) | AFM (acc) |
| --- | --- | --- |
| arc_challenge | 41.9 | 44.6 |
| arc_easy | 75.6 | 76.1 |
| hellaswag | 54.3 | 55.0 |
| lambada | 69.3 | 68.9 |
| piqa | 78.3 | 78.4 |
| sciq | 94.5 | 94.7 |
| winogrande | 67.3 | 66.9 |
| triviaqa (1-shot) | 40.5 | 41.0 |
| webqs (1-shot) | 20.6 | 20.6 |
| CoreEN average | 60.2 | 60.7 |
| GSM8K (8-shot CoT) | 16.6 | 18.9 |
| MMLU (5-shot) | 45.4 | 45.5 |
Table 5: Core pre-training recipe ablation few-shot results. Unless otherwise noted, we use 0-shot prompts. We note that AFM's recipe allows for slight improvements across the majority of tasks, although the difference is typically very small. The data mixture differs from the official AFM runs.
In Table 5, AFM's recipe demonstrates a slight improvement over the baseline. This likely indicates that the most important recipe settings are already well-enough configured by the baseline for this model size and training budget.
B Ablations on pruning and distillation
Here we detail the evaluation results of using structural pruning and distillation separately, and show that they can be combined to obtain the best performance.
Table 6 shows the ablation results of training 3B models using an early version of our pre-training data mixture. As shown in the table, both pruning and distillation can outperform a baseline model trained from scratch. For example, pruning and distillation each achieve a considerably higher MMLU score than the from-scratch baseline, with a more expensive baseline trained for more steps shown for comparison. It is also interesting that pruning achieves a higher score on the CoreEN benchmark, while distillation is better on MMLU. Finally, when combining these two methods, we observe further improvements on MMLU and GSM8K by a large margin, obtaining results better than or on par with the baseline trained using more computation.
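To clarify what "distillation" refers to here, the sketch below shows a generic soft-target distillation loss in PyTorch; it illustrates the standard technique only, not the exact AFM objective, and the `alpha` and `temperature` values are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=1.0):
    """Blend the usual next-token cross-entropy with a KL term that pulls the
    student distribution toward the (frozen) teacher distribution."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kl
```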
Metric/Method
Baseline
Prune
Distill
Both
Baseline
Training cost
MMLU (5-shot)
34.6
42.9
44.9
45.4
GSM8K (8-shot CoT)
12.7
13.5
11.0
CoreEN Average
59.8
58.1
59.7
60.3
Table 6: Ablation results of pruning and distillation methods. The training data is an early version that differs from the official AFM runs.
C Pre-training stage-breakdown evaluations
We present few-shot evaluation results after the core, continued, and long-context pre-training stages, for a subset of evaluation metrics that we find to be low-variance, diverse, and correlated with downstream evaluation after post-training. These metrics are derived using an internal harness and set of benchmark formulations, which are not optimized for absolute performance (e.g., we do not apply length normalization, and we use more difficult test splits where available, as for TriviaQA). They are therefore not suitable for comparison with other published results.
In Tables 7 and 8 we present internal benchmarks after all three stages of pre-training. As expected, continued pre-training improves math and particularly code capabilities, while subtly improving a few other benchmarks. The context-lengthening stage leaves the majority of these benchmarks on par, with changes (positive and negative) typically within the range of what we consider to be evaluation noise.
D Long-context evaluation
Although the focus for this version of AFM was not to support context lengths longer than 8k, in Table 9 we use the RULER [Hsieh et al., 2024] benchmark to evaluate AFM-server at context lengths from 4k to 32k. We note that the model performs perfectly on straightforward retrieval-like tasks, e.g., needle-in-the-haystack (NIAH), up to a substantial sequence length. It is clear, however, that model performance gradually suffers with increasing context length.
| AFM-on-device | Core | Continued | Context lengthened |
| --- | --- | --- | --- |
| ARC_C | 43.17 | 47.53 | 45.39 |
| ARC_E | 74.87 | 78.62 | 78.37 |
| HellaSwag | 54.70 | 55.50 | 55.24 |
| LAMBADA | 73.51 | 70.13 | 69.90 |
| PIQA | 77.37 | 78.67 | 78.40 |
| SciQ | 94.90 | 95.80 | 95.70 |
| WinoGrande | 65.82 | 67.32 | 67.01 |
| TriviaQA (1 shot) | 42.46 | 39.13 | 38.11 |
| WebQS (1 shot) | 19.24 | 18.06 | 17.22 |
| CoreEN average | 60.67 | 61.20 | 60.59 |
| MMLU (5 shot) | 57.00 | 61.35 | 60.64 |
| GSM8K (8 shot CoT) | 27.45 | 42.53 | 40.00 |
| MATH (4 shot CoT) | 8.31 | 16.97 | 15.48 |
| HumanEval-Py pass@1 | 16.48 | 27.38 | 30.84 |
| MultiPLE-Swift pass@1 | 8.88 | 19.24 | 18.06 |
Table 7: Pre-training evaluation for AFM-on-device with an internal harness. Unless otherwise noted, we use 0-shot prompts. TriviaQA evaluation is on the larger and more challenging "Web" split.
On RULER, a more complex evaluation benchmark than NIAH, this degradation suggests that the real context length of AFM-server, for tasks beyond retrieval, is currently at most 24k.
E Technical details for RLHF
E.1 Reward modeling
The human preference data that we use in reward model training has the following format:
- the prompt;
- the chosen (preferred) response;
- the rejected response;
- the level of the human preference;
- the instruction-following property of the two responses;
- the verbosity of the two responses;
- the truthfulness of the two responses;
| AFM-server | Core | Continued | Context lengthened |
| --- | --- | --- | --- |
| ARC_C | 58.28 | 58.87 | 57.94 |
| ARC_E | 85.61 | 85.44 | 85.06 |
| HellaSwag | 64.17 | 64.53 | 64.37 |
| LAMBADA | 78.38 | 77.59 | 77.82 |
| PIQA | 82.37 | 81.99 | 81.88 |
| SciQ | 96.60 | 97.10 | 97.00 |
| WinoGrande | 80.51 | 79.16 | 79.08 |
| TriviaQA (1 shot) | 54.33 | 53.57 | 53.42 |
| WebQS (1 shot) | 29.97 | 27.66 | 27.41 |
| CoreEN average | 70.02 | 69.55 | 69.33 |
| MMLU (5 shot) | 74.00 | 75.24 | 74.80 |
| GSM8K (8 shot CoT) | 75.44 | 74.83 | 75.51 |
| MATH (4 shot CoT) | 32.24 | 36.48 | 35.77 |
| HumanEval-Py | 33.23 | 40.77 | 39.55 |
| MultiPLE-Swift | 30.15 | 37.70 | 38.11 |
Table 8: Pre-training evaluation for AFM-server with an internal harness. Unless otherwise noted, we use 0-shot prompts. TriviaQA evaluation is on the larger and more challenging "Web" split.
| AFM-server | Average acc |
| --- | --- |
| Ctx @ 4096 | 91.7 |
| Ctx @ 8192 | 87.7 |
| Ctx @ 16384 | 84.1 |
| Ctx @ 20480 | 79.1 |
| Ctx @ 24576 | 75.8 |
| Ctx @ 32768 | 43.3 |
Table 9: RULER [Hsieh et al., 2024] average evaluation results, averaged over 13 synthetic long-context tasks using 500 examples per task.
- the harmlessness of the two responses.
In our reward modeling, the preference level takes 4 possible values, indicating that the chosen response is negligibly better, slightly better, better, or significantly better than the rejected response. As for the single-sided gradings, each label takes 3 possible values. For instruction following, truthfulness, and harmlessness, the 3 values correspond to the cases where the response has a major issue, a minor issue, or no issue. For verbosity, the 3 values correspond to the cases where the response is too verbose, too short, or just right.
We use a multi-head architecture for the reward model. More specifically, we take a decoder-only transformer and obtain the last-layer embedding of the last non-padding token. We attach one linear head and four MLP heads to this embedding. The linear head outputs a scalar preference reward. The four MLP heads are classification heads predicting the instruction-following, verbosity, truthfulness, and harmlessness properties of the response.
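As a concrete illustration of this multi-head design, the following is a minimal PyTorch sketch. The hidden size, MLP shape, backbone interface, and three-class grading heads (matching the three grading values described above) are assumptions, and `backbone` stands in for the decoder-only transformer.

```python
import torch
import torch.nn as nn

class MultiHeadRewardModel(nn.Module):
    """Minimal sketch of the multi-head reward model described above."""

    def __init__(self, backbone: nn.Module, hidden_size: int = 4096, num_grades: int = 3):
        super().__init__()
        self.backbone = backbone                        # decoder-only transformer returning hidden states
        self.preference_head = nn.Linear(hidden_size, 1)

        # One small MLP classifier per single-sided grading (shape is an assumption).
        def mlp():
            return nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                 nn.Linear(hidden_size, num_grades))

        self.instruction_head = mlp()
        self.verbosity_head = mlp()
        self.truthfulness_head = mlp()
        self.harmlessness_head = mlp()

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)     # assumed to return [batch, seq, hidden]
        # Last-layer embedding of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        emb = hidden[torch.arange(hidden.size(0)), last_idx]  # [batch, hidden]
        return {
            "reward": self.preference_head(emb).squeeze(-1),
            "instruction_following": self.instruction_head(emb),
            "verbosity": self.verbosity_head(emb),
            "truthfulness": self.truthfulness_head(emb),
            "harmlessness": self.harmlessness_head(emb),
        }
```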
Soft label loss. We train the preference reward based on the Bradley-Terry-Luce (BTL) model [Bradley and Terry, 1952]. Recall that in the BTL model, the probability that the chosen response is preferred over the rejected one is modeled as the sigmoid of the difference between their rewards. Intuitively, this probability should be larger if the chosen response is annotated as significantly better than the rejected response, and smaller if it is only negligibly better. We incorporate this information using the preference level: for each preference level, we design a target preference probability and train the preference head with a soft label loss.
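A natural formulation consistent with this description, written with assumed notation (prompt $x$, chosen and rejected responses $y_c$ and $y_r$, preference-reward head $r_\phi$ with parameters $\phi$, sigmoid $\sigma$, and target probability $p_\ell$ for preference level $\ell$), is the following sketch:

```latex
\Pr(y_c \succ y_r \mid x) = \sigma\bigl(r_\phi(x, y_c) - r_\phi(x, y_r)\bigr),
\qquad
\mathcal{L}_{\mathrm{pref}} =
-\,p_\ell \log \sigma\bigl(r_\phi(x, y_c) - r_\phi(x, y_r)\bigr)
-\,(1 - p_\ell) \log\bigl(1 - \sigma\bigl(r_\phi(x, y_c) - r_\phi(x, y_r)\bigr)\bigr).
```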
The target preference probability for each level is a hyperparameter in our algorithm and takes a larger value when the preference level is higher. In our experiments, we assign a distinct target to each of the four levels, from significantly better down to negligibly better.
Single-sided grading as regularization. We also leverage the single-sided gradings as regularization terms in our reward model. The intuition is that with these gradings as regularization terms, we can learn a better embedding that captures human preferences. The regularization term is a classification loss on the four grading heads.
Overall, the reward model training loss combines the soft label preference loss with the single-sided grading regularization terms.
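A plausible form of these terms, under the same assumed notation plus per-response grading labels $g^{\mathrm{if}}, g^{\mathrm{verb}}, g^{\mathrm{truth}}, g^{\mathrm{harm}}$, the corresponding classification-head logits $z_\phi^{\mathrm{if}}, z_\phi^{\mathrm{verb}}, z_\phi^{\mathrm{truth}}, z_\phi^{\mathrm{harm}}$, a cross-entropy loss $\mathrm{CE}$ over the three grading classes, and an assumed regularization weight $\lambda$, is:

```latex
\mathcal{L}_{\mathrm{reg}}(x, y) =
\mathrm{CE}\bigl(z_\phi^{\mathrm{if}}(x, y), g^{\mathrm{if}}\bigr)
+ \mathrm{CE}\bigl(z_\phi^{\mathrm{verb}}(x, y), g^{\mathrm{verb}}\bigr)
+ \mathrm{CE}\bigl(z_\phi^{\mathrm{truth}}(x, y), g^{\mathrm{truth}}\bigr)
+ \mathrm{CE}\bigl(z_\phi^{\mathrm{harm}}(x, y), g^{\mathrm{harm}}\bigr),

\mathcal{L} = \mathcal{L}_{\mathrm{pref}}
+ \lambda \bigl(\mathcal{L}_{\mathrm{reg}}(x, y_c) + \mathcal{L}_{\mathrm{reg}}(x, y_r)\bigr).
```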
E.2 Online RL algorithm
In this section, we present more details of our online RLHF algorithm, MDLOO.
Leave-One-Out (LOO) estimator of the advantage. Each iteration of the algorithm consists of a data-collection stage and a policy-updating stage. At the beginning of an iteration, we sample a batch of prompts from our prompt set and, for each prompt, sample several responses from the current policy, collecting a batch of prompt-response pairs. Since we consider the bandit setting, the advantage of a prompt-response pair is, by definition, its reward minus the expected reward of a response drawn from the current policy for the same prompt. We use the leave-one-out (LOO) method [Kool et al., 2019] to estimate this advantage: for each response, the expected reward for the prompt is estimated by the mean reward of the other responses sampled for that prompt.
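Written with assumed notation (policy $\pi_{\theta_t}$ at the start of iteration $t$, $K$ responses $y_1, \dots, y_K$ sampled per prompt $x$, and reward $r$), the advantage and its leave-one-out estimate described above are:

```latex
A^{\pi_{\theta_t}}(x, y_i)
= r(x, y_i) - \mathbb{E}_{y \sim \pi_{\theta_t}(\cdot \mid x)}\bigl[r(x, y)\bigr]
\;\approx\;
r(x, y_i) - \frac{1}{K - 1} \sum_{j \neq i} r(x, y_j).
```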
As shown in recent work [Ahmadian et al., 2024], this advantage estimation is beneficial for RLHF. Empirically, we find that using the LOO estimator leads to more stable training and better results than directly using the reward as the advantage estimate or using the difference between the reward and a running-average baseline [Williams, 1992].
Mirror descent policy optimization (MDPO). Our policy optimization approach belongs to a widely used class of trust-region policy optimization algorithms [Schulman et al., 2015]. The basic idea of these algorithms is that, in each policy iteration, a regularization method prevents the policy from changing too much within the iteration. The regularization can be achieved by adding KL regularization [Abbasi-Yadkori et al., 2019; Lazic et al., 2021; Tomar et al., 2020] or by clipping the probability ratio, as in PPO [Schulman et al., 2017]. In this work, we use KL regularization as in Mirror Descent Policy Optimization (MDPO) [Tomar et al., 2020].
In particular, in each iteration, using the prompts and the responses sampled from the policy at the start of that iteration, we aim to maximize a regularized advantage objective: the expected advantage under the updated policy, minus a KL penalty between the updated policy and the policy at the beginning of the iteration. Note that this KL regularization term is different from the one in Eq. (1): the KL regularization in Eq. (1) is between the policy model and the reference model, whereas the term here is between the policy model and the policy at the beginning of the current iteration. The gradient of this objective can then be expressed in terms of the advantage and the KL term.
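A standard MDPO-style objective consistent with this description, using the same assumed notation as above and a KL coefficient $\beta$, is:

```latex
\max_{\theta} \;\;
\mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\bigl[A^{\pi_{\theta_t}}(x, y)\bigr]
\;-\;
\beta\,\mathbb{E}_{x}\Bigl[D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\theta_t}(\cdot \mid x)\bigr)\Bigr].
```

Its gradient can be estimated from the responses sampled under $\pi_{\theta_t}$ by importance-weighting the advantage term, which is the estimator referred to in the next paragraph.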
The MDLOO algorithm is obtained by replacing the expectations in the gradient of this objective with the samples collected under the current policy, and the advantage with the LOO estimator described above. Empirically, we find that MDLOO works better than the popular PPO algorithm [Schulman et al., 2017] in our setting.
F Accuracy-recovery adapters ablation
In this section, we present the evaluation results for unquantized, quantized, and accuracy-recovered models. As shown in Table 10, the quantized models suffer large quality drops on both pre-train and post-train metrics. Using accuracy-recovery LoRA adapters with only rank 16 substantially improves the AlpacaEval win rate and GSM8K accuracy. The recovered models perform much closer to the original unquantized model while achieving significant reductions in model size. More interestingly, we observe that as the quantization scheme becomes more aggressive (from 3.7 to 3.5 bits per weight), the adapters also recover more of the lost quality.
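Structurally, an accuracy-recovery adapter of this kind amounts to a small trainable low-rank update applied on top of frozen quantized weights. The following is a minimal PyTorch sketch of a rank-16 LoRA-style layer; it illustrates the general technique only, and the initialization, scaling, and `alpha` value are assumptions rather than AFM's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: a rank-r adapter added on top of a frozen (e.g. quantized) linear layer.
    Only the low-rank A/B matrices are trained, which is how an accuracy-recovery adapter
    can restore quality lost to aggressive weight quantization."""

    def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                                # frozen (quantized) weights
        self.lora_a = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base_linear.out_features, rank))  # zero-init: no update at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```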
| BPW | Models | IFEval Instruction-Level | AlpacaEval 2.0 LC | GSM8K (8-shot CoT) |
| --- | --- | --- | --- | --- |
| | AFM-on-device | | | |
| | quantized | | | |
| | Acc.-recovered (rank 16) | | | |
| | quantized | | | |
| | Acc.-recovered (rank 16) | | | |

Table 10: Evaluation results for quantized and accuracy-recovered models. Numbers are normalized to the unquantized version.
In scaling-law experiments we find that µParam (simple) stabilizes the optimal learning rate as model size increases, although extrapolating to very significantly deeper and/or larger models does exhibit a slight left-shift beyond what is accounted for.
A prompt may consist of the most recent user instruction as well as all previous user, model, and system interactions.
We compared against the following model versions: gpt-3.5-turbo-0125, gpt-4-0125-preview, Gemini-1.5-Pro-0514, DBRX Instruct, Phi-3-mini-4k-instruct, LLaMA 3 8B Instruct, LLaMA 3 70B Instruct, Mistral-7B-Instruct-v0.2, Mixtral-8x22B-Instruct-v0.1, Gemma-1.1-2B, and Gemma-1.1-7B.
Due to the choice of using GPT-4 as judge, the score of GPT-4 Turbo can be overestimated.