License: arXiv.org perpetual non-exclusive license
arXiv:2407.10671v1 [cs.CL] 15 Jul 2024

Qwen2 Technical Report


An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan
Qwen Team, Alibaba Group

Authors are ordered alphabetically by the first name.
Abstract

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning.

The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach.

To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face (https://huggingface.co/Qwen) and ModelScope (https://modelscope.cn/organization/qwen), and the supplementary materials including example code on GitHub (https://github.com/QwenLM/Qwen2). These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

1 Introduction

Following the emergence of ChatGPT (OpenAI, 2022), enthusiasm for large language models (LLMs) has escalated globally. The release of the Llama series (Touvron et al., 2023) has further ignited interest within the open-source community, particularly regarding GPT-level local LLMs. Recently, Claude-3 Opus (Anthropic, 2024) and GPT-4o (omni) (OpenAI, 2024), the updated model for ChatGPT, have ascended to the pinnacle of the Chatbot Arena (Chiang et al., 2024) in quick succession. This platform is well-regarded for its human evaluations of LLMs. Moreover, Llama-3 (AI@Meta, 2024) has emerged as the state-of-the-art open-weight model series, narrowing the performance gap with leading proprietary models and is widely acknowledged as GPT-4-level. An increasing number of competitive LLMs are now pursuing advancements similar to those made by the GPT series from OpenAI. Many of these models, including Qwen (Bai et al., 2023a), Mistral (Jiang et al., 2023a), Gemma (Mesnard et al., 2024), etc., have been released in an open-weight manner.

Over recent months, we have successively introduced the Qwen series (Bai et al., 2023a) and progressed to Qwen1.5 (Qwen Team, 2024a). In the meantime, we have unveiled the vision-language model Qwen-VL (Bai et al., 2023b), and launched the audio-language model Qwen-Audio (Chu et al., 2023). In this work, we introduce the newest addition to the Qwen family of large language models and large multimodal models: Qwen2. Qwen2 is a series of LLMs, grounded in the Transformer architecture (Vaswani et al., 2017), trained using next-token prediction. The model series encompasses foundational, i.e., base language models, pre-trained but unaligned to human preferences, and instruction-tuned models, fine-tuned with single-turn and multi-turn instruction-following datasets suitable for chat and agent purposes. Our release comprises four dense models with parameter counts of 0.5 billion, 1.5 billion, 7 billion, and 72 billion, plus a Mixture-of-Experts (MoE) model with 57 billion parameters, of which 14 billion are activated for each token. The smaller models, specifically Qwen2-0.5B and Qwen2-1.5B, are designed for easy deployment on portable devices such as smartphones, earphones, and smart glasses. Conversely, the larger models cater to deployment across GPUs of varying scales.

All models were pre-trained on a high-quality, large-scale dataset comprising over 7 trillion tokens, covering a wide range of domains and languages. Compared to previous editions of Qwen, Qwen2 includes a broader spectrum of linguistic data, enhancing the quantity and quality of code and mathematics content. This enrichment is hypothesized to improve reasoning abilities of LLMs. Regarding post-training, all models underwent supervised fine-tuning and direct preference optimization (DPO, Rafailov et al., 2023), aligning them with human preferences through learning from human feedback. This process endows the models with the capability to follow instructions effectively.

We have conducted a thorough evaluation of Qwen2, alongside a selection of baseline models including both open-weight and proprietary models accessible via API. Qwen2 outperforms competing models in evaluations of both fundamental language capabilities and instruction-tuned functionalities. Specifically, Qwen2-72B-Instruct, our instruction-tuned variant, scores 9.1 on MT-Bench (Zheng et al., 2023), 48.1 on Arena-Hard (Chiang et al., 2024), and 35.7 on LiveCodeBench (Jain et al., 2024). Meanwhile, Qwen2-72B, the base language model, achieves 84.2 on MMLU (Hendrycks et al., 2021a), 37.9 on GPQA (Rein et al., 2023), 64.6 on HumanEval (Chen et al., 2021), 89.5 on GSM8K (Cobbe et al., 2021), and 82.4 on BBH (Suzgun et al., 2023).

2 Tokenizer & Model

This section introduces the tokenizer and model design of Qwen2. We detail the model architecture and configurations for different model sizes.

2.1 Tokenizer

Following Qwen (Bai et al., 2023a), we employ the identical tokenizer based on byte-level byte-pair encoding. Notably, this tokenizer exhibits high encoding efficiency, as evidenced by its better compression rate relative to alternatives, facilitating the multilingual capabilities of Qwen2.

Models of all sizes employ a common vocabulary consisting of 151,643 regular tokens and 3 control tokens. For more information, please refer to Bai et al. (2023a). It should be noted that, owing to considerations in distributed training, the effective size for the embeddings is larger.
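As an illustration (not part of the original report), the released tokenizer can be inspected with the Hugging Face transformers library; the sketch below assumes the publicly hosted Qwen/Qwen2-7B repository and network access.

```python
# A minimal sketch of inspecting the released byte-level BPE tokenizer with
# Hugging Face transformers; assumes the "Qwen/Qwen2-7B" repository is reachable.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

# Byte-level byte-pair encoding: any Unicode text round-trips through byte merges.
ids = tokenizer.encode("Qwen2 supports roughly 30 languages, e.g. 日本語 and Tiếng Việt.")
print(len(ids), tokenizer.decode(ids))

# 151,643 regular tokens plus 3 control tokens; the model's embedding matrix itself
# may be padded to a larger size for distributed-training efficiency.
print(tokenizer.vocab_size, len(tokenizer))
```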

2.2 Model Architecture

The Qwen2 series fundamentally constitutes large language models based on the Transformer architecture, featuring self-attention with causal masks (Vaswani et al., 2017). Specifically, this series encompasses dense language models of 4 scales and a Mixture-of-Experts (MoE) model. We introduce the specifics of the dense models before delving into the MoE model’s distinctive attributes.

2.2.1 Qwen2 Dense Model

The architecture of the Qwen2 dense models comprises multiple Transformer layers, each equipped with causal attention mechanisms and feed-forward neural networks (FFNs). Key differences from Qwen are described below:

Grouped Query Attention

We adopt Grouped Query Attention (GQA, Ainslie et al., 2023) instead of conventional multi-head attention (MHA). GQA optimizes KV cache usage during inference, significantly enhancing throughput. Detailed KV head configurations for various model sizes are reported in Section 2.2.3.
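To make the mechanism concrete, the following sketch (our illustration, using the Qwen2-7B head counts from Table 1) shows how a small number of KV heads is broadcast to the full set of query heads, so the decoding KV cache stores only 4 heads instead of 28.

```python
# A minimal sketch of grouped query attention: several query heads share one
# key/value head, shrinking the KV cache by n_query_heads / n_kv_heads.
import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, head_dim, seq = 28, 4, 128, 16   # Qwen2-7B row of Table 1
group = n_q_heads // n_kv_heads                         # 7 query heads per KV head

q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)           # cached during decoding
v = torch.randn(1, n_kv_heads, seq, head_dim)           # cached during decoding

# Broadcast each KV head to its group of query heads before attention.
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # (1, 28, 16, 128); the cache holds only 4 KV heads instead of 28
```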

Dual Chunk Attention with YARN

To expand the context window of Qwen2, we implement Dual Chunk Attention (DCA, An et al., 2024), which segments long sequences into chunks of manageable lengths. If the input can be handled in a chunk, DCA produces the same result as the original attention. Otherwise, DCA facilitates effective capture of relative positional information between tokens within and across chunks, thereby improving long context performance. Moreover, we also employ YARN (Peng et al., 2023) to rescale the attention weights for better length extrapolation.

Moreover, following Qwen, we use SwiGLU (Dauphin et al., 2017) for activation, Rotary Positional Embeddings (RoPE, Su et al., 2024) for positional embedding, QKV bias (Su, 2023) for attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.

2.2.2 Qwen2 Mixture-of-Experts Model

The architecture of Qwen2 MoE models closely mirrors that of Qwen1.5-MoE-A2.7B (Qwen Team, 2024c). As a substitute for the original FFN, the MoE FFN consists of $n$ individual FFNs, each serving as an expert. Each token is directed to a specific expert $E_i$ for computation based on probabilities assigned by a gated network $G$:

p = \mathrm{softmax}(G(x)),                    (1)
y = \sum_{i \in \mathrm{top}_k(p)} E_i(x).     (2)
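The following sketch (illustrative code, not the training implementation) instantiates Equations (1) and (2): a linear gate $G$ scores the experts and each token's output is the sum of its top-$k$ experts; production implementations additionally batch tokens per expert and include the shared experts discussed below.

```python
# A minimal sketch of the gated MoE FFN in Eqs. (1)-(2); expert internals and sizes
# are simplified assumptions, not the exact Qwen2 FFN structure.
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, hidden, expert_hidden, n_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)     # G
        self.experts = nn.ModuleList(                            # E_1 ... E_n
            nn.Sequential(nn.Linear(hidden, expert_hidden), nn.SiLU(),
                          nn.Linear(expert_hidden, hidden))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, hidden)
        p = torch.softmax(self.gate(x), dim=-1)            # Eq. (1)
        _, idx = p.topk(self.top_k, dim=-1)                # routed experts per token
        y = torch.zeros_like(x)
        for t in range(x.size(0)):                         # loop kept for clarity
            for i in idx[t].tolist():
                y[t] += self.experts[i](x[t])              # Eq. (2)
        return y

y = MoEFFN(hidden=64, expert_hidden=32, n_experts=8, top_k=2)(torch.randn(4, 64))
print(y.shape)  # (4, 64)
```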

In the following, we present critical design considerations of Qwen2 MoE.

Table 1: Architecture of Qwen2 dense and MoE models. For MoE models, 57B-A14B denotes that the model has 57B parameters in total and for each token 14B parameters are active, the Intermediate size denotes that of each expert, and # Activated Experts excludes the shared experts.
Configuration        0.5B     1.5B     7B       72B      57B-A14B
Hidden Size          896      1,536    3,584    8,192    3,584
# Layers             24       28       28       80       28
# Query Heads        14       12       28       64       28
# KV Heads           2        2        4        8        4
Head Size            64       128      128      128      128
Intermediate Size    4,864    8,960    18,944   29,568   2,560
# Routed Experts     -        -        -        -        64
# Activated Experts  -        -        -        -        8
# Shared Experts     -        -        -        -        8
Embedding Tying      True     True     False    False    False
Vocabulary Size      151,646  151,646  151,646  151,646  151,646
# Trained Tokens     12T      7T       7T       7T       4.5T
Expert Granularity

The key structural difference between MoE models and dense models is that MoE layers incorporate multiple FFNs, each serving as an individual expert. Consequently, one straightforward strategy to transition from a dense architecture to an MoE architecture is to set the parameters of each expert equal to those of a single FFN from the original dense model. For example, transitioning from Mistral-7B (Jiang et al., 2023a) to Mixtral 8x7B (Jiang et al., 2024), involves activating one of the eight experts at a time. Differently, our model employs fine-grained experts (Dai et al., 2024), creating smaller-scale experts while activating a greater number of experts simultaneously. Given an equal total number of expert parameters and activated parameters, fine-grained experts offer a richer set of expert combinations. By leveraging these fine-grained experts, Qwen2 MoE facilitates more diverse and dynamic expert utilization, thereby enhancing overall performance and adaptability.

Expert Routing

The design of expert routing mechanisms is crucial for enhancing the performance of MoE models. Recently, there has been a notable trend towards integrating both shared and routing-specific experts within MoE layers (Rajbhandari et al., 2022; Dai et al., 2024). We adopt this approach, as it facilitates the application of shared experts across various tasks while reserving others for selective use in specific routing scenarios. The introduction of shared and specialized experts offers a more adaptable and efficient method for developing MoE routing mechanisms.

Expert Initialization

We initialize the experts in a similar way to upcycling (Komatsuzaki et al., 2023), leveraging the weights of a dense model. In contrast, our approach emphasizes diversification among fine-grained experts to enhance the model’s representational breadth. Given the designated expert intermediate size $h_{\text{E}}$, the number of experts $n$, and the original FFN intermediate size $h_{\text{FFN}}$, the FFN is replicated $\lceil n \times h_{\text{E}} / h_{\text{FFN}} \rceil$ times. This replication ensures compatibility with the specified number of experts while accommodating any arbitrary expert intermediate size. To promote diversity within each FFN copy, parameters are shuffled along the intermediate dimension. This guarantees that each fine-grained expert exhibits unique characteristics, even across different FFN copies. Subsequently, these experts are extracted from the FFN copies, and the remaining dimensions are discarded. For each fine-grained expert, 50% of its parameters are randomly reinitialized. This process introduces additional stochasticity into expert initialization, potentially enhancing the model’s capacity for exploration during training.
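The procedure can be summarized by the sketch below (our reading of the description; only one projection matrix of the FFN is shown, and the re-initialization scale is an assumption).

```python
# A minimal sketch of the described expert initialization: replicate the dense FFN
# ceil(n * h_E / h_FFN) times, shuffle along the intermediate dimension, slice
# per-expert blocks, and randomly re-initialize 50% of each expert's parameters.
import math
import torch

def init_experts(w_ffn: torch.Tensor, n_experts: int, h_expert: int) -> list:
    h_ffn, hidden = w_ffn.shape                   # dense up-projection: (h_FFN, hidden)
    copies = math.ceil(n_experts * h_expert / h_ffn)
    big = w_ffn.repeat(copies, 1)                 # replicate along the intermediate dim
    big = big[torch.randperm(big.size(0))]        # shuffle intermediate rows for diversity
    experts = []
    for i in range(n_experts):
        w = big[i * h_expert:(i + 1) * h_expert].clone()
        mask = torch.rand_like(w) < 0.5           # re-initialize 50% of parameters
        w[mask] = torch.randn(int(mask.sum())) * 0.02   # scale 0.02 is an assumption
        experts.append(w)                         # leftover rows of `big` are discarded
    return experts

# Sizes from Table 1 (Qwen2-7B FFN upscaled to 64 experts of intermediate size 2,560).
experts = init_experts(torch.randn(18944, 3584), n_experts=64, h_expert=2560)
print(len(experts), experts[0].shape)             # 64 experts of shape (2560, 3584)
```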

2.2.3 Model Configuration

In the following, we provide the key configuration and information for the Qwen2 series.

The Qwen2 series consists of models of 5 sizes, which are Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Table 1 lists the hyper-parameters and important information, e.g., the number of pre-trained tokens. Particularly, Qwen2-57B-A14B is upscaled from Qwen2-7B. Notably, Qwen2 models demonstrate a substantially lower Key-Value (KV) size per token relative to Qwen1.5 models. This characteristic translates into a reduced memory footprint, particularly advantageous in long-context inference tasks.

3 Pre-training

In the pre-training of Qwen2, our efforts were focused on refining the dataset and investigating methods to handle extended context lengths effectively.

3.1 Pre-training Data

The pre-training of the Qwen2 models involves the development of a new, large-scale, high-quality multilingual dataset. This dataset represents an improvement over the corpora used in previous Qwen and Qwen1.5 models (Bai et al., 2023a; Qwen Team, 2024a), enhancing the scale, quality, and diversity of the pre-training data in several key areas:

Quality Enhancement

The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data.

Data Expansion

Compared to Qwen1.5 (Qwen Team, 2024a), we have collected a significantly larger volume of high-quality code, mathematics, and multilingual data, enhancing the model’s capabilities in respective areas. This new dataset supports approximately 30 languages, such as English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, and Vietnamese.

Distribution Improvement

To ensure the model learns the distribution akin to human-like learning, we conduct experiments on scaled-down models to optimize the mixing of data from various sources and domains.

Based on these enhancements, the pre-training data was expanded from 3 trillion tokens in Qwen1.5 (Qwen Team, 2024a) to 7 trillion tokens. An attempt to further relax the quality threshold resulted in a 12 trillion token dataset. However, the model trained on this dataset did not show a significant performance improvement over the 7 trillion token model. It is suspected that increasing the volume of data does not necessarily benefit model pre-training. Considering training costs, we opted to use the higher-quality 7 trillion token dataset for training larger models, leaving further exploration for future model iterations.

All Qwen2 dense models, excluding Qwen2-0.5B, were pre-trained on this large-scale dataset of over 7 trillion tokens. Qwen2-0.5B was pre-trained using the 12 trillion token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

3.2 Long-context Training

To enhance the long-context capability of Qwen2, we augmented the context length from 4,096 tokens to 32,768 tokens during the concluding phase of pre-training. This expansion was complemented by the introduction of a significantly increased volume of high-quality, lengthy data. In conjunction with these enhancements, we modified the base frequency of RoPE from 10,000 to 1,000,000 to optimize performance in long-context scenarios (Xiong et al., 2023).
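The effect of the base-frequency change can be seen directly in the standard RoPE inverse-frequency formula; the sketch below (illustrative, not the training code) compares the rotation angle of the slowest-rotating dimension at position 32,768 under the two bases.

```python
# A minimal sketch of standard RoPE inverse frequencies: raising the base from
# 10,000 to 1,000,000 slows the low-frequency dimensions so distant positions
# stay distinguishable within a 32K-token window.
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

short = rope_inv_freq(128, base=10_000.0)
long = rope_inv_freq(128, base=1_000_000.0)

# Angle of the slowest dimension at position 32,768: with the larger base it has
# completed far less than one full turn (~0.04 rad vs ~3.8 rad).
pos = 32_768
print(pos * short[-1].item(), pos * long[-1].item())
```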

To fully leverage the model’s length extrapolation potential, we adopted the YARN mechanism (Peng et al., 2023) and the Dual Chunk Attention mechanism (An et al., 2024). These strategies enable the model to process sequences of up to 131,072 tokens while maintaining high performance, as evidenced by minimal perplexity degradation in preliminary experiments.

4 Post-training

Following extensive large-scale pre-training, we engage in a post-training phase for Qwen2. This process is pivotal in enhancing its proficiency across a broad spectrum of domains, including coding, mathematics, logical reasoning, instruction following, and multilingual comprehension. Moreover, it ensures that the generation from the models is in harmony with human values, making it helpful, honest, and harmless. Unlike traditional methods that heavily rely on extensive human supervision, our approach focuses on scalable alignment with minimal human annotation (Cao et al., 2024). Specifically, we investigate methods to acquire high-quality demonstration and preference data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), aiming to minimize the need for human labeling while maximizing the quality and reliability of the data.

4.1 Post-training Data

The post-training data primarily consists of two components: demonstration data $\mathcal{D}=\{(x_i, y_i)\}$ and preference data $\mathcal{P}=\{(x_i, y_i^{+}, y_i^{-})\}$, where $x_i$ represents the instruction, $y_i$ represents a satisfactory response, and $y_i^{+}$ and $y_i^{-}$ are two responses to $x_i$, with $y_i^{+}$ being the preferred choice over $y_i^{-}$. The set $\mathcal{D}$ is utilized in SFT, whereas $\mathcal{P}$ is employed in RLHF.
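For concreteness, the two corpora can be pictured as the following records (an illustrative sketch; the field names are ours, not the report's).

```python
# A minimal sketch of the SFT set D of (instruction, response) pairs and the
# preference set P of (instruction, chosen, rejected) triples used for RLHF.
from dataclasses import dataclass

@dataclass
class Demonstration:        # element of D = {(x_i, y_i)}
    instruction: str        # x_i
    response: str           # y_i, a satisfactory response

@dataclass
class Preference:           # element of P = {(x_i, y_i^+, y_i^-)}
    instruction: str        # x_i
    chosen: str             # y_i^+, preferred response
    rejected: str           # y_i^-, dispreferred response

sft_example = Demonstration("List three prime numbers.", "2, 3, and 5.")
pref_example = Preference("List three prime numbers.", "2, 3, and 5.", "1, 2, and 3.")
```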

The construction of training data entails a two-step process: collaborative data annotation and automated data synthesis. First, we extract the data ontology from large-scale instruction corpora, leading to a broad and diverse set of high-quality instructions. These instructions are systematically enhanced to incorporate greater complexity. Through human annotation, we obtain the target response $y_i$ and their positive and negative counterparts $(y_i^{+}, y_i^{-})$. Subsequently, a variety of automated alignment strategies are employed to synthesize a substantial volume of artificially annotated data across the domains of code, mathematics, instruction-following, creation, role-playing, and safety.

4.1.1 Collaborative Data Annotation

Automatic Ontology Extraction

The process initiates with the application of InsTag (Lu et al., 2024c), an open-set fine-grained tagger, to extract the underlying ontology from a large-scale instruction dataset. Subsequent manual refinement ensures the accuracy of the extracted ontology.

Instruction Selection

Each instruction, with tags annotated, is evaluated for tag diversity, semantic richness, complexity, and intent completeness. Based on these criteria, we select a set of representative instructions (Dong et al., 2023).

Instruction Evolution

To enrich the instruction dataset, a self-evolution strategy (Zhao et al., 2024) is employed, prompting the Qwen models to add constraints or requirements to existing instructions, thereby increasing their complexity and ensuring a diverse range of difficulty levels within the dataset.

Human Annotation

Multiple responses to an instruction are obtained using diverse generation strategies and Qwen models of different scales. Annotators rank these responses based on their preferences, ensuring the best response meets established criteria, yielding both demonstration and preference data.

4.1.2 Automated Data Synthesis

Maintaining the quality of annotations for responses to instructions presents significant challenges at scale, particularly for those that require expertise, experience, carefulness, or patience. To address these challenges, we devised various automated alignment strategies to synthesize data at scale.

Rejection Sampling

For mathematical or similar tasks with definitive final answers, rejection sampling (Yuan et al., 2023) is applied to improve the quality of solutions. Large language models (LLMs) are tasked to generate multiple responses, namely the reasoning paths, for each instruction. Paths that result in accurate conclusions and are considered reasonable by the model are preserved, serving as demonstration data. Preference data is generated by contrasting correct and incorrect paths.
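A minimal sketch of this procedure is given below (illustrative; the sampler and helper names are assumptions, not the actual pipeline).

```python
# A minimal sketch of rejection sampling for tasks with a checkable final answer:
# keep correct reasoning paths as demonstrations and pair them with incorrect
# paths to form preference data.
import random

def rejection_sample(instruction, reference_answer, sample_path, n=8):
    """`sample_path` is a hypothetical callable returning (reasoning, final_answer)."""
    correct, incorrect = [], []
    for _ in range(n):
        reasoning, answer = sample_path(instruction)
        (correct if answer == reference_answer else incorrect).append(reasoning)
    demonstrations = [(instruction, r) for r in correct]
    preferences = [(instruction, c, w) for c in correct for w in incorrect]
    return demonstrations, preferences

# Toy stand-in for an LLM sampler: sometimes adds correctly, sometimes not.
def toy_sampler(instruction):
    wrong = random.random() < 0.4
    return ("17 + 25 = 43" if wrong else "17 + 25 = 42", "43" if wrong else "42")

demos, prefs = rejection_sample("Compute 17 + 25.", "42", toy_sampler)
print(len(demos), len(prefs))
```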

Execution Feedback

For coding tasks, LLMs are employed to generate solutions and associated test cases. The efficacy of these solutions is evaluated by compiling and executing them against the test cases, thereby creating demonstration and preference data. This methodology is also applicable to assessing instruction following (Dong et al., 2024). For each instruction with constraints, e.g., length limit, the LLM is tasked to generate a Python verification function to ensure the response aligns with the instruction requirements.
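For instance, a length-constrained instruction might yield a verification function like the sketch below (illustrative only; in practice the verifier is generated by the LLM for each instruction).

```python
# A minimal sketch of a constraint-verification function for the instruction
# "answer in at most 50 words"; responses passing or failing it become
# demonstration and preference data.
def verify_response(response: str, max_words: int = 50) -> bool:
    """Return True iff the response satisfies the word-limit constraint."""
    return len(response.split()) <= max_words

candidates = [
    "Paris is the capital of France.",
    "Paris " * 60,   # too long, violates the constraint
]
for c in candidates:
    print(verify_response(c))
```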

Data Repurposing

Creating skilled responses in literary writing tasks is challenging for annotators without specialized training. To tackle this problem, we aggregate high-quality literary works from the public domain and employ LLMs to develop instructions with varying levels of detail. These instructions, paired with the original works, serve as demonstration data. For example, to compile roleplay data with vivid and engaging responses, we source detailed character profiles from knowledge repositories such as Wikipedia and instruct LLMs to generate corresponding instructions and responses (Lu et al., 2024b). This process, similar to a reading comprehension task, ensures that the integrity of the character’s profile is maintained.

Constitutional Feedback

Constitutional AI refers to the process of guiding LLMs to generate responses based on predefined sets of principles (Bai et al., 2022). To ensure adherence to guidelines such as safety and values, a constitution dataset was compiled. This dataset delineates principles to be followed and those to be avoided. It was used to instruct LLMs to produce responses that either align with or deviate from these guidelines, serving as a reference for demonstration and preference data.

4.2 Supervised Fine-tuning

We have assembled an extensive instruction dataset featuring more than 500,000 examples that cover skills such as instruction following, coding, mathematics, logical reasoning, role-playing, multilingualism, and safety. Our model was fine-tuned for two epochs with a sequence length of 32,768 tokens. To optimize learning, the learning rate was gradually decreased from $7\times 10^{-6}$ to $7\times 10^{-7}$. To address overfitting, we applied a weight decay of 0.1 and gradients were clipped at a maximum value of 1.0.
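The reported hyper-parameters can be assembled into a training step as in the sketch below (the optimizer choice and exact decay schedule are our assumptions; the report only states that the learning rate is gradually decreased, and the model is assumed to expose a Hugging Face-style causal-LM loss).

```python
# A minimal sketch of the SFT setup: learning rate decayed from 7e-6 to 7e-7,
# weight decay 0.1, gradient clipping at 1.0, 32,768-token sequences, two epochs.
import torch

def make_sft_optimizer(model, total_steps):
    opt = torch.optim.AdamW(model.parameters(), lr=7e-6, weight_decay=0.1)
    # Linear decay to 1/10 of the peak rate; the exact schedule is an assumption.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: 1.0 - 0.9 * min(step / max(total_steps, 1), 1.0))
    return opt, sched

def sft_step(model, batch, opt, sched):
    loss = model(**batch).loss                    # causal LM loss on packed sequences
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    opt.step()
    sched.step()
    opt.zero_grad()
    return loss.item()
```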

4.3 Reinforcement Learning from Human Feedback

Our training regime for RLHF comprises two sequential stages: offline and online training. In the offline training stage, we use a pre-compiled preference dataset $\mathcal{P}$ to maximize the difference in likelihood between $y_i^{+}$ and $y_i^{-}$ with Direct Preference Optimization (DPO, Rafailov et al., 2023). In the online training stage, the model iteratively refines its performance in real-time, leveraging reward models for immediate feedback. Specifically, we sample multiple responses from the current policy model, and the reward model selects the most and the least preferred responses, forming preference pairs that are used for DPO in each episode. Moreover, we employ Online Merging Optimizer (Lu et al., 2024a) to mitigate the alignment tax, i.e., the performance degradation associated with aligning model generation with human preferences.
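For reference, the offline-stage objective corresponds to the standard DPO loss sketched below (illustrative code, not the training implementation; the online stage additionally uses a reward model to select the preference pairs in each episode).

```python
# A minimal sketch of the DPO objective: given summed log-probabilities of the
# chosen (y+) and rejected (y-) responses under the policy and a frozen reference
# model, maximize the margin between the two implicit rewards.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """All inputs are full-response log-probs of shape (batch,)."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-20.0]),
                torch.tensor([-13.0]), torch.tensor([-18.0]))
print(loss.item())
```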

5 Evaluation

To thoroughly assess the Qwen2 models, consisting of both base and instruction-tuned models, we implement a comprehensive evaluation protocol. This protocol examines a range of competencies, including general knowledge understanding, language comprehension, generation, coding, mathematics, reasoning, and additional areas of expertise. Specifically, base models are assessed using established benchmark datasets for large language models (LLMs), with responses elicited through few-shot prompting, unless specified otherwise. For instruction-tuned models, in addition to benchmark evaluations, we prioritize human preference assessments.

5.1 Base Language Models

In this section, we illustrate the evaluation of the base language models of the Qwen2 series. Specifically, we evaluate the models on benchmark datasets for knowledge and basic capabilities and apply multilingual benchmark datasets to evaluate their support of languages. As there are multiple model sizes, we compare them with the state-of-the-art (SOTA) models of similar or larger sizes.

5.1.1 Core Capabilities

Table 2: Performance of the 70B+ models. We compare Qwen2-72B with the baselines, including Mixtral-8x22B, Llama-3-70B, Qwen1.5-110B, and Qwen1.5-72B. For most datasets, Qwen2-72B demonstrates advantages over the baselines.
Datasets         Mixtral-8x22B  Llama-3-70B  Qwen1.5-72B  Qwen1.5-110B  Qwen2-72B
English
MMLU             77.8           79.5         77.5         80.4          84.2
MMLU-Pro         49.5           52.8         45.8         49.4          55.6
GPQA             34.3           36.3         36.3         35.9          37.9
Theorem QA       35.9           32.3         29.3         34.9          43.1
BBH              78.9           81.0         65.5         74.8          82.4
HellaSwag        88.7           88.0         86.0         87.5          87.6
Winogrande       85.0           85.3         83.0         83.5          85.1
ARC-C            70.7           68.8         65.9         69.6          68.9
TruthfulQA       51.0           45.6         59.6         49.6          54.8
Coding
HumanEval        46.3           48.2         46.3         54.3          64.6
MBPP             71.7           70.4         66.9         70.9          76.9
EvalPlus         54.1           54.8         52.9         57.7          65.4
MultiPL-E        46.7           46.3         41.8         52.7          59.6
Mathematics
GSM8K            83.7           83.0         79.5         85.4          89.5
MATH             41.7           42.5         34.1         49.6          51.1
Chinese
C-Eval           54.6           65.2         84.1         89.1          91.0
CMMLU            53.4           67.2         83.5         88.3          90.1
Multilingual
Exam             63.5           70.0         66.4         75.6          76.6
Understanding    77.7           79.9         78.2         78.2          80.7
Mathematics      62.9           67.1         61.7         64.4          76.0
Translation      23.3           38.0         35.6         36.2          37.8
Benchmarks and Evaluation Protocol

The common practice of evaluating the core capabilities of base language models is the implementation of benchmark dataset evaluation with few-shot or zero-shot prompting. The evaluation mainly focuses on the model performance of natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, etc. The datasets for evaluation include MMLU (Hendrycks et al., 2021a) (5-shot), MMLU-Pro (Wang et al., 2024) (5-shot), GPQA (Rein et al., 2023) (5-shot), Theorem QA (Chen et al., 2023a) (5-shot), BBH (Suzgun et al., 2023) (3-shot), HellaSwag (Zellers et al., 2019) (10-shot), Winogrande (Sakaguchi et al., 2021) (5-shot), TruthfulQA (Lin et al., 2022a) (0-shot), ARC-C (Clark et al., 2018) (25-shot), HumanEval (Chen et al., 2021) (0-shot), MBPP (Austin et al., 2021) (0-shot), EvalPlus (Liu et al., 2023a) (0-shot), MultiPL-E (Cassano et al., 2023) (0-shot on Python, C++, Java, PHP, TypeScript, C#, Bash, and JavaScript), GSM8K (Cobbe et al., 2021) (5-shot), MATH (Hendrycks et al., 2021b) (4-shot), C-Eval (Huang et al., 2023) (5-shot), and CMMLU (Li et al., 2023) (5-shot). Multilingual datasets can be grouped into four categories: (a) Exam: M3Exam (5-shot, we only choose examples that require no image), IndoMMLU (Koto et al., 2023) (3-shot), ruMMLU (Fenogenova et al., 2024) (5-shot), and translated MMLU (Chen et al., 2023b) (5-shot on Arabic, Spanish, French, Portuguese, German, Italian, Japanese, and Korean); (b) Understanding: BELEBELE (Bandarkar et al., 2023) (5-shot), XCOPA (Ponti et al., 2020) (5-shot), XWinograd (Muennighoff et al., 2023) (5-shot), XStoryCloze (Lin et al., 2022b) (0-shot) and PAWS-X (Yang et al., 2019) (5-shot); (c) Mathematics: MGSM (Goyal et al., 2022) (8-shot CoT); and (d) Translation: Flores-101 (Goyal et al., 2022) (5-shot).

Table 3: Performance of the 30B+ dense models and 40B+ MoE models. Qwen2-57B-A14B, an MoE model with a total of 57 billion parameters and 14 billion activated parameters, is designed to match the performance of 30 billion parameter dense models. This comparison includes dense model baselines: Yi-1.5-34B and Qwen1.5-32B, as well as MoE baselines: Mixtral-8x7B and Jamba. Results demonstrate that Qwen2-57B-A14B achieves competitive performance overall, with a notable superiority in coding and mathematics tasks.
Datasets         Jamba   Mixtral-8x7B  Yi-1.5-34B  Qwen1.5-32B  Qwen2-57B-A14B
Architecture     MoE     MoE           Dense       Dense        MoE
# Act Params     12B     12B           32B         34B          14B
# Params         52B     47B           32B         34B          57B
English
MMLU             67.4    71.8          77.1        74.3         76.5
MMLU-Pro         -       41.0          48.3        44.0         43.0
GPQA             -       29.2          -           30.8         34.3
Theorem QA       -       23.2          -           28.8         33.5
BBH              45.4    50.3          76.4        66.8         67.0
HellaSwag        87.1    86.5          85.9        85.0         85.2
Winogrande       82.5    81.9          84.9        81.5         79.5
ARC-C            64.4    66.0          65.6        63.6         64.1
TruthfulQA       46.4    51.1          53.9        57.4         57.7
Coding
HumanEval        29.3    37.2          46.3        43.3         53.0
MBPP             -       63.9          65.5        64.2         71.9
EvalPlus         -       46.4          51.9        50.4         57.2
MultiPL-E        -       39.0          39.5        38.5         49.8
Mathematics
GSM8K            59.9    62.5          82.7        76.8         80.7
MATH             -       30.8          41.7        36.1         43.0
Chinese
C-Eval           -       -             -           83.5         87.7
CMMLU            -       -             84.8        82.3         88.5
Multilingual
Exam             -       56.1          58.3        61.6         65.5
Understanding    -       70.7          73.9        76.5         77.0
Mathematics      -       45.0          49.3        56.1         62.3
Translation      -       29.8          30.0        33.5         34.5
Qwen2-72B

In terms of the largest model of Qwen2, we compare Qwen2-72B with competitive baseline open-weight models, including Mixtral-8x22B (Jiang et al., 2024), Llama-3-70B (AI@Meta, 2024), as well as Qwen1.5-72B (Qwen Team, 2024a) and Qwen1.5-110B (Qwen Team, 2024b). The results are reported in Table 2. Qwen2-72B outperforms Llama-3-70B in general knowledge understanding on both MMLU and MMLU-Pro, achieving accuracy improvements of 4.7 and 2.8, respectively. In scientific assessments, Qwen2-72B demonstrates superiority over Llama-3-70B with enhancements of 1.6 and 9.8 on GPQA and Theorem QA. Upon enrichment of coding data, Qwen2-72B exhibits a significant 18.3 and 10.0 percentage point advantage over Qwen1.5-72B in HumanEval and MBPP evaluations. Enhanced mathematics-related data allows Qwen2-72B to outperform Qwen1.5-72B by 10.0 and 17.0 percentage points in the GSM8K and MATH benchmarks. Qwen2-72B displays reasoning capabilities equivalent to Llama-3-70B, considering BBH, Winogrande, and ARC-C, attributable to its improved coding and mathematical data. In assessing language understanding in Chinese, Qwen2-72B significantly outperforms Mixtral-8x22B and Llama-3-70B, and also outperforms Qwen1.5-72B.

Qwen2-57B-A14B

For the evaluation of the MoE model, Qwen2-57B-A14B is compared against baselines of similar sizes. These baselines include other MoE models, such as Mixtral-8x7B (Jiang et al., 2024) and Jamba (Lieber et al., 2024), and dense models, such as Yi-1.5-34B (Young et al., 2024) and Qwen1.5-32B (Qwen Team, 2024a), both of which have approximately 30 billion parameters. The results are shown in Table 3. We anticipate that Qwen2-57B-A14B, which activates 14 billion parameters, will match the performance of a 30 billion parameter dense equivalent Qwen2 model. Our evaluation reveals that Qwen2-57B-A14B performs comparably to Yi-1.5-34B in natural language understanding tasks. Moreover, it outperforms the baseline models in coding and mathematics tasks. Additionally, Qwen2-57B-A14B demonstrates robust Chinese language understanding capabilities, rivaling the larger Qwen2-72B model. In essence, Qwen2-57B-A14B is an efficient model that, while activating only 14 billion parameters per forward pass, maintains the performance level of a 30 billion parameter dense model.

Table 4: Performance of the 7B+ models. We compare Qwen2-7B with previously released state-of-the-art 7B+ models including Mistral-7B, Gemma-7B, Llama-3-8B, and our previous Qwen1.5-7B. Qwen2-7B demonstrates significant advantages over the baselines in most of the evaluation datasets.
Datasets         Mistral-7B  Gemma-7B  Llama-3-8B  Qwen1.5-7B  Qwen2-7B
English
MMLU             64.2        64.6      66.6        61.0        70.3
MMLU-Pro         30.9        33.7      35.4        29.9        40.0
GPQA             24.7        25.7      25.8        26.7        31.8
Theorem QA       19.2        21.5      22.1        14.2        31.1
BBH              56.1        55.1      57.7        40.2        62.6
HellaSwag        83.2        82.2      82.1        78.5        80.7
Winogrande       78.4        79.0      77.4        71.3        77.0
ARC-C            60.0        61.1      59.3        54.2        60.6
TruthfulQA       42.2        44.8      44.0        51.1        54.2
Coding
HumanEval        29.3        37.2      33.5        36.0        51.2
MBPP             51.1        50.6      53.9        51.6        65.9
EvalPlus         36.4        39.6      40.3        40.0        54.2
MultiPL-E        29.4        29.7      22.6        28.1        46.3
Mathematics
GSM8K            52.2        46.4      56.0        62.5        79.9
MATH             13.1        24.3      20.5        20.3        44.2
Chinese
C-Eval           47.4        43.6      49.5        74.1        83.2
CMMLU            -           -         50.8        73.1        83.9
Multilingual
Exam             47.1        42.7      52.3        47.7        59.2
Understanding    63.3        58.3      68.6        67.6        72.0
Mathematics      26.3        39.1      36.3        37.3        57.5
Translation      23.3        31.2      31.9        28.4        31.5
Qwen2-7B

The 7B model is widely utilized, as it enables execution in 16-bit floating point on accelerators equipped with 16GB of memory. Our focus is on comparing this model with other leading 7B models, including Llama-3-8B, which has recently demonstrated exceptional performance in the Chatbot Arena (Chiang et al., 2024). This comparison also includes Mistral-7B-v0.2 (Jiang et al., 2023a), Gemma-7B (Mesnard et al., 2024), and our predecessor, Qwen1.5-7B (Qwen Team, 2024a). The results can be found in Table 4. Qwen2-7B demonstrates superior performance across most datasets compared to other models, particularly excelling in coding tasks, mathematics, and Chinese language tasks. It also shows strong performance in multilingual understanding and exams. This indicates that Qwen2-7B has been optimized for a wide range of language and logic-based tasks, showcasing its versatility and advanced capabilities.

Table 5: Performance of the smaller models. We compare our Qwen2-0.5B and Qwen2-1.5B with the previous SOTA small models including Phi-2, Gemma-2B and Qwen1.5-1.8B. Qwen2-0.5B with a much smaller model size achieves competitive performance, and Qwen2-1.5B significantly outperforms Qwen2-0.5B.
Datasets           Phi-2  Gemma-2B  Qwen1.5-1.8B  Qwen2-0.5B  Qwen2-1.5B
# Non-Emb Params   2.5B   2.0B      1.2B          0.3B        1.2B
MMLU               52.7   42.3      46.8          45.4        56.5
MMLU-Pro           -      15.9      -             14.7        21.8
Theorem QA         -      -         -             8.9         15.0
BBH                43.4   35.2      24.2          28.4        37.2
HellaSwag          73.1   71.4      61.4          49.3        66.6
Winogrande         74.4   66.8      60.3          56.8        66.2
ARC-C              61.1   48.5      37.9          31.5        43.9
TruthfulQA         44.5   33.1      39.4          39.7        45.9
HumanEval          47.6   22.0      20.1          22.0        31.1
MBPP               55.0   29.2      18.0          22.0        37.4
GSM8K              57.2   17.7      38.4          36.5        58.5
MATH               3.5    11.8      10.1          10.7        21.7
C-Eval             23.4   28.0      59.7          58.2        70.6
CMMLU              24.2   -         57.8          55.1        70.3
Qwen2-1.5B & Qwen2-0.5B

To evaluate the performance of our smaller models, specifically Qwen2-1.5B and Qwen2-0.5B, we compare them against established baselines: Phi-2 (Abdin et al., 2024), Gemma-2B (Mesnard et al., 2024), and Qwen1.5-1.8B (Qwen Team, 2024a). The results are given in Table 5. In language understanding, Qwen2-1.5B outperforms Phi-2, a model trained on textbook-like data. For coding tasks, Qwen2-0.5B matches the performance of Gemma-2B and Qwen1.5-1.8B, while Qwen2-1.5B surpasses these baselines, except for Phi-2. Both Qwen2 models exhibit superior performance in mathematics compared to their competitors. In terms of general reasoning, we find that Phi-2 generally outperforms all others, which to some extent reflects the significance of textbook data for reasoning capabilities. In TruthfulQA, Qwen2-1.5B performs the best, demonstrating that smaller models do not necessarily suffer from hallucination. In Chinese language understanding, both Qwen2 models outperform all the others, a trend consistent with larger models in their respective comparisons.

In general, the Qwen2 series demonstrates superior performance against the baselines across different model sizes. Notably, Qwen2-72B exhibits the highest performance among all Qwen2 models, underscoring the efficacy of model size scaling.

5.2 Instruction-tuned Model

To critically evaluate instruction-tuned models, we implement a multifaceted approach. Assessments of foundational skills and human preferences are conducted using open datasets and benchmarks. Our detailed in-house examinations further probe model competencies in key areas. A particular focus is placed on assessing long context capability. Safety measures include multilingual safety assessments and red teaming exercises. The following sections detail the evaluation methods and their outcomes.

5.2.1 Open Benchmark Evaluation

To comprehensively evaluate the quality of instruction-tuned models, we combine automatic and human evaluations to assess capabilities and human preference. For the evaluation of basic capabilities, we apply datasets similar to those in the pre-trained model evaluation, which target natural language understanding, coding, mathematics, and reasoning. Specifically, we evaluate on MMLU, MMLU-Pro, GPQA, and Theorem QA for language understanding and knowledge, HumanEval, MBPP, MultiPL-E, and LiveCodeBench v1 (Jain et al., 2024) for coding, and GSM8K and MATH for mathematics. Additionally, we assess the performance of human preference alignment and instruction following by evaluating on benchmarks including MT-Bench (Zheng et al., 2023), Arena-Hard (Li et al., 2024), AlignBench (Liu et al., 2023b), MixEval (Ni et al., 2024), whose results approximate those of Chatbot Arena, and IFEval (Zhou et al., 2023) (for simplicity, we report the results of the subset strict-prompt) for instruction following.

Table 6: Performance of 70B+ instruction-tuned models. We compare Qwen2-72B-Instruct with Mixtral-8x22B-Instruct, Llama-3-70B-Instruct, Qwen1.5-72B-Chat, and Qwen1.5-110B-Chat. “-Instruct” or “-Chat” is omitted in the table. Qwen2-72B-Instruct demonstrates advantages in core capabilities, and superior performance in human preference alignment.
Datasets Mixtral-8x22B Llama-3-70B Qwen1.5-72B Qwen1.5-110B Qwen2-72B
English
MMLU 74.0 82.0 75.6 76.5 82.3
MMLU-Pro 56.1 56.2 51.7 50.5 64.4
GPQA 49.7 41.9 39.4 32.8 42.4
Theorem QA 40.8 42.5 28.8 18.8 44.4
Coding
HumanEval 73.8 81.7 71.3 74.4 86.0
MBPP 75.9 82.3 71.9 76.4 80.2
MultiPL-E 61.1 63.4 48.1 55.4 69.2
LiveCodeBench v1 21.8 29.3 17.9 25.3 35.7
Mathematics
GSM8K 89.1 93.0 82.7 84.5 93.2
MATH 47.4 50.4 42.5 42.0 69.0
Alignment
MT-Bench 8.66 8.95 8.61 8.88 9.12
MixEval 82.3 84.0 84.1 85.7 86.7
Arena-Hard 36.4 41.1 36.1 39.8 48.1
IFEval strict-prompt 67.1 77.3 55.8 57.5 77.6
AlignBench - 7.42 7.28 7.87 8.27
Qwen2-72B-Instruct

We compare Qwen2-72B-Instruct against instruction-tuned models including Mixtral-8x22B-Instruct, Llama-3-70B-Instruct, as well as Qwen1.5-72B-Chat. The results are presented in Table 6. They show that a strong base language model helps boost the downstream performance of the instruction-tuned model. Specifically, Qwen2-72B-Instruct outshines its peers in areas such as language understanding, coding, and mathematics, with the exception of GPQA and MBPP. Regarding human preference alignment and instruction following, Qwen2-72B-Instruct has significant advantages over the baselines. We attribute this achievement to both the high-quality pre-trained model and improvements in data and training techniques for post-training.

Table 7: Performance of 30B+ dense and 40B+ MoE instruction-tuned models. We compare Qwen2-57B-A14B-Instruct with the similar-size MoE model Mixtral-8x7B-Instruct, 30B dense models such as Yi-1.5-34B-Chat and Qwen1.5-32B-Chat. “-Instruct” or “-Chat” is omitted in the table. Qwen2-57B-A14B-Instruct is competitive with the recent SOTA 30B dense models, and significantly outcompetes the MoE baseline.
Datasets Mixtral-8x7B Yi-1.5-34B Qwen1.5-32B Qwen2-57B-A14B
Architecture MoE Dense Dense MoE
# Act Params 12B 32B 34B 14B
# Params 47B 32B 34B 57B
English
MMLU 71.4 76.8 74.8 75.4
MMLU-Pro 43.3 52.3 46.4 52.8
GPQA - - 30.8 34.3
Theorem QA - - 30.9 33.1
Coding
HumanEval 45.1 75.2 68.3 79.9
MBPP 59.5 74.6 67.9 70.9
MultiPL-E - - 50.7 66.4
LiveCodeBench v1 12.3 - 15.2 25.5
Mathematics
GSM8K 65.7 90.2 83.6 85.3
MATH 30.7 50.1 42.4 49.1
Alignment
MT-Bench 8.30 8.50 8.30 8.55
MixEval 70.0 81.7 81.0 82.3
IFEval strict-prompt - - 50.3 59.9
AlignBench 5.70 7.20 7.19 7.36
Qwen2-57B-A14B-Instruct

For medium-size models, we compare Qwen2-57B-A14B-Instruct with Mixtral-8x7B-Instruct, another MoE baseline, as well as dense SOTA models with over 30 billion parameters, e.g., Yi-1.5-34B-Chat and Qwen1.5-32B-Chat. The results are provided in Table 7. Compared with Qwen1.5-32B-Chat, Qwen2-57B-A14B-Instruct reaches superior performance in almost all benchmarks, and compared with the 30B SOTA model Yi-1.5-34B-Chat, Qwen2-57B-A14B-Instruct gains advantages in most evaluations except those on mathematics. In the evaluation of alignment, the advantages of Qwen2-57B-A14B-Instruct are especially evident.

Table 8: Performance of 7B+ instruction-tuned models. We compare Qwen2-7B-Instruct with the recent SOTA models with 7-9 billion parameters, including Llama-3-8B-Instruct, Yi-1.5-9B-Chat, GLM-4-9B-Chat, and Qwen1.5-7B-Chat. “-Instruct” or “-Chat” is omitted in the table. Qwen2-7B-Instruct demonstrates competitive performance against Llama-3-8B-Instruct.
Datasets Llama-3-8B Yi-1.5-9B GLM-4-9B Qwen1.5-7B Qwen2-7B
English
MMLU 68.4 69.5 72.4 59.5 70.5
MMLU-Pro 41.0 - - 29.1 44.1
GPQA 34.2 - - 27.8 25.3
Theorem QA 23.0 - - 14.1 25.3
Coding
HumanEval 62.2 66.5 71.8 46.3 79.9
MBPP 67.9 - - 48.9 67.2
MultiPL-E 48.5 - - 27.2 59.1
LiveCodeBench v1 17.3 - - 6.0 26.6
Mathematics
GSM8K 79.6 84.8 79.6 60.3 85.7
MATH 30.0 47.7 50.6 23.2 52.9
Alignment
MT-Bench 8.05 8.20 8.35 7.60 8.41
MixEval 75.0 74.2 - 71.4 76.5
IFEval strict-prompt 72.1 - 69.0 38.3 54.7
AlignBench 6.20 6.90 7.01 6.20 7.21
Qwen2-7B-Instruct

Within the spectrum of 7B to 9B models, we compare Qwen2-7B-Instruct with Llama-3-8B-Instruct, Yi-1.5-9B-Chat, GLM-4-9B-Chat, and Qwen1.5-7B-Chat. The results can be found in Table 8. Qwen2-7B-Instruct demonstrates substantial advancements compared to its predecessor, Qwen1.5-7B-Chat, across comprehensive evaluations, notably achieving higher scores in coding and mathematics-related tasks. Compared with the recent SOTA model Llama-3-8B-Instruct, Qwen2-7B-Instruct demonstrates competitive performance overall and superior performance in coding. Nonetheless, in terms of instruction following, Qwen2-7B-Instruct falls well behind this competitor. To address this limitation, we plan to augment the 7B model's instruction-following ability by enhancing the quality of post-training data, ensuring a more robust understanding and execution of complex commands.

Table 9: Performance of smaller instruction-tuned models. We compare both Qwen2-0.5B-Instruct and Qwen2-1.5B-Instruct with Qwen1.5-0.5B-Chat and Qwen1.5-1.8B-Chat. “-Instruct” or “-Chat” is omitted in the table. Compared with the similar-size baselines, Qwen2 significantly surpasses the performance of Qwen1.5.
Datasets Qwen1.5-0.5B Qwen2-0.5B Qwen1.5-1.8B Qwen2-1.5B
MMLU 35.0 37.9 43.7 52.4
HumanEval 10.4 29.9 27.4 47.0
MBPP 14.5 37.8 28.6 51.9
GSM8K 11.3 40.1 35.3 61.6
IFEval strict-prompt 14.6 20.0 16.8 29.0
Qwen2-1.5B-Instruct & Qwen2-0.5B-Instruct

In the context of smaller models, we compare Qwen2-0.5B-Instruct with Qwen1.5-0.5B-Chat, and Qwen2-1.5B-Instruct with Qwen1.5-1.8B-Chat. Notably, the complexity of certain datasets designed for larger models exceeds the capabilities of these smaller models; thus, our analysis focuses on a selected subset. As detailed in Table 9, the Qwen2 models demonstrate a marked advantage over their predecessors in both core capabilities and instruction-following tasks. This achievement is mainly attributable to the scaling of pre-training data. Consequently, our results affirm that data scaling remains an effective strategy for enhancing model performance, even in the domain of sub-billion-parameter models.

5.2.2 In-house Automatic Evaluation

Table 10: Performances of Qwen2-Instruct models on our in-house Chinese automatic evaluation benchmark. Scores of Qwen2 models surpassing their comparable-sized Qwen1.5 counterparts are in bold. Qwen2-57B-A14B-Instruct is compared with Qwen1.5-32B-Chat.
Models Knowledge Exam Comprehension Coding Math Reasoning Avg.
Proprietary LLMs
GPT-4o-2024-05-13 66.68 69.04 76.85 59.58 71.16 69.94 68.87
Qwen-Max-0428 76.65 74.80 73.66 49.48 66.01 70.84 68.57
Qwen1.5 Series
Qwen1.5-0.5B-Chat 28.55 36.99 29.70 3.82 13.10 25.47 22.94
Qwen1.5-1.8B-Chat 30.31 44.98 44.81 6.86 29.85 34.61 31.90
Qwen1.5-4B-Chat 33.67 47.17 50.44 14.05 36.20 39.98 36.92
Qwen1.5-MoE-A2.7B-Chat 52.76 60.49 52.84 19.34 38.45 43.07 44.49
Qwen1.5-7B-Chat 56.77 59.36 55.50 18.85 46.41 48.77 47.61
Qwen1.5-14B-Chat 63.35 66.13 60.06 28.19 54.80 50.20 53.79
Qwen1.5-32B-Chat 68.63 67.59 64.67 35.28 60.62 62.87 59.94
Qwen1.5-72B-Chat 71.52 70.04 66.70 38.22 63.09 61.30 61.81
Qwen1.5-110B-Chat 76.26 74.00 71.25 44.25 64.92 64.47 65.86
Qwen2 Series
Qwen2-0.5B-Instruct 28.18 38.09 35.90 9.40 21.20 25.61 26.40
Qwen2-1.5B-Instruct 35.46 51.93 44.70 14.05 34.58 35.94 36.11
Qwen2-7B-Instruct 61.54 66.66 59.63 34.74 60.99 58.22 56.96
Qwen2-57B-A14B-Instruct 64.15 73.67 67.52 40.66 63.90 59.89 61.63
Qwen2-72B-Instruct 76.19 75.65 74.72 49.53 70.80 70.59 69.58
Table 11: Performances of Qwen2-Instruct models on our in-house English automatic evaluation benchmark. Scores of Qwen2 models surpassing their comparable-sized Qwen1.5 and Llama-3 counterparts are in bold. Qwen2-57B-A14B-Instruct is compared with Qwen1.5-32B-Chat.
Models Knowledge Comprehension Coding Math Avg.
Proprietary LLMs
GPT-4o-2024-05-13 87.29 76.30 55.87 84.99 76.11
Qwen-Max-0428 80.73 71.63 48.76 79.12 70.06
Qwen1.5 Series
Qwen1.5-0.5B-Chat 30.12 25.44 1.78 15.48 18.21
Qwen1.5-1.8B-Chat 40.37 41.87 4.99 29.71 29.23
Qwen1.5-4B-Chat 51.44 50.16 15.45 44.83 40.47
Qwen1.5-MoE-A2.7B-Chat 61.64 54.79 21.28 50.46 47.04
Qwen1.5-7B-Chat 64.86 58.61 20.79 54.24 49.62
Qwen1.5-14B-Chat 74.41 59.80 28.18 66.91 57.32
Qwen1.5-32B-Chat 76.38 64.70 37.39 73.04 62.88
Qwen1.5-72B-Chat 77.59 67.58 37.30 73.76 64.06
Qwen1.5-110B-Chat 78.29 70.17 44.12 78.87 67.86
Llama-3 Series
Llama-3-8B-Instruct 71.01 64.71 42.56 65.82 61.03
Llama-3-70B-Instruct 83.06 76.31 57.18 79.70 74.06
Qwen2 Series
Qwen2-0.5B-Instruct 43.19 29.57 6.95 31.52 27.81
Qwen2-1.5B-Instruct 56.03 45.08 17.61 50.44 42.29
Qwen2-7B-Instruct 73.75 63.09 36.41 75.67 62.23
Qwen2-57B-A14B-Instruct 76.80 67.92 42.37 77.04 66.03
Qwen2-72B-Instruct 83.00 73.58 53.03 82.15 72.94

Despite the availability of a number of open benchmark datasets for evaluation, we believe that they are far from sufficient to fully comprehend the capabilities of LLMs. Specifically, we have built a series of in-house datasets that assess different capabilities of the models, e.g., knowledge understanding, text generation, coding, etc. The evaluation is conducted in Chinese and English. The results are gathered in Table 10 and Table 11, respectively.

Chinese Evaluation

For the evaluations in Chinese, we focus on comparing the performance of Qwen2 models with their Qwen1.5 counterparts. For the small models, Qwen2-1.5B-Instruct generally outperforms Qwen1.5-1.8B-Chat in almost all evaluations, even with fewer parameters. In the comparison of 7B models, the advantages of Qwen2 are more significant. Noteworthy is Qwen2-72B-Instruct's superior performance over Qwen1.5-110B-Chat, despite the latter's substantially larger parameter count. The MoE model displays superior performance across most domains relative to Qwen1.5-32B-Chat, excluding knowledge understanding. This discrepancy may be attributed to a shortage of pre-training tokens. In the near future, we plan to continue the pre-training of the MoE model to explore its scaling behavior.

English Evaluation

For English, we compare Qwen2 with both Qwen1.5 and Llama-3. Similarly, the small models of Qwen2 significantly outcompete their Qwen1.5 counterparts. However, in comparison with Llama-3-70B, Qwen2-72B-Instruct falls behind by small margins, especially in comprehension and coding. We assume that both the amount of English tokens used for pre-training and the quantity and diversity of the post-training data contribute to the performance gap in English.

5.2.3 Long Context Capabilities

Three methods to evaluate long context capabilities are employed: the Needle in a Haystack (NIAH, Kamradt, 2023), NeedleBench (OpenCompass Contributors, 2023), and LV-Eval (Yuan et al., 2024).

Needle in a Haystack

This experiment assesses a model's proficiency in pinpointing facts within voluminous texts. Texts of 8K, 16K, …, 128K tokens in length were crafted, with facts strategically positioned at varying depths. Each depth interval, e.g., from 0% to 10%, encompassed two instances. For contexts over 32K tokens, YARN (Peng et al., 2023) was applied in this evaluation. As illustrated in Figure 1, Qwen2-72B-Instruct exhibits exceptional accuracy in retrieving information from the entire 128K context. Coupled with its inherent strength, this model emerges as the optimal choice for processing extensive texts, assuming sufficient resources are available. Additionally, models within the same series showcase remarkable performance across different context lengths. Specifically, Qwen2-7B-Instruct achieves a high level of accuracy in handling contexts up to 128K tokens, Qwen2-57B-A14B-Instruct manages contexts up to 64K tokens proficiently, and the two smaller models in the Qwen2 series support contexts of 32K tokens.
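As a concrete illustration of the setup described above, the following is a minimal sketch of how a single NIAH test case can be constructed. The tokenizer interface, filler corpus, and the needle/question pair are illustrative assumptions, not the exact harness used for this evaluation.

# Minimal sketch of constructing one Needle-in-a-Haystack test case.
# Assumptions: `tokenizer` is any tokenizer exposing encode/decode,
# `filler_text` is a long distractor corpus, and the needle/question
# below are illustrative placeholders.
def build_niah_case(tokenizer, filler_text: str, context_len: int, depth: float):
    """Place a known fact ("needle") at a relative depth inside a long context."""
    needle = "The secret passcode for the archive is 7421."    # fact to retrieve
    question = "What is the secret passcode for the archive?"  # retrieval query

    filler_ids = tokenizer.encode(filler_text)
    needle_ids = tokenizer.encode(needle)

    budget = context_len - len(needle_ids)                     # tokens left for filler
    haystack = (filler_ids * (budget // len(filler_ids) + 1))[:budget]

    insert_at = int(depth * budget)                            # e.g. depth=0.05 -> ~5% into the text
    context_ids = haystack[:insert_at] + needle_ids + haystack[insert_at:]
    return tokenizer.decode(context_ids), question, needle

# Example: an 8K-token context with the needle in the 0-10% depth interval.
# context, question, answer = build_niah_case(tok, corpus, context_len=8192, depth=0.05)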

Figure 1: Performance of Qwen2 instruction-tuned models on the Needle in a Haystack test. All models that support context lengths above 32K tokens integrate the YARN mechanism.
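The YARN-based length extension referenced in Figure 1 is, in the public Qwen2 model cards, exposed as a rope_scaling entry in the model configuration. The snippet below is a minimal sketch of that configuration; the field names and the scaling factor follow the released documentation at the time of writing and should be treated as assumptions here rather than as a definitive setting.

# Sketch (assumption): rope_scaling settings for YARN-based context extension,
# following the convention described in the public Qwen2 model cards.
# Verify field names and values against the actual release before use.
yarn_rope_scaling = {
    "type": "yarn",                               # YARN interpolation scheme (Peng et al., 2023)
    "factor": 4.0,                                # 32K pre-trained window x 4 -> ~128K tokens
    "original_max_position_embeddings": 32768,    # context length used during pre-training
}
# Typically merged into the model's config before serving inputs longer than
# 32K tokens; within 32K tokens the model behavior is unchanged.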
Table 12: Performance of Qwen2-72B-Instruct and Qwen2-7B-Instruct on NeedleBench and LV-Eval. +YARN+DCA does not change the model behavior within 32k tokens.
Datasets NeedleBench LV-Eval
8k 32k 128k 256k 16k 32k 64k 128k 256k
ChatGLM4-9B-1M 56.61 49.15 44.30 45.29 46.40 43.23 42.92 40.41 36.95
Qwen2-7B-Instruct 87.07 73.64 38.77 2.92 49.77 46.93 28.03 11.01 0.55
    + YARN + DCA 66.32 60.71 42.14 36.64 34.72
Qwen2-72B-Instruct 91.90 92.01 73.05 17.13 58.82 56.70 42.92 31.79 2.88
    + YARN + DCA 90.27 85.21 53.03 48.83 42.35
NeedleBench

NeedleBench raises the challenge over NIAH by including multiple facts (two to five) in the passages, necessitating simultaneous identification and multi-hop reasoning. Table 12 reveals that the integration of YARN and DCA (An et al., 2024) notably improves the long-context abilities of the Qwen2 models. Qwen2-7B-Instruct surpasses ChatGLM4-9B-1M (Zeng et al., 2024), which claims a 1M context length. Moreover, Qwen2-72B-Instruct demonstrates strong performance, with an accuracy reduction of just 6 points, whereas ChatGLM4-9B-1M shows a more pronounced decline of 11 points despite its lower initial accuracy.

LV-Eval

LV-Eval comprises 11 diverse QA datasets that demand comprehension of multiple pieces of evidence at once. To rectify the shortcomings of its original metric, which was excessively stringent and led to a high rate of false negatives, we adopt the keyword recall as the reported score. As shown in Table 12, integrating YARN and DCA substantially bolsters the long-context competencies of Qwen2 models on LV-Eval. Qwen2-7B-Instruct achieves parity with ChatGLM4-9B-1M, albeit with a more noticeable decline at extended contexts. Moreover, Qwen2-72B-Instruct demonstrates strong performance across all lengths, confirming its proficiency in handling long-context tasks.
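For reference, the following is a minimal sketch of the keyword-recall style of scoring described above, where the reported score is the fraction of gold answer keywords recovered in the model response. The exact normalization and keyword extraction used by LV-Eval may differ.

# Minimal sketch of keyword-recall scoring in the spirit of the adjusted
# LV-Eval metric: the score is the fraction of reference keywords that
# appear in the model's response after light normalization.
import re

def keyword_recall(response: str, answer_keywords: list[str]) -> float:
    """Fraction of reference keywords recovered in the response."""
    norm = lambda s: re.sub(r"\s+", " ", s.lower()).strip()
    response_norm = norm(response)
    hits = sum(1 for kw in answer_keywords if norm(kw) in response_norm)
    return hits / max(len(answer_keywords), 1)

# Example: keyword_recall("The treaty was signed in 1648 in Munster.",
#                         ["1648", "Munster"]) -> 1.0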

5.2.4 Multilingual Evaluation

For the multilingual evaluation, we conduct a comprehensive human evaluation to assess multilingual capabilities. Specifically, we design diverse test cases that assess different capabilities of large language models, and the test cases cover a number of languages. For each language, we invite one professional annotator who majors in that language. For each test case, the annotator grades the model's response with a score from 1 to 5.
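The per-language scores reported in Table 13 are simple averages of these 1-to-5 ratings. A minimal sketch of that aggregation is given below; the data layout is illustrative, not the actual annotation format.

# Sketch of the aggregation implied by Table 13: each test case receives a
# 1-5 rating, and we report the mean rating per (language, model) pair.
from collections import defaultdict
from statistics import mean

def per_language_scores(annotations):
    """annotations: iterable of (language, model, score) with score in 1..5."""
    buckets = defaultdict(list)
    for language, model, score in annotations:
        buckets[(language, model)].append(score)
    return {key: round(mean(scores), 2) for key, scores in buckets.items()}

# Example: per_language_scores([("Korean", "Qwen2-72B-Instruct", 4), ...])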

Table 13: Performance of Qwen2-72B-Instruct and proprietary LLMs in multilingual human evaluation. We compare Qwen2-72B-Instruct with GPT-3.5-Turbo-1106, GPT-4-Turbo-0409, GPT-4o-0513, Claude-3-Opus-0229. Scores range from 1 to 5. Overall, Qwen2-72B-Instruct performs substantially better than GPT-3.5-Turbo but there is progress to be made to be competitive with the proprietary models released in the last 6 months.
Language GPT-3.5-Turbo GPT-4-Turbo GPT-4o Claude-3-Opus Qwen2-72B-Instruct
Arabic 2.52 3.44 3.55 4.15 3.86
French 3.47 4.19 4.16 4.23 4.01
Indonesian 3.56 4.09 4.39 4.40 3.83
Japanese 2.75 3.68 3.72 3.85 3.63
Korean 2.37 4.24 4.40 4.23 4.14
Portuguese 3.37 3.86 3.89 4.09 3.97
Russian 3.24 4.27 4.32 4.25 4.15
Spanish 4.07 4.08 4.26 4.31 4.10
Thai 3.38 4.11 4.09 4.01 3.75
Vietnamese 3.90 3.84 4.14 3.98 3.91
Average 3.16 3.98 4.09 4.15 3.93

We report the results of our model and the baselines on the evaluation across different languages. From Table 13, it can be seen that, on average, Qwen2-72B-Instruct significantly outperforms GPT-3.5-Turbo, is competitive with GPT-4-Turbo, and slightly falls behind Claude-3-Opus. This shows that our multilingual pre-training and instruction-tuning data contribute to the multilingual capabilities of Qwen2-72B-Instruct, which is competitive with most state-of-the-art proprietary LLMs.

5.2.5 Safety & Responsibility

Table 14: Performance of models in safety evaluation. We compare Qwen2-72B-Instruct with GPT-4 and Mixtral-8x22B-Instruct. The lower, the better. Qwen2-72B-Instruct rejected more prompts with risks than the competitors.
Risk Category GPT-4 Mixtral-8x22B Qwen2-72B-Instruct
Illegal 00.00 06.87 00.00
Fraud 03.40 08.49 02.41
Pornography 23.63 33.82 22.91
Privacy 03.37 15.03 02.47

LLMs with openly accessible weights effectively accelerate both research and applications. Moreover, we believe that it is crucial to build safe and responsible LLMs so that the effects of the misuse of AI technologies can be significantly alleviated.

We implement a multilingual safety evaluation that tests the LLMs in different languages. Specifically, we assess the safety performance of the models on topics including illegal behaviors, fraud, pornography, and privacy. We collected prompts prone to jailbreaking and use them to test whether the models can provide safe responses by rejecting such requests.

The results are presented in Table 14, which shows the proportion of harmful responses generated by the models; the lower, the better. It can be observed that Qwen2-72B-Instruct performs better than the proprietary model GPT-4 and significantly outperforms the open-weight model Mixtral-8x22B-Instruct. However, we believe that there is still much room for our model to improve to become safer and more responsible, especially in terms of pornography, which is a conventionally difficult category to differentiate even for humans.
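For clarity, the metric in Table 14 can be read as follows: for each risk category, the proportion of responses to risky prompts that are judged harmful rather than safely rejected. The sketch below assumes a per-response harmfulness label (from human annotators or a judge model) and is illustrative only, not the exact evaluation pipeline used here.

# Minimal sketch of the metric reported in Table 14: per risk category,
# the percentage of responses to risky (jailbreak-prone) prompts judged
# harmful, i.e. not safely rejected. The judging step itself is assumed
# to happen upstream (human annotation or a judge model).
def harmful_response_rate(judgements: list[tuple[str, bool]]) -> dict[str, float]:
    """judgements: (risk_category, is_harmful) pairs; returns % harmful per category."""
    totals: dict[str, int] = {}
    harmful: dict[str, int] = {}
    for category, is_harmful in judgements:
        totals[category] = totals.get(category, 0) + 1
        harmful[category] = harmful.get(category, 0) + int(is_harmful)
    return {c: 100.0 * harmful[c] / totals[c] for c in totals}

# Example: harmful_response_rate([("Fraud", False), ("Fraud", True)]) -> {"Fraud": 50.0}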

6 Conclusion

This technical report has presented the Qwen2 series, a versatile suite of foundational and instruction-tuned language models, ranging from 0.5 to 72 billion parameters, including models of dense and Mixture-of-Experts architecture. Qwen2 outperforms previous open-weight models, notably its predecessor Qwen1.5, and displays competitive performance against proprietary models across a broad spectrum of benchmarks in language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning. In this update, we have placed extra focus on long-context, multilingual, coding, and mathematics capabilities, as well as safety and responsibility. In a commitment to fostering innovation and accessibility within the community, we have made the Qwen2 model weights openly accessible, which enables researchers and developers to harness the full potential of Qwen2 in a variety of applications and research projects. Through these efforts, we aim to contribute to the advancement of AI technologies and their positive impact on society.

References

  • Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, and Yi Zhang. Phi-2: The surprising power of small language models, 2024. URL https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.
  • AI@Meta (2024) AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query Transformer models from multi-head checkpoints. In EMNLP, pp. 4895–4901. Association for Computational Linguistics, 2023.
  • An et al. (2024) Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. CoRR, abs/2402.17463, 2024.
  • Anthropic (2024) Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic AI, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. CoRR, abs/2309.16609, 2023a.
  • Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966, 2023b.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback. CoRR, abs/2212.08073, 2022.
  • Bandarkar et al. (2023) Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884, 2023.
  • Cao et al. (2024) Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, and Bowen Yu. Towards scalable automated alignment of LLMs: A survey. CoRR, abs/2406.01252, 2024.
  • Cassano et al. (2023) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Software Eng., 49(7):3675–3691, 2023.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
  • Chen et al. (2023a) Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In EMNLP, pp. 7889–7901. Association for Computational Linguistics, 2023a.
  • Chen et al. (2023b) Zhihong Chen, Shuo Yan, Juhao Liang, Feng Jiang, Xiangbo Wu, Fei Yu, Guiming Hardy Chen, Junying Chen, Hongbo Zhang, Li Jianquan, Wan Xiang, and Benyou Wang. MultilingualSIFT: Multilingual supervised instruction fine-tuning, 2023b. URL https://github.com/FreedomIntelligence/MultilingualSIFT.
  • Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. CoRR, abs/2403.04132, 2024.
  • Chu et al. (2023) Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. CoRR, abs/2311.07919, 2023.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
  • Dai et al. (2024) Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024.
  • Dauphin et al. (2017) Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 933–941. PMLR, 2017.
  • Dong et al. (2023) Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. CoRR, abs/2310.05492, 2023.
  • Dong et al. (2024) Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. CoRR, abs/2406.13542, 2024.
  • Fenogenova et al. (2024) Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton A. Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, and Sergey Markov. MERA: A comprehensive LLM evaluation in Russian. CoRR, abs/2401.04531, 2024.
  • Goyal et al. (2022) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguistics, 10:522–538, 2022.
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a.
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021b.
  • Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. In NeurIPS, 2023.
  • Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024.
  • Jiang et al. (2023a) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023a.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024.
  • Jiang et al. (2023b) Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and efficient pre-LN Transformers. CoRR, abs/2305.14858, 2023b.
  • Kamradt (2023) Gregory Kamradt. Needle in a haystack - pressure testing LLMs, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack.
  • Komatsuzaki et al. (2023) Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In ICLR. OpenReview.net, 2023.
  • Koto et al. (2023) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In EMNLP, pp. 12359–12374. Association for Computational Linguistics, 2023.
  • Li et al. (2023) Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. CoRR, abs/2306.09212, 2023.
  • Li et al. (2024) Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939, 2024.
  • Lieber et al. (2024) Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid Transformer-Mamba language model. CoRR, abs/2403.19887, 2024.
  • Lin et al. (2022a) Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In ACL (1), pp. 3214–3252. Association for Computational Linguistics, 2022a.
  • Lin et al. (2022b) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab, Veselin Stoyanov, and Xian Li. Few-shot learning with multilingual generative language models. In EMNLP, pp. 9019–9052. Association for Computational Linguistics, 2022b.
  • Liu et al. (2023a) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In NeurIPS, 2023a.
  • Liu et al. (2023b) Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, and Jie Tang. AlignBench: Benchmarking Chinese alignment of large language models. CoRR, abs/2311.18743, 2023b.
  • Lu et al. (2024a) Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, and Chang Zhou. Online merging optimizers for boosting rewards and mitigating tax in alignment. CoRR, abs/2405.17931, 2024a.
  • Lu et al. (2024b) Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. CoRR, abs/2401.12474, 2024b.
  • Lu et al. (2024c) Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In ICLR. OpenReview.net, 2024c.
  • Mesnard et al. (2024) Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on Gemini research and technology. CoRR, abs/2403.08295, 2024.
  • Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual generalization through multitask finetuning. In ACL (1), pp. 15991–16111. Association for Computational Linguistics, 2023.
  • Ni et al. (2024) Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, and Yang You. MixEval: Deriving wisdom of the crowd from LLM benchmark mixtures. CoRR, abs/2406.06565, 2024.
  • OpenAI (2022) OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/index/chatgpt/.
  • OpenAI (2024) OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
  • OpenCompass Contributors (2023) OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023. URL https://github.com/open-compass/opencompass.
  • Peng et al. (2023) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071, 2023.
  • Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavas, Olga Majewska, Qianchu Liu, Ivan Vulic, and Anna Korhonen. XCOPA: A multilingual dataset for causal commonsense reasoning. In EMNLP (1), pp. 2362–2376. Association for Computational Linguistics, 2020.
  • Qwen Team (2024a) Qwen Team. Introducing Qwen1.5, 2024a. URL https://qwenlm.github.io/blog/qwen1.5/.
  • Qwen Team (2024b) Qwen Team. Qwen1.5-110B: The first 100B+ model of the Qwen1.5 series, 2024b. URL https://qwenlm.github.io/blog/qwen1.5-110b/.
  • Qwen Team (2024c) Qwen Team. Qwen1.5-MoE: Matching 7B model performance with 1/3 activated parameters, 2024c. URL https://qwenlm.github.io/blog/qwen-moe/.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
  • Rajbhandari et al. (2022) Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 18332–18346. PMLR, 2022.
  • Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022, 2023.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, 2021.
  • Su (2023) Jianlin Su. The magical effect of the Bias term: RoPE + Bias = better length extrapolation, 2023. URL https://spaces.ac.cn/archives/9577.
  • Su et al. (2024) Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  • Suzgun et al. (2023) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In ACL (Findings), pp. 13003–13051. Association for Computational Linguistics, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.
  • Wang et al. (2024) Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
  • Xiong et al. (2023) Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoRR, abs/2309.16039, 2023.
  • Yang et al. (2019) Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In EMNLP/IJCNLP (1), pp. 3685–3690. Association for Computational Linguistics, 2019.
  • Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.AI. CoRR, abs/2403.04652, 2024.
  • Yuan et al. (2024) Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K. CoRR, abs/2402.05136, 2024.
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825, 2023.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL (1), pp. 4791–4800. Association for Computational Linguistics, 2019.
  • Zeng et al. (2024) Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. CoRR, abs/2406.12793, 2024.
  • Zhao et al. (2024) Yingxiu Zhao, Bowen Yu, Binyuan Hui, Haiyang Yu, Minghao Li, Fei Huang, Nevin L. Zhang, and Yongbin Li. Tree-Instruct: A preliminary study of the intrinsic relationship between complexity and alignment. In LREC/COLING, pp. 16776–16789. ELRA and ICCL, 2024.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
  • Zhou et al. (2023) Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911, 2023.