Evolutionary Optimization of Model Merging Recipes
《模型融合配方的进化优化》
Abstract 摘要
Large language models (LLMs) have become increasingly capable, but their development often requires substantial computational resources.
While model merging has emerged as a cost-effective and promising approach for creating new models by combining existing ones, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
大型语言模型(LLMs)的能力日益强大,但其开发通常需要大量计算资源。尽管模型融合作为一种高性价比且有前景的方法,可通过组合现有模型来创建新模型,但目前仍依赖人类直觉和领域知识,限制了其潜力。本文提出的进化方法通过自动发现多样化开源模型的有效组合,克服了这一限制,无需大量额外训练数据或算力即可利用其集体智能。我们的方法在参数空间和数据流空间同时运作,实现了超越单一模型权重层面的优化。该方法甚至支持跨领域融合,例如生成具备数学推理能力的日语 LLM。令人惊讶的是,我们的日语数学 LLM 在多个权威日语 LLM 基准测试中实现了最先进性能,即使面对参数量显著更大的模型也保持优势,尽管并未针对这些任务进行专门训练。此外,通过本方法生成的文化感知型日语视觉语言模型(VLM)在描述日本文化特定内容时表现出色,超越了此前所有日语 VLM。这项工作不仅为开源社区贡献了新的最先进模型,更开创了自动化模型组合的新范式,为基础模型开发探索替代性高效路径铺平了道路。
1 Introduction 1 引言
Model merging [22, 35], a recent development in the large language model (LLM) community, presents a novel paradigm shift. By strategically combining multiple LLMs into a single architecture, this exciting development has captured the attention of researchers due to its key advantage: it requires no additional training, making it an incredibly cost-effective approach for developing new models. This accessibility has fueled a surge in interest and experimentation with model merging. The Open LLM Leaderboard [27] is now dominated by merged models, showcasing its potential for democratizing foundation model development.
模型融合[22,35]作为大型语言模型(LLM)领域的最新进展,展现了一种范式转变。通过将多个 LLM 策略性地整合为单一架构,这一突破性进展因其核心优势——无需额外训练即可开发新模型的超高性价比——引起了研究者的广泛关注。这种易用性推动了模型融合研究的爆发式增长。Open LLM 排行榜[27]目前已被融合模型主导,充分证明其在 democratizing 基础模型开发方面的潜力。
However, model merging is considered by many to be a form of black art or alchemy, relying on the model maker’s intuition and instincts about model selection and merging recipes to create and refine a new model that performs well for a particular task. Furthermore, the model maker is often required to have some domain knowledge for the various different benchmark tasks.
Given the large diversity of open models and benchmarks in the community, human intuition can only go so far, and we believe a more systematic approach for discovering new model combinations will take things much further.
然而,模型合并被许多人视为一种"黑魔法"或炼金术,它依赖于模型制作者对模型选择和合并配方的直觉与本能,以创建并优化出在特定任务中表现优异的新模型。此外,模型制作者通常还需要具备各种不同基准任务的相关领域知识。鉴于社区中开源模型和基准测试的多样性,人类直觉的作用有限,我们相信采用更系统化的方法来发现新的模型组合将取得更大进展。
We believe evolutionary algorithms will be able to discover more effective model merging solutions, and thus provide a path for automating the creation of more capable models.
As a step towards this direction, in this work, we show that evolution can be employed to discover novel and unintuitive ways to merge multiple models to produce new models with a new combined ability.
In this work, we present a methodology that leverages evolutionary algorithms to facilitate the merging of foundation models. Our approach is distinguished by its ability to navigate both parameter space (weights) and the data flow space (inference path), proposing a framework that integrates these two dimensions.
我们相信进化算法能够发现更有效的模型合并方案,从而为自动化创建更强大模型提供路径。作为朝此方向迈出的一步,本研究证明进化算法可用于发现新颖且反直觉的模型合并方法,将多个不同模型合并为具有全新综合能力的新模型。我们提出了一种利用进化算法促进基础模型合并的方法论,其独特之处在于能同时探索参数空间(权重)与数据流空间(推理路径),提出了整合这两个维度的框架。
This work makes several key contributions to the field of foundation model development:
本研究为基础模型开发领域做出以下关键贡献:
1. Automated Model Composition: We introduce Evolutionary Model Merge, a general evolutionary method to automatically discover effective combinations of selected open-source models for creating new foundation models with user-specified capabilities. This approach harnesses the collective intelligence of existing open models, enabling the creation of powerful models without the need for extensive training data or compute.
1. 自动化模型组合:我们提出"进化模型合并"方法,这种通用进化算法能自动发现选定开源模型的有效组合,以创建具有用户指定能力的新基础模型。该方法利用现有开源模型的集体智慧,无需大量训练数据或算力即可创建强大模型。
2. Cross-Domain Merging: We demonstrate that our method can discover novel ways to merge models from disparate domains (e.g., non-English language and Math, non-English language and Vision), potentially exceeding the capabilities achievable through conventional human design strategies.
2. 跨领域合并:我们证明该方法能发现合并不同领域模型(如非英语语言与数学、非英语语言与视觉)的新颖方式,其能力可能超越传统人工设计策略。
3. State-of-the-Art Performance: We showcase the effectiveness of our method by automatically generating a Japanese LLM with Math reasoning capability and a Japanese Vision-Language Model (VLM). Notably, both models achieve state-of-the-art performance on various benchmarks, even without explicit optimization for those tasks.
3. 顶尖性能表现:通过自动生成具有数学推理能力的日语 LLM 和日语视觉语言模型(VLM),我们验证了方法的有效性。值得注意的是,这两个模型在多项基准测试中均达到顶尖水平,且未针对这些任务进行显式优化。
4. High Efficiency and Surprising Generalizability: We observe that our 7B parameter LLM surpasses the performance of some previous 70B parameter Japanese LLMs on benchmark datasets, highlighting the high efficiency and surprising generalization capability of our approach. We believe this model can serve as a strong general-purpose Japanese LLM.
4. 高效性与惊人泛化能力:我们观察到 7B 参数的 LLM 在基准数据集上超越此前部分 70B 参数日语 LLM 的表现,凸显了该方法的高效性和惊人泛化能力。我们相信该模型可成为强大的通用日语 LLM。
5. Culturally-Aware VLM: The generated Japanese VLM achieves top results when tested on a domestically-sourced dataset of Japanese image-description pairs, demonstrating its ability to handle Japanese culture-specific content.
5. 文化感知型 VLM:生成的日语 VLM 在日本本土采集的图文配对数据集测试中表现优异,彰显其处理日本文化特定内容的能力。
We are committed to open science and are excited to open-source our EvoLLM-JP and EvoVLM-JP, two state-of-the-art Japanese foundation models, to the community, enabling further research and development in the field. Our work challenges the conventional paradigm of expensive model development by demonstrating that our evolutionary-based method can produce competitive models without relying on gradient-based training. This paves the way for exploring alternative, potentially more efficient, approaches to foundation model development.
我们秉持开放科学精神,欣然向社区开源 EvoLLM-JP 和 EvoVLM-JP 这两个顶尖日语基础模型,推动该领域的持续研发。本研究通过证明基于进化的方法无需依赖梯度训练即可产出具有竞争力的模型,挑战了传统高成本模型开发范式,为探索更高效的基础模型开发方法开辟了新路径。
1.1 Background and Related Work
1.1 研究背景与相关工作
1.1.1 Overview of Model Merging.
1.1.1 模型融合概述
Model merging offers a novel approach to leverage the strengths of multiple pre-trained models. It allows us to combine task-specific models, each potentially fine-tuned for a particular downstream task, into a single unified model. This technique stands in contrast to traditional transfer learning, where a pre-trained model is further fine-tuned for a new task. While transfer learning offers advantages like improved performance and faster convergence, the resulting models are typically limited to single tasks. Model merging, on the other hand, strives to create a versatile and comprehensive model by combining the knowledge from multiple pre-trained models, potentially yielding a model capable of handling various tasks simultaneously.
模型融合开创性地整合多个预训练模型的优势,能将针对不同下游任务微调的专项模型统一为单一复合模型。与传统迁移学习(即对预训练模型进行新任务微调)相比,迁移学习虽具性能提升和快速收敛等优势,但所得模型通常局限于单一任务。模型融合则通过整合多个预训练模型的知识,致力于构建多功能综合模型,有望实现多任务并行处理能力。
While fine-tuning an existing model involves training an existing model further on a specialized dataset, model merging methods work by combining existing models without any additional gradient-based training, making it very cost-effective in terms of compute requirements. Many in the machine learning community, including researchers, hackers, and hobbyists alike, are using both fine-tuning and model-merging to create the best models for their purposes. In fact, this collaborative way of developing continually improved models has parallels with the traditional open-source software development approach [48] where the trend is for the community to develop foundational machine learning models like open source software [46, 47].
现有模型微调需在专业数据集上继续训练,而模型融合方法无需额外梯度训练即可组合现有模型,显著降低算力成本。机器学习社区的研究者、开发者和爱好者正同时采用微调与融合技术来打造最佳定制模型。这种持续优化模型的协作模式与传统的开源软件开发模式[48]异曲同工,当前趋势正是由社区像开发开源软件那样共同构建基础机器学习模型[46,47]。
A simple method of merging multiple models is to average the weights of multiple models fine-tuned from the same base initial model. This model soup approach [62] demonstrated significant improvements on relatively large image processing and image classification models. The work pointed to similarity between weighted model averaging leading to a flat local minima. Theoretical [18, 32, 45, 33] and empirical studies dating back from the 1990s [25, 26] demonstrated that flatter local optima generalize better to out-of-distribution shifts. More recent work [42, 15] examined the theoretical properties and practical implications of weight averaging applied to larger generative models.
合并多个模型的一种简单方法是对同一基础初始模型微调得到的多个模型的权重进行平均。这种模型汤(model soup)方法[62]在相对较大的图像处理和图像分类模型上展现出显著改进。该研究指出,权重平均带来的相似性会导致平坦的局部极小值。早至 20 世纪 90 年代的理论研究[18,32,45,33]和实证研究[25,26]表明,更平坦的局部最优解对分布偏移具有更好的泛化能力。近期工作[42,15]则探讨了权重平均应用于更大规模生成模型的理论特性和实际意义。
Linear weight averaging is performed as follows. Let $\theta_1, \theta_2 \in \mathbb{R}^d$ represent the weight vectors of two distinct models, where $d$ is the dimension of the weight space. The merged model's weights, denoted as $\theta_{\mathrm{merged}}$, are computed using the following equation: $\theta_{\mathrm{merged}} = \alpha \theta_1 + (1 - \alpha)\theta_2$.
Here, $\alpha \in [0, 1]$ is a weighting parameter that determines the relative contribution of each model to the final merged model. By adjusting $\alpha$, practitioners can control the balance between the two original models' characteristics in the resulting merged model.
线性权重平均按如下方式执行:设 $\theta_1, \theta_2 \in \mathbb{R}^d$ 表示两个不同模型的权重向量,其中 $d$ 为权重空间的维度。合并后的模型权重记为 $\theta_{\mathrm{merged}}$,通过以下公式计算:$\theta_{\mathrm{merged}} = \alpha \theta_1 + (1 - \alpha)\theta_2$。此处 $\alpha \in [0, 1]$ 是权重参数,用于确定每个模型对最终合并模型的相对贡献。通过调整 $\alpha$,实践者可以控制两个原始模型特性在合并结果中的平衡比例。
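To make the recipe concrete, below is a minimal sketch of linear weight averaging over two PyTorch state dicts. The helper name and the use of plain state dicts are illustrative choices of ours, not part of the original works; it assumes the two models share an identical architecture and key set.

```python
import torch

def linear_merge(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two models' weights: theta = alpha * theta_a + (1 - alpha) * theta_b."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]  # assumes identical architectures / key sets
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged

# Usage sketch: merged = linear_merge(model_a.state_dict(), model_b.state_dict(), alpha=0.3)
```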
Linear weight averaging not only works well for image processing and classification models, but is also effective for image generation models such as latent diffusion models. The release of Stable Diffusion [51] led to a variety of specialist fine-tunes of the model for a variety of different styles, from photorealism, to anime, to realistic painting styles. Popular open-source WebUIs [11] enabled practitioners to experiment with merging different models using linear or spherical linear interpolation (SLERP) [61] of weight parameters of distinct fine-tuned Stable Diffusion models. This has led to the discovery of merged models that combine strengths of specialized models into a single model.
For some time, the most popular Stable Diffusion models were neither the original base models nor the fine-tuned versions, but rather the merged models created by enthusiasts.
线性权重平均不仅适用于图像处理和分类模型,对潜在扩散模型等图像生成模型同样有效。Stable Diffusion[51]的发布催生了针对不同风格(从照片写实到动漫再到写实绘画风格)的多种专项微调模型。流行的开源 WebUI[11]使得从业者能够通过对不同微调 Stable Diffusion 模型的权重参数进行线性或球面线性插值(SLERP)[61]来尝试模型合并。这催生了许多将专项模型优势整合于一体的合并模型。有段时间最受欢迎的 Stable Diffusion 模型既非原始基础模型也非微调版本,而是爱好者创建的合并模型。
1.1.2 Merging Language Models
1.1.2 语言模型合并
Simple weight interpolation works well for merging image generation models that operate in a real-valued latent and output space. To our surprise, it also works for merging language models under the right conditions, though with performance issues [64]. Recently, new methods have been proposed to address merging language models specifically. One method, Task Arithmetic [28], involves building task vectors by subtracting pre-trained from fine-tuned model weights, enabling manipulation through arithmetic operations to steer the merged model's behavior.
虽然简单权重插值法适用于操作实数潜在空间和输出空间的图像生成模型合并,但令人惊讶的是,在适当条件下该方法仍可用于语言模型合并,不过存在性能问题[64]。近期已有专门针对语言模型合并的新方法提出。其中任务算术(Task Arithmetic)[28]通过从微调模型权重中减去预训练权重来构建任务向量,从而通过算术运算操控合并模型的行为。
Task Arithmetic operates as follows.
Let $\theta_{\mathrm{base}}$ denote the weights of a pre-trained base model.
For each task $t$, let $\theta_t$ represent the weights of a model fine-tuned for that specific task.
The task vector $\tau_t$ for each task $t$ is defined as the difference between the fine-tuned model weights and the base model weights: $\tau_t = \theta_t - \theta_{\mathrm{base}}$.
The weights of the new, merged model $\theta_{\mathrm{merged}}$ are then computed by adding a weighted sum of these task vectors to the base model weights: $\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \lambda \sum_{t} \tau_t$.
Here, $\lambda$ is a scaling parameter that determines the contribution of each task vector to the final merged model.
任务算术的工作原理如下。设 $\theta_{\mathrm{base}}$ 表示预训练基础模型的权重。对于每个任务 $t$,令 $\theta_t$ 代表针对该任务微调后的模型权重。每个任务 $t$ 的任务向量 $\tau_t$ 定义为微调模型权重与基础模型权重之差:$\tau_t = \theta_t - \theta_{\mathrm{base}}$。新合并模型 $\theta_{\mathrm{merged}}$ 的权重则通过将这些任务向量的加权和加到基础模型权重上来计算:$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \lambda \sum_{t} \tau_t$。此处 $\lambda$ 是缩放参数,用于决定各任务向量对最终合并模型的贡献程度。
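As an illustration, a minimal sketch of Task Arithmetic over state dicts follows; the helper names are hypothetical and a single scaling factor λ is shared across tasks, matching the formula above.

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    """tau_t = theta_t - theta_base, computed per parameter tensor."""
    return {k: finetuned[k] - base[k] for k in base}

def task_arithmetic_merge(base: dict, task_vectors: list, lam: float = 1.0) -> dict:
    """theta_merged = theta_base + lambda * sum_t tau_t."""
    merged = {k: v.clone() for k, v in base.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged
```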
A key problem of weight interpolation-based methods is that they neglect parameter interference, leading to performance degradation. A recent work [64] identifies two key sources of interference: redundant parameter values and conflicting parameter signs across models, and proposes a method that addresses these issues to achieve improved merging performance. The proposed TIES-Merging method addresses information loss in existing merging methods by incorporating three steps: resetting minimal parameter changes, resolving sign conflicts, and merging only aligned parameters. The trimming of redundant parameter changes in TIES-Merging not only resulted in higher performance of merged models, but also achieved better compression for the fine-tuned parameter updates. This compression of fine-tuned weight parameter updates is further investigated in ComPEFT [63].
基于权重插值方法的一个关键问题是忽视了参数干扰,导致性能下降。近期研究[64]指出了两种主要干扰源:模型间冗余的参数值和相互冲突的参数符号,并提出了一种解决这些问题以提升合并性能的方法。所提出的 TIES-Merging 方法通过三个步骤解决现有合并方法中的信息丢失问题:重置最小参数变化、解决符号冲突以及仅合并对齐参数。TIES-Merging 中对冗余参数变化的修剪不仅提高了合并模型的性能,还实现了对微调参数更新更好的压缩效果。这种对微调权重参数更新的压缩在 ComPEFT[63]中得到了进一步研究。
The three steps of TIES-Merging are implemented as follows. In the first step, minimal parameter changes are reset by retaining only the most influential components of each task vector and setting the rest to zero. This effectively removes redundant or insignificant modifications. The second step resolves sign conflicts across models by determining an aggregate sign for each parameter based on the overall direction of significant changes across all tasks. This ensures that conflicting updates do not cancel each other out. In the final step, only the aligned parameters are merged by selectively averaging the values of parameters that agree with the elected sign from the previous step, while excluding those that disagree. This selective merging process ensures that the final model incorporates only consistent and significant parameter changes from each task-specific model, thereby preserving task-relevant information while minimizing interference.
TIES-Merging 的三个步骤实现如下:第一步通过仅保留各任务向量 中最具影响力的组件并将其余部分归零,重置最小参数变更,有效剔除了冗余或无关紧要的修改。第二步通过综合所有任务中显著变化的整体方向,为每个参数确定聚合符号,从而解决模型间的符号冲突,确保相互冲突的更新不会彼此抵消。最后一步仅合并对齐参数,选择性平均那些与上一步选定符号相符的参数值,同时排除不一致的参数。这种选择性合并流程确保最终模型仅整合来自各任务专用模型的一致且重要的参数变更,从而在最小化干扰的同时保留任务相关信息。
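A simplified sketch of the three TIES-Merging steps (trim, elect sign, merge aligned parameters) on flattened task vectors is shown below. The top-k keep ratio and helper name are illustrative assumptions rather than the reference implementation.

```python
import torch

def ties_merge(task_vectors: list[torch.Tensor], keep_ratio: float = 0.2) -> torch.Tensor:
    """task_vectors: list of flattened tau_t tensors of equal shape; returns the merged task vector."""
    trimmed = []
    for tau in task_vectors:
        k = max(1, int(keep_ratio * tau.numel()))
        threshold = tau.abs().flatten().topk(k).values.min()
        # Step 1: trim -- keep only the most influential entries, zero out the rest.
        trimmed.append(torch.where(tau.abs() >= threshold, tau, torch.zeros_like(tau)))

    stacked = torch.stack(trimmed)
    # Step 2: elect an aggregate sign per parameter from the overall direction of the changes.
    elected_sign = torch.sign(stacked.sum(dim=0))

    # Step 3: merge only the entries that agree with the elected sign, averaging over agreeing models.
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged_tau = (stacked * agree).sum(dim=0) / counts
    return merged_tau  # added back to theta_base, optionally scaled by lambda
```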
Another recent work [66] proposes the DARE method, which goes further by zeroing out small differences between the fine-tuned model and the original base model, while amplifying the remaining differences. In practice, DARE [66] is often used together with Task Arithmetic [28] or TIES-Merging [64].
Specifically, DARE operates as follows. It applies a random mask $m \sim \mathrm{Bernoulli}(p)$ to the task vector $\tau_t$, where $p$ is the drop rate, and then rescales the remaining parameters:
另一项近期研究[66]提出了 DARE 方法,该方法通过将微调模型与原始基础模型之间的微小差异归零并放大其余差异,实现了更进一步的优化。实践中,DARE[66]常与任务算术[28]或 TIES-Merging[64]结合使用。具体而言,DARE 的操作流程如下:对任务向量 $\tau_t$ 施加随机掩码 $m \sim \mathrm{Bernoulli}(p)$(其中 $p$ 为丢弃率),然后对剩余参数进行重新缩放:
$\hat{\tau}_t = \dfrac{(1 - m) \odot \tau_t}{1 - p}$ (1)
The resulting sparsified task vector $\hat{\tau}_t$ is then used to merge models. This task vector can be utilized in the aforementioned linear task arithmetic (referred to as DARE-Linear) or TIES (referred to as DARE-TIES) methods.
得到的稀疏化任务向量 $\hat{\tau}_t$ 随后被用于模型合并。该任务向量可用于前文所述的线性任务算术(称为 DARE-Linear)或 TIES 方法(称为 DARE-TIES)。
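The drop-and-rescale step of DARE can be sketched as follows; this is a minimal illustration operating on a single flattened task vector, not the library implementation.

```python
import torch

def dare_sparsify(tau: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Randomly zero out a fraction p of the task vector and rescale the survivors by 1 / (1 - p)."""
    keep = torch.bernoulli(torch.full_like(tau, 1.0 - drop_rate))  # 1 = keep (prob 1 - p), 0 = drop
    return tau * keep / (1.0 - drop_rate)

# The sparsified vector can then be fed into Task Arithmetic (DARE-Linear) or TIES (DARE-TIES).
```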
Model merging is gaining momentum in ML research, and has been applied in areas such as multimodal model development [8, 59] and merging experts [8]. It is also becoming popular in both the broader image generation and LLM communities, where due to the low barrier to entry, many researchers and engineers alike are continually merging existing models to create new customized models.
模型合并在机器学习研究中势头正盛,已被应用于多模态模型开发[8,59]和专家合并[8]等领域。在图像生成和 LLM 社区中,由于入门门槛较低,越来越多的研究人员和工程师不断通过合并现有模型来创建新的定制化模型。
In the image generation domain, model merging techniques became widely accessible to practitioners especially after their implementation in open-source toolkits [11].
Similarly, language model merging really gained popularity after the implementation of mergekit [22, 35]. This toolkit provides all of the popular recipes for merging language models. In addition to simple linear and spherical interpolation, advanced recipes such as Task Arithmetic, TIES-Merging, and DARE are also implemented, offering users the ability to experiment with combining these methods to work with fine-tunes of popular base models such as Mistral [29]. With this toolkit, a large number of capable merged models have been developed by the community, and the top of the Open LLM Leaderboard [27] is gradually dominated by merged models produced by the community of language model enthusiasts.
在图像生成领域,模型合并技术尤其在开源工具包实现后,对从业者变得广泛可用[11]。类似地,语言模型合并在 mergekit 工具包[22,35]实现后才真正流行起来。该工具包提供了所有流行的语言模型合并配方。除了简单的线性和球面插值外,还实现了任务算术、TIES 合并、DARE 等高级配方,使用户能够尝试组合这些方法来处理流行基础模型(如 Mistral[29])的微调版本。借助该工具包,社区开发了大量性能优异的合并模型,Open LLM 排行榜[27]上的顶尖模型逐渐被语言模型爱好者社区生产的合并模型所主导。
Mergekit also introduced an additional method, called Frankenmerging, which is not based on weight merging, but instead lets the user experiment with stacking different layers from multiple models to sequentially create a new model. This method has the advantage of not tying the user to one particular family of models with a fixed architecture for merging (e.g. Mistral-based models), but potentially creating new architectures from merging entirely different models. It is noted that discovering new Frankenmerging techniques remains a challenge for the community, and requires much more trial and error to discover new recipes. To date, almost everyone uses a similar Frankenmerging merge recipe, and there has been very little trial and error to improve upon it. The field remains highly under-explored, and we believe this is where evolution can help.
Mergekit 还引入了一种称为 Frankenmerging 的额外方法,该方法不基于权重合并,而是让用户尝试堆叠来自多个模型的不同层来顺序创建新模型。这种方法的优势在于不将用户束缚于具有固定架构的特定模型系列(例如基于 Mistral 的模型),而可能通过合并完全不同的模型来创建新架构。值得注意的是,发现新的 Frankenmerging 技术对社区来说仍然是一个挑战,需要更多的试错来发现该技术的新配方。迄今为止,几乎每个人都使用类似的 Frankenmerging 合并配方,而对其进行改进的尝试非常有限。该领域仍处于高度未开发状态,我们相信这正是进化算法可以发挥作用的地方。
1.1.3 Connection to Evolutionary Neural Architecture Search
1.1.3 与进化神经架构搜索的关联
Model merging holds significant potential and democratizes the model-building process, making it accessible to a broad range of participants. However, it heavily relies on intuition and domain knowledge. Human intuition, however, has its limits. With the growing diversity of open models and tasks, we need a more systematic approach. We believe evolutionary algorithms, inspired by natural selection, can unlock more effective merging solutions. These algorithms can explore a vast space of possibilities, discovering novel and counter-intuitive combinations that traditional methods and human intuition might miss.
模型合并具有显著潜力,能够民主化模型构建过程,使其对广泛参与者开放。然而,该方法高度依赖直觉和领域知识,而人类直觉存在固有局限。随着开源模型和任务的多样性日益增长,我们需要更系统化的方法。受自然选择启发的进化算法,我们相信能够发掘更有效的合并方案。这类算法可探索广阔的潜在空间,发现传统方法和人类直觉可能忽略的新颖且反直觉的组合。
In deep learning, techniques such as Neural Architecture Search (NAS) [67] employed evolutionary techniques to discover new architectures [49, 56] that might be non-intuitive for human designers to discover. However, NAS-methods consumed significant computational resources given how each candidate model architecture needed to be trained. Our approach aims to save compute resources by discovering structures that take advantage of existing capabilities of existing Transformer blocks that had already consumed considerable resources to train in the first place.
在深度学习领域,神经架构搜索(NAS)[67]等技术曾运用进化方法发现人类设计者难以凭直觉构想的新架构[49, 56]。但由于每个候选模型架构都需要训练,NAS 方法消耗了大量计算资源。我们的方法通过利用已消耗大量资源预训练的 Transformer 模块的现有能力来发现结构,旨在节省计算资源。
Pre-trained transformer blocks are akin to neural network components one can choose to mix and match using an evolutionary architecture search algorithm. However, unlike NAS, we do not need to train the model, and can evaluate the candidates right away. In a sense, our work resonates with the goals of earlier works of morphology search such as NEAT [57] and more recent work such as Weight Agnostic Neural Networks [19] which evolved neural network structures with certain task-specific inductive biases, without the need to train the weight parameters using gradient descent. Similarly, this was how NAS methods such as SMASH [57] avoided costly inner-loop training through the application of a Hypernetwork [23] to estimate the weights of architectural candidates.
预训练的 Transformer 模块类似于可通过进化架构搜索算法混合搭配的神经网络组件。但与 NAS 不同,我们无需训练模型即可立即评估候选方案。从某种意义上,我们的工作与早期形态搜索研究(如 NEAT[57])和近期成果(如权重无关神经网络[19])目标一致——后者通过进化具有特定任务归纳偏好的神经网络结构,无需梯度下降训练权重参数。类似地,这也是 SMASH[57]等 NAS 方法通过超网络[23]估算架构候选权重来避免昂贵内循环训练的原理。
Recent work in language models tried to duplicate the layers of a pre-trained Transformer many times [59] as a starting point to pre-train a much larger network, to enable efficient training of much larger language models [65]. Similarly, in our approach, we are interested in applying automated search methods to discover the effective reuse of existing pre-trained layers.
语言模型的最新研究尝试通过多次复制预训练 Transformer 的层结构[59]作为预训练更大规模网络的起点,以实现高效训练超大规模语言模型[65]。类似地,我们的方法旨在应用自动化搜索技术,探索现有预训练层的高效复用方案。
In this work, we apply evolution to not only automate the optimization of model merging recipes in weight space, confined to a single architecture, but also explore applying evolution to optimize the stacking of layers from different models—a more involved approach that has the potential to create entirely novel neural architectures from existing building blocks. Unlike weight merging optimization which confine space of models to be merged together to fine-tuned descendents of the same parent base model, layer stacking optimization has no such constraints. Our work systematically explores the application of evolution applied in both parameter and layer space.
本研究不仅将进化算法应用于自动化优化单一架构内权重空间的模型融合配方,还探索了利用进化算法优化跨模型层堆叠——这种更复杂的方法有望用现有模块构建出全新的神经网络架构。与权重融合优化(要求合并模型必须源自同一基础模型的微调变体)不同,层堆叠优化不受此类限制。我们的工作系统性地探索了进化算法在参数空间和层空间的双重应用。
2 Methods 2 方法论
Our goal is to create a unified framework capable of automatically generating a merged model from a selection of foundation models, ensuring that the performance of this merged model surpasses that of any individual in the collection. Central to our approach is the application of evolutionary algorithms, which we employ to refine the intricacies involved in model merging. To systematically address this challenge, we first dissect the merging process into two distinct, orthogonal configuration spaces, analyzing their individual impacts. Building on this analysis, we then introduce a cohesive framework that seamlessly integrates these spaces. Figure 1 provides a schematic representation of our approach.
我们的目标是建立一个统一框架,能够从基础模型集合中自动生成融合模型,并确保其性能超越集合中的任何单个模型。该方法的核心理念是应用进化算法来优化模型融合的复杂过程。为系统解决这一挑战,我们首先将融合过程分解为两个正交的配置空间并分析其独立影响,继而提出整合这两个空间的协同框架。图 1 展示了我们方法的示意图解。

Figure 1: Overview of Evolutionary Model Merge. Our approach encompasses (1) evolving the weights for mixing parameters at each layer in the parameter space (PS); (2) evolving layer permutations in the data flow space (DFS); and (3) an integration of both strategies for merging in both PS and DFS. Note that merging in the PS is not simply copying and stitching individual layer parameters, but mixes the weights; this merging is analogous to blending colors as illustrated here (e.g., red and blue becoming purple). The questions are translated into English for the reader's convenience; the models operate on Japanese text.
图 1:进化式模型合并概览。我们的方法包含(1)在参数空间(PS)中进化各层混合参数的权重;(2)在数据流空间(DFS)中进化层排列顺序;(3)结合两种策略在 PS 与 DFS 中同时进行合并。需注意 PS 中的合并并非简单复制拼接层参数,而是对权重进行混合——这种合并类似于此处演示的颜色混合(如红色与蓝色混合为紫色)。请注意我们已将问题翻译为英文以便读者理解,模型实际处理的是日语文本。
2.1 Merging in the Parameter Space
2.1 参数空间中的模型融合
Model merging in the parameter space (PS) aims to integrate the weights of multiple foundational models into a unified entity with the same neural network architecture, yet outperforming the individual models. While various strategies for combining model parameters exist [42, 62], our approach leverages task vectors analysis to understand each model’s strengths, based on the specific tasks they are optimized for or excel in [28]. Specifically, we enhance TIES-Merging with DARE [64, 66], allowing for more granular, layer-wise merging (In this paper, by “layer” we mean the input/output embedding layers or a transformer block). We establish merging configuration parameters for sparsification and weight mixing at each layer, including input and output embeddings. These configurations are then optimized using an evolutionary algorithm, such as CMA-ES [24], for selected tasks, guided by critical task-specific metrics (e.g., accuracy for MGSM, ROUGE score for VQA).
参数空间(PS)中的模型融合旨在将多个基础模型的权重整合为具有相同神经网络架构的统一实体,同时性能超越单个模型。尽管存在多种模型参数组合策略[42, 62],我们的方法基于任务向量分析来理解每个模型在其优化或擅长的特定任务中的优势[28]。具体而言,我们通过 DARE 方法[64, 66]增强 TIES-Merging 技术,实现更细粒度的分层融合(本文中"层"指输入/输出嵌入层或 Transformer 模块)。我们为每层(包括输入输出嵌入层)建立稀疏化与权重混合的融合配置参数,并采用 CMA-ES 等进化算法[24]针对选定任务优化这些配置,优化过程由关键任务指标引导(如 MGSM 任务采用准确率,VQA 任务采用 ROUGE 分数)。
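Conceptually, the PS search can be pictured as the loop below: an evolutionary optimizer proposes per-layer DARE-TIES configuration parameters (e.g., densities and merge weights), a candidate merged model is built, and the task metric is returned as fitness. Everything here, including the use of the `cma` package and the `build_merged_model` / `evaluate_on_task` helpers, is an illustrative assumption rather than our exact pipeline.

```python
import numpy as np
import cma  # pycma, assumed here as the CMA-ES implementation

NUM_LAYERS = 32          # e.g., transformer blocks (plus embeddings) per source model
NUM_MODELS = 3           # source models to merge
PARAMS_PER_LAYER = 2     # e.g., a DARE density and a TIES merge weight per model and layer

def fitness(genome: np.ndarray) -> float:
    # Interpret the flat genome as a per-layer, per-model merging configuration in [0, 1].
    config = genome.reshape(NUM_LAYERS, NUM_MODELS, PARAMS_PER_LAYER)
    merged = build_merged_model(source_models, config)  # hypothetical helper: apply DARE-TIES per layer
    return -evaluate_on_task(merged, search_set)        # hypothetical helper; negated because CMA-ES minimizes

dim = NUM_LAYERS * NUM_MODELS * PARAMS_PER_LAYER
es = cma.CMAEvolutionStrategy(dim * [0.5], 1.0 / 6.0)   # start at 0.5 with sigma = 1/6
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [fitness(np.asarray(c)) for c in candidates])
best_config = np.asarray(es.result.xbest).reshape(NUM_LAYERS, NUM_MODELS, PARAMS_PER_LAYER)
```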
2.2 Merging in the Data Flow Space
2.2 数据流空间融合
Recent analysis and discoveries imply that knowledge is stored distributedly in language models [21, 44, 43], suggesting simple yet novel model merging possibilities in the data flow space (DFS). Unlike merging in PS, model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the $i$-th layer in model $A$, a token may be directed to the $j$-th layer in model $B$.
最新分析与发现表明,知识分布式存储在语言模型中[21, 44, 43],这为数据流空间(DFS)中实现简单而新颖的模型融合提供了可能。与 PS 融合不同,DFS 中的模型融合保持各层原始权重不变,转而优化 token 在神经网络中流动时的推理路径。例如,模型 $A$ 的第 $i$ 层处理后,token 可能被导向模型 $B$ 的第 $j$ 层。
In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of $N$ models and a budget $T$, our method searches for a sequence of layer indices $L_{i,j}^{(t)}$ that delineates the path all the tokens should follow for a specific task. Here $L_{i,j}$ denotes the $j$-th layer in the $i$-th model, with $t \in [1, T]$ marking the step in the inference path.
在此领域的初步探索中,我们仅研究串行连接与非自适应配置,将更灵活的模型融合方案留待未来工作。具体而言,给定 $N$ 个模型和预算 $T$,我们的方法搜索能描述特定任务中所有 token 应遵循路径的层索引序列 $L_{i,j}^{(t)}$。其中 $L_{i,j}$ 表示第 $i$ 个模型的第 $j$ 层,$t \in [1, T]$ 标记推理路径的步骤。
One can quickly imagine how large the search space is. Assuming the total number of layers across all models is $M$, the size of the search space is $(M+1)^T$; here the extra one indicates the inclusion of a pass-through layer. Even with a modest setting of $M = 64$ (e.g., 2 models of 32 layers each) and $T = 64$, this setting translates to an astronomically large search space, a challenge even for a capable evolutionary search algorithm. Luckily, our preliminary studies indicated that certain layer arrangements, particularly repetitive or permuted sequences from earlier in the model, can adversely affect performance. Based on this, we modify our settings to include an indicator array $I$ of size $T = M \times r$ in the evolutionary search space, here $r$ is the number of repetitions.
搜索空间的庞大程度可见一斑。假设所有模型总层数为 $M$,搜索空间规模达 $(M+1)^T$(额外加 1 表示包含直通层)。即使采用保守设置 $M = 64$(如 2 个 32 层模型)和 $T = 64$,其对应的搜索空间仍呈天文数字,这对强大的进化搜索算法也是挑战。值得庆幸的是,初步研究表明某些层排列(特别是模型中早期重复或置换的序列)会损害性能。据此,我们在进化搜索空间中引入大小为 $T = M \times r$ 的指示数组 $I$($r$ 为重复次数)来优化设置。
Conceptually, we lay out all the layers in sequential order (i.e., all layers in the $i$-th model followed by those in the $(i+1)$-th model) and repeat them $r$ times; the indicator array then manages the inclusion/exclusion of layers. If $I_i > 0$ we include the layer corresponding to index $i$ in the slots in the merged model, otherwise we exclude it. Consequently, our search space is reduced to $2^T$, which is still large, but tractable for evolutionary search.
从概念上讲,我们将所有层按顺序排列(即先排列第 $i$ 个模型的所有层,再排列第 $i+1$ 个模型的层),并重复 $r$ 次,指示数组则控制各层的包含/排除状态。若 $I_i > 0$,我们会在合并模型的对应槽位中包含索引 $i$ 指向的层,否则将其排除。由此,搜索空间被缩减至 $2^T$,虽然规模仍然庞大,但已处于进化算法可处理范围内。
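To illustrate the reduced search space, the sketch below decodes an indicator array $I$ of length $T = M \times r$ into an inference path over the concatenated, $r$-times-repeated layer list; the function name and the real-valued genome encoding are illustrative assumptions.

```python
import numpy as np

def decode_path(indicator: np.ndarray, layers_per_model: list[int], r: int) -> list[tuple[int, int]]:
    """Map an indicator array of size T = M * r to a sequence of (model_idx, layer_idx) hops."""
    # Lay out all layers sequentially: model 0's layers, then model 1's, ..., repeated r times.
    slots = [(m, l) for m, n in enumerate(layers_per_model) for l in range(n)] * r
    assert len(indicator) == len(slots)
    # Include slot i only when its indicator entry is positive; excluded slots are skipped (pass-through).
    return [slot for slot, gene in zip(slots, indicator) if gene > 0]

# Example: two 32-layer models repeated r = 3 times gives T = 192 inclusion decisions (2^192 paths),
# versus (M + 1)^T for freely choosing any of the M layers (or a pass-through) at every step.
path = decode_path(np.random.randn(192), [32, 32], r=3)
```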
In our search, we only optimize the data inference path inside the merged model and keep parameters in the models intact. In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies [58] show that swapping a pair of neighboring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically we find that appropriately scaling an input that wishes to go from layer $i$ to $j$ by $W_{ij}$ helps alleviate the problem. Here, $W \in \mathbb{R}^{M \times M}$ is a matrix that is also optimized by the evolutionary search together with the indicator array $I$.
在搜索过程中,我们仅优化合并模型内部的数据推理路径,保持各模型参数不变。在此设定下,某一层可能接收到与原始模型训练分布不同的输入,从而导致异常输出。例如,我们前期研究[58]表明:交换语言模型中相邻两层的顺序会导致性能下降。尽管需要更多理论研究来建模这种分布偏移,但实证发现通过 $W_{ij}$ 对从第 $i$ 层流向第 $j$ 层的输入进行适当缩放可缓解该问题。其中缩放矩阵 $W \in \mathbb{R}^{M \times M}$ 将与指示数组 $I$ 一同通过进化算法优化。
The size of $W$ grows quadratically with $M$, for scenarios involving a large number of layers. An alternative approach to contain the search space size involves parameterizing $W$ with a neural network [52, 23]. We can instead evolve a feed-forward network to output the scaling weights conditioned on the layer and step indices: $W_{ij} = \pi_{\theta}(i, j, t)$, where $\theta$ denotes the parameters to be evolved, whose size does not change when $M$ grows.
当涉及大量层级时,$W$ 的规模会随 $M$ 呈二次方增长。为控制搜索空间规模,可采用神经网络参数化方法[52,23]处理 $W$。具体而言,可进化一个前馈网络来生成基于层索引和步进索引的条件缩放权重:$W_{ij} = \pi_{\theta}(i, j, t)$,其中待进化参数 $\theta$ 的规模不会随 $M$ 增长而变化。
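A minimal sketch of the two parameterizations of the scaling weights follows: a dense matrix $W$ (quadratic in $M$) versus a small feed-forward network $\pi_{\theta}(i, j, t)$ whose parameter count stays fixed as $M$ grows. The network architecture and normalization of the indices are assumptions for illustration.

```python
import torch
import torch.nn as nn

M = 64  # total number of layers across the source models

# Option 1: a dense matrix of scaling factors, one per (source layer i, target layer j) pair.
# Its size grows as O(M^2), so it is practical to evolve directly only for small M.
W = torch.ones(M, M)

# Option 2: a small feed-forward network pi_theta(i, j, t) with a fixed number of parameters.
class ScalerNet(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, i: int, j: int, t: int) -> torch.Tensor:
        # Normalize the layer indices so the inputs stay in a small range.
        x = torch.tensor([[i / M, j / M, float(t)]])
        return self.net(x).squeeze()

scaler = ScalerNet()
w_ij = scaler(i=3, j=40, t=1)  # scaling applied to hidden states hopping from layer 3 to layer 40
```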
2.3 Merging in Both Spaces
2.3 双空间合并
Model merging in the PS and in the DFS are orthogonal approaches, however, it is straightforward to combine these disentangled methods and further boost the performance of a merged model. As we show in the rightmost illustration in Figure 1 and in Section 3.1, it is possible to first apply PS merging to a collection of models, and then put back this merged model in the collection and apply DFS merging from this enlarged collection.
在参数空间(PS)和数据流空间(DFS)进行模型合并虽是正交方法,但将这两种解耦方法相结合可进一步提升合并模型的性能。如图 1 最右侧示意图及第 3.1 节所示,我们可以先对模型集合应用 PS 合并,再将合并后的模型放回原集合,对这个扩充后的集合实施 DFS 合并。
This can be extremely helpful when one considers model merging with multiple objectives, wherein PS merging can be applied first to produce several merged models, each of which targets one of the multiple objectives of interest, and then DFS merging is applied with multi-objective genetic algorithms such as NSGA-II [17] to further expand the final model's performance in relevant metrics.
该方法对多目标模型合并尤为有效:首先通过 PS 合并生成多个分别针对不同目标任务的合并模型,继而采用 NSGA-II[17]等多目标遗传算法进行 DFS 合并,从而在相关指标上全面提升最终模型的性能。
3 Results 3 实验结果
Most merged models in the community optimize for a narrow set of tasks defined in The Open LLM Leaderboard [27]. Our motivation is for evolutionary search to discover novel ways to merge different models from vastly different domains (e.g., non-English language and Math, or non-English language and Vision) which might be difficult for human experts to discover effective merging solutions themselves. Furthermore, effectively merging models from very different domains can lead to models of wider real-world applicability and enable us to develop models beyond the large population of models that are optimized for the narrow range of tasks defined by a leaderboard.
当前社区中大多数合并模型仅针对《Open LLM Leaderboard》[27]定义的有限任务集进行优化。我们希望通过进化搜索发现跨领域模型(如非英语语言与数学,或非英语语言与视觉领域)的创新合并方案——这些方案往往超出人类专家手动设计的有效性边界。更重要的是,有效合并差异显著的领域模型能增强实际应用泛化能力,推动模型开发突破排行榜定义的任务局限。
We demonstrate our Evolutionary Model Merge approach described in Section 2 by evolving a Japanese LLM capable of Math reasoning, and a Japanese VLM proficient in handling culturally-specific content.
Specifically, in Section 3.1, we apply evolution to merge a Japanese LLM with an English Math LLM to build a Japanese Math LLM, and in Section 3.3, we apply evolution to merge a Japanese LLM with an English VLM to create a Japanese VLM.
我们通过两个案例验证第 2 章所述的进化模型合并方法:构建具备数学推理能力的日语 LLM,以及擅长处理文化特定内容的日语视觉语言模型(VLM)。具体而言,在第 3.1 节中进化合并日语 LLM 与英语数学 LLM 生成日语数学 LLM;在第 3.3 节中进化合并日语 LLM 与英语 VLM 构建日语 VLM。
3.1 Evolving Japanese Math LLM
3.1 日语数学 LLM 的进化
3.1.1 Experimental Setup 3.1.1 实验设置
Source Models 源模型
To develop a model capable of solving math problems in Japanese, we apply evolutionary model merge on a set of source models containing a Japanese LLM and Math LLMs: shisa-gamma-7b-v1 [10] (Japanese LLM), WizardMath-7B-V1.1 [41] and Abel-7B-002 [13]. All these models are fine-tuned from Mistral-7B-v0.1 [29].
为开发能够解决日语数学问题的模型,我们在包含日语 LLM 和数学 LLM 的一组源模型上应用了进化模型融合:shisa-gamma-7b-v1 [10](日语 LLM)、WizardMath-7B-V1.1 [41] 和 Abel-7B-002 [13]。这些模型均基于 Mistral-7B-v0.1 [29]进行微调。
Dataset 数据集
For testing, we used the MGSM dataset [53], a multilingual translation of a subset of the GSM8k dataset [14]. The Japanese test set of MGSM, consisting of 250 samples, was used for the final evaluation. Specifically, MGSM contains translations of the first 250 samples (ID 0-249) from the GSM8k test set. We used a different dataset for evolutionary search to avoid overfitting the test set. Specifically, we translated the remaining 1069 samples (out of 1319 examples) of the GSM8k test set that were not included in the MGSM test set into Japanese. These correspond to samples with IDs 250-1318 in the original GSM8k test set, ensuring no overlap with the MGSM Japanese test set. One may think that it is more natural to translate the GSM8k training set. However, in our preliminary efforts, this approach did not work well. Since open-source math models were trained on the GSM8k training set, we were unable to perform accurate evaluations.
测试阶段采用 MGSM 数据集[53](GSM8k 数据集[14]的多语言翻译子集),其中包含 250 个样本的日语测试集用于最终评估。具体而言,MGSM 包含 GSM8k 测试集前 250 个样本(ID 0-249)的翻译版本。为避免过拟合测试集,我们使用不同数据集进行进化搜索:将 GSM8k 测试集中未纳入 MGSM 测试集的剩余 1069 个样本(共 1319 例)翻译为日语。这些样本对应原始 GSM8k 测试集中 ID 250-1318 的样本,确保与 MGSM 日语测试集无重叠。虽然翻译 GSM8k 训练集看似更合理,但初步尝试表明该方法效果不佳。由于开源数学模型均在 GSM8k 训练集上训练,我们无法进行准确评估。
Evaluation 评估
We evaluated the ability to generate Japanese answers to Japanese math problems.
Therefore, we considered an answer correct if it met the following criteria: (1) the concluding numerical value must be correct, and (2) the reasoning text should be written in Japanese.
我们评估了模型生成日语数学问题日语解答的能力。因此,只要满足以下标准即判定答案正确:(1) 最终数值必须正确;(2) 推理解释文本需使用日语书写。
We treated the last numerical value appearing in the output as the answer. We needed to adopt this heuristic because we are merging multiple models that were trained in different formats, which made it difficult to correct the output format. This method appeared to extract the answers correctly in almost all cases. Additionally, to determine the language of the output, we utilized fasttext [31, 30]. We used greedy sampling for generation and calculated the zero-shot pass@1 accuracy.
我们将输出中最后出现的数值视为答案。由于合并了多种训练格式不同的模型,难以统一校正输出格式,故采用此启发式方法。该方案在绝大多数情况下能正确提取答案。此外,我们使用 fasttext[31,30]检测输出文本语言,采用贪婪采样生成答案,并计算零样本 pass@1 准确率。
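Our scoring heuristic can be sketched as follows; the regex and the fastText model filename are illustrative (the off-the-shelf lid.176 language-identification model is one commonly used option), and the prompting/generation code is omitted.

```python
import re
import fasttext  # pip install fasttext

lang_id = fasttext.load_model("lid.176.bin")  # assumed path to an off-the-shelf language-ID model

def last_number(text: str) -> str | None:
    """Treat the last numerical value appearing in the generation as the model's answer."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def is_correct(generation: str, gold: str) -> bool:
    pred = last_number(generation)
    labels, _ = lang_id.predict(generation.replace("\n", " "))
    in_japanese = labels[0] == "__label__ja"  # criterion 2: reasoning text written in Japanese
    return pred is not None and float(pred) == float(gold) and in_japanese  # criterion 1: value matches
```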
Optimization 优化方法
For optimization in PS, we used the CMA-ES [24] algorithm implemented in Optuna [9] with default hyper-parameters. Specifically, we set all initial parameter values to 0.5, sigma to 1/6, and the population size to $4 + \lfloor 3 \ln(N_{\mathrm{param}}) \rfloor$, where $N_{\mathrm{param}}$ is the number of parameters to optimize.
The fitness value is defined as the accuracy for all 1069 training samples. Please note that this set is disjoint from MGSM’s test set.
The optimization was conducted for 1000 trials, and the best trial with respect to the training accuracy was chosen as the final model.
We decided to employ TIES-Merging [64] with DARE [66] through preliminary experiments, and optimized its parameters.
参数空间优化采用 Optuna[9]实现的 CMA-ES[24]算法,保留默认超参数。具体设置:所有参数初始值为 0.5,sigma=1/6,种群规模为 $4 + \lfloor 3 \ln(N_{\mathrm{param}}) \rfloor$($N_{\mathrm{param}}$ 表示待优化参数数量)。适应度定义为 1069 个训练样本的整体准确率(注:该数据集与 MGSM 测试集无交集)。经过 1000 次试验后,选择训练准确率最高的方案作为最终模型。根据前期实验,我们确定采用结合 DARE[66]的 TIES-Merging[64]方法并优化其参数。
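The PS optimization loop in Optuna can be sketched roughly as below; the sampler arguments mirror the settings described above (initial values 0.5, sigma 1/6, CMA-ES default population size), while `build_and_score` is a hypothetical stand-in for merging with the sampled DARE-TIES parameters and computing training-set accuracy.

```python
import optuna

N_PARAMS = 6  # e.g., two DARE-TIES parameters (density, weight) per source model

def objective(trial: optuna.Trial) -> float:
    params = [trial.suggest_float(f"x{i}", 0.0, 1.0) for i in range(N_PARAMS)]
    return build_and_score(params)  # hypothetical: merge with these parameters, return accuracy

sampler = optuna.samplers.CmaEsSampler(
    x0={f"x{i}": 0.5 for i in range(N_PARAMS)},  # start all parameters at 0.5
    sigma0=1.0 / 6.0,
)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=1000)
best = study.best_trial  # the configuration used for the final merged model
```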
In our DFS merging experiments, $M = 64$, $r = 3$, and consequently, $T = M \times r = 192$.
We kept the last 200 examples in the training data as our validation set and optimize on the rest of the data with a batch size of 200. We report the performance of the snapshot that achieved the highest accuracy in the validation set, and the test set is strictly isolated from the optimization process.
We adopted CMA-ES in EvoJAX [60]; it optimized $I$ and $W$ for a total of 100 generations with a population size of 128, and we used the default hyper-parameters.
We limited our DFS merging to two models $A$ and $B$ to ensure that the final model remains modest in size and can be run on a single GPU, but in principle, the methodology can scale to merging multiple models.
During the merging, model $A$'s tokenizer and input/output embeddings are utilized. Furthermore, to maintain compatibility with the embedding layers, we mandate that the initial and final transformer layers of model $A$ define the start and the end of the inference path. We initialized the indicator array $I$ so that all layers in model $A$ are more likely to be included as initial hops in the inference path to shorten the search time.
DFS 合并实验中,$M = 64$,$r = 3$,因而 $T = M \times r = 192$。我们将训练数据最后 200 例作为验证集,其余数据以 200 为批次进行优化,最终报告验证集准确率最高的快照性能(测试集严格隔离于优化过程)。采用 EvoJAX[60]的 CMA-ES 算法,默认超参数下优化 $I$ 和 $W$ 共 100 代(种群规模 128)。为保证最终模型体积适中(可单 GPU 运行),DFS 合并仅针对 $A$ 和 $B$ 两个模型,但该方法理论上支持多模型合并。合并过程中使用模型 $A$ 的分词器及输入/输出嵌入层,并要求模型 $A$ 的初始与最终变换层分别定义推理路径的起点和终点以保持嵌入层兼容性。初始化指示数组 $I$ 时,优先包含模型 $A$ 的层作为推理路径初始跳转以缩短搜索时间。
3.1.2 Experimental Results
3.1.2 实验结果
Table 1 summarizes the performance of the LLMs on Japanese Math and overall Japanese benchmark tasks. The MGSM-JA column reports the results from the MGSM test set, employing the previously described metrics. The Japanese language model (Model 1) demonstrates limited mathematical proficiency, while the Math models (Models 2 and 3), though mathematically adept, show insufficient command of the Japanese language. Consequently, all three models achieve low scores on MGSM-JA, with accuracy at or below 30.0.
表 1 总结了 LLMs 在日本数学及整体日语基准任务上的表现。MGSM-JA 列展示了采用前述指标的 MGSM 测试集结果。日语语言模型(模型 1)展现出有限的数学能力,而数学专用模型(模型 2 和 3)虽具备数学优势,却表现出日语能力的不足。因此,三个模型在 MGSM-JA 测试中均得分较低,准确率均≤30.0。
Table 1: Performance comparison of the LLMs. Models 1–3 are the source models, Models 4–6 are our optimized merged models, and Models 7–11 are provided for reference. MGSM-JA measures Japanese math ability, and JP-LMEH evaluates general Japanese language capability, with a detailed breakdown provided in Table 2.
表 1:LLMs 性能对比。模型 1–3 为源模型,模型 4–6 为我们的优化合并模型,模型 7–11 为参考模型。MGSM-JA 用于衡量日语数学能力,JP-LMEH 评估通用日语能力,详细分类见表 2。
# | Model | Type 类型 | Size 大小 | MGSM-JA (acc ) | JP-LMEH (avg ) |
0 | Mistral-7B-v0.1 | EN general EN 通用 | 7B | 2.0 | 55.7 |
1 | Shisa Gamma 7B v1 | JA general JA 通用 | 7B | 9.6 | 66.1 |
2 | WizardMath 7B v1.1 | EN math EN 数学 | 7B | 18.4 | 60.1 |
3 | Abel 7B 002 | EN math EN 数学 | 7B | 30.0 | 56.5 |
4 | Ours (PS) 我们的方法 (PS) | 1 + 2 + 3 | 7B | 52.0 | 70.5 |
5 | Ours (DFS) 我们的方法 (DFS) | 3 + 1 | 10B | 36.4 | 53.2 |
6 | Ours (PS+DFS) 我们的方法 (PS+DFS) | 4 + 1 | 10B | 55.2 | 66.2 |
7 | Llama 2 70B | EN general EN 通用 | 70B | 18.0 | 64.5 |
8 | Japanese StableLM 70B 日本 StableLM 70B | JA general 日语通用 | 70B | 17.2 | 68.3 |
9 | Swallow 70B | JA general 日语通用 | 70B | 13.6 | 71.5 |
10 | GPT-3.5 | commercial 商业用途 | - | 50.4 | - |
11 | GPT-4 | commercial 商业用途 | - | 78.8 | - |
In contrast, our merged models (Models 4–6) manifest a substantial elevation in performance. Notably, the model merged in PS (Model 4) achieves an impressive score of 52.0, highlighting the remarkable potential in combining models with distinct expertises. The DFS-merged model (Model 5) also shows a performance enhancement, with an over 6 percent increase in accuracy compared to the source models. While the leap in performance is not as pronounced as with PS merging, it still proves to be a valid and orthogonal approach. Finally, our hybrid model (Model 6), which integrates both merging strategies, shows further enhancements on the task. The order of the source models in the indicator array $I$ affects the performance of the DFS merging method. We conduct experiments with all possible combinations and report the best scores in Table 1; we defer detailed analysis to Sec 3.2.5.
相比之下,我们合并的模型(模型 4-6)表现出显著的性能提升。值得注意的是,通过 PS 方法合并的模型(模型 4)取得了 52.0 分的优异成绩,这凸显了整合不同专业领域模型的巨大潜力。采用 DFS 方法合并的模型(模型 5)同样展现出性能增强,与源模型相比准确率提升超过 6%。虽然其性能飞跃不如 PS 合并法显著,但仍是有效且正交的解决方案。最终,我们融合两种策略的混合模型(模型 6)在任务中实现了进一步的性能提升。指示数组 $I$ 中源模型的排列顺序会影响 DFS 合并法的性能表现。我们测试了所有可能的组合方案,表 1 中呈现的是最佳得分,详细分析将在 3.2.5 节展开。
Figure 2 gives an overview of the five models’ “answer sheet” on the math problems. Our merged models retain the foundational knowledge in the source models, as evidenced by the similar score patterns on problems 1–15. Moreover, they exhibit emergent capabilities, successfully tackling problems that stumped the source models (e.g., problems 20–30). Evidently, by effectively integrating a Japanese LLM and mathematical models, we have succeeded in producing models that are proficient in both Japanese language understanding and mathematical problem-solving.
图 2 直观展示了五款模型在数学题"答题卡"上的表现。合并模型完整继承了源模型的基础知识(如第 1-15 题相似的得分模式),同时展现出突破性的新兴能力——成功解决了源模型未能攻克的难题(如第 20-30 题)。实验证明,通过有效整合日语 LLM 与数学专用模型,我们成功培育出兼具日语理解与数学解题能力的复合型模型。

Figure 2: Performance overview. This figure shows the models' performance on the MGSM-JA task: the 250 test problems are arranged by problem ID along the horizontal axis, and colored markers indicate correctly answered problems.
图 2:性能总览。本图展示了各模型在 MGSM-JA 任务中的表现,横轴按题号排列 250 道测试题,正确答题位置用彩色标记标注。
Furthermore, Table 2 presents the results of evaluating the general Japanese language ability using the Japanese Language Model Evaluation Harness (JP-LMEH) benchmark suite. This benchmark suite consists of nine tasks, and the average score across these tasks is widely used as an indicator of overall Japanese language proficiency. Our models achieve remarkably high scores of 70.5 and 66.2, surpassing the source models and even the previous state-of-the-art 70B parameter Japanese LLM (specifically, Japanese StableLM 70B), despite having only 7B to 10B parameters.
此外,表 2 展示了使用日语语言模型评估套件(JP-LMEH)基准测试对通用日语能力进行评估的结果。该基准套件包含九项任务,其平均分被广泛用作衡量整体日语能力的指标。尽管我们的模型仅有 70 亿至 100 亿参数,却取得了 70.5 和 66.2 的显著高分,不仅超越了源模型,甚至超过了先前拥有 700 亿参数的日语 LLM(即 Japanese StableLM 70B)的最先进水平。
Compared to the source Japanese model (shisa-gamma-7b-v1), it is interesting to note that there is not only a significant improvement in the MGSM scores but also improvements in other tasks such as JSQuAD and JAQKET. However, we also acknowledge that there are certain tasks, such as JNLI and MARC (for DFS), where performance has dropped. It should be noted that the MGSM scores here do not match those in Table 1, due to the differences in evaluation protocols (few-shot, prompting, etc.). Additional details are provided in Appendix A, including extensive comparisons with other models in Table 6.
与源日语模型(shisa-gamma-7b-v1)相比,值得注意的是不仅在 MGSM 分数上有显著提升,在 JSQuAD 和 JAQKET 等其他任务中也有所改进。然而我们也发现,在 JNLI 和 MARC(DFS 相关)等特定任务上性能有所下降。需要说明的是,由于评估协议(少样本学习、提示方式等)的差异,此处的 MGSM 分数与表 1 中的数据并不一致。附录 A 提供了更多细节,包括表 6 中与其他模型的全面对比。
Moreover, Appendix C showcases intriguing examples that demonstrate the utility of our models merged using evolution. The merged models correctly answered questions that require both knowledge about Japanese-culture and Math ability. In contrast, even if such Japanese questions were translated into English and answered in English, English math models would likely fail to provide the correct answers as they may not be aware of Japanese culture-specific context in the questions.
此外,附录 C 展示了一些引人入胜的案例,证明了通过进化合并的模型具有实际应用价值。这些合并模型能正确回答需要同时具备日本文化知识和数学能力的问题。相比之下,即使将这类日语问题翻译成英文并用英语回答,英语数学模型也可能因缺乏对问题中日本文化特定背景的理解而无法给出正确答案。
Table 2: Breakdown of JP-LMEH scores for Japanese language ability. JP-LMEH (Japanese Language Model Evaluation Harness) is a benchmark suite consisting of 9 tasks, and the average score (Avg column) is used as an indicator of overall Japanese language proficiency. Full results are in Table 6.
表 2:日语能力 JP-LMEH 分数细分表。JP-LMEH(日语语言模型评估套件)是由 9 项任务组成的基准测试套件,其平均分(Avg 列)被用作整体日语能力的指标。完整结果见表 6。
JP Language Model Evaluation Harness 日语语言模型评估套件
Model | Size 参数量 | JComQA | JNLI | MARC | JSQuAD | JAQKET | XLSum | XWino | MGSM | JCoLA | Avg 平均分 |
Shisa Gamma 7b v1 | 7B | 91.2 | 72.1 | 94.6 | 73.9 | 68.0 | 25.9 | 80.5 | 29.6 | 58.7 | 66.1 |
WizardMath 7B V1.1 | 7B | 74.7 | 42.7 | 90.4 | 84.6 | 68.5 | 22.3 | 69.8 | 38.8 | 48.9 | 60.1 |
Abel 7B 002 | 7B | 70.3 | 51.8 | 62.3 | 83.8 | 69.0 | 22.5 | 68.2 | 28.0 | 52.7 | 56.5 |
Ours (PS) 本模型(PS) | 7B | 89.1 | 65.7 | 95.4 | 89.5 | 77.7 | 25.5 | 81.2 | 50.0 | 60.5 | 70.5 |
Ours (DFS) 我们的方法(DFS) | 10B | 67.7 | 58.2 | 53.5 | 66.8 | 54.3 | 17.3 | 65.6 | 30.0 | 65.6 | 53.2 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 10B | 88.2 | 50.3 | 91.5 | 78.6 | 77.8 | 23.2 | 73.0 | 40.0 | 73.0 | 66.2 |
Llama 2 70B | 70B | 80.2 | 53.4 | 94.4 | 91.6 | 80.1 | 21.8 | 73.6 | 30.4 | 54.6 | 64.5 |
Japanese Stable LM 70B | 70B | 91.2 | 50.4 | 92.9 | 87.1 | 88.4 | 24.3 | 82.0 | 37.2 | 61.7 | 68.3 |
Swallow 70B | 70B | 95.3 | 57.2 | 91.7 | 94.1 | 93.9 | 23.1 | 83.3 | 45.2 | 59.5 | 71.5 |
3.1.3 Analysis 3.1.3 分析
In our exploration of model merging in the parameter space, we experimented with diverse configurations, such as varying the assignment of merging parameters across different layer groups. However, due to a constrained dataset, we did not witness notable improvements in performance correlating with increase in configuration complexity. Consequently, we focused our reporting on a PS merged model (Model 4 in Table 1) that adopts the simplest setting: considering each source model as a singular layer and allocating two DARE-TIES associated parameters to each for evolutionary merging. Figure 3 illustrates the evolved parameter configuration post PS merging.
在参数空间模型融合的探索中,我们尝试了多样化配置(例如对不同层组分配不同融合参数)。但由于数据集规模有限,未能观察到配置复杂度提升带来的显著性能改善。因此我们最终选择报告采用最简单设置的 PS 融合模型(表 1 中的模型 4):将每个源模型视为单一层级,并为每个模型分配两个 DARE-TIES 关联参数进行进化融合。图 3 展示了 PS 融合后的进化参数配置结果。
The CMA-ES optimization results reveal that all three models are important, as suggested by the uniformity of the optimized weighting values.
The fact that the sum of the weights exceeds 1, approaching 2, is noteworthy. This suggests that a combination method which amplifies the contributions of the models, rather than a simple interpolation, proved to be more effective.
CMA-ES 优化结果表明三个模型都具有重要性,这体现在优化权重值的均匀分布上。值得注意的是权重总和超过 1 且接近 2 的现象,这表明采用放大模型贡献的组合方法(而非简单插值)被证明更为有效。
The dominant density from the Japanese LLM suggests its critical contribution to solving the task. We conjecture that this may also be partially attributed to the Japanese LM’s larger amount of fine-tunings from the Mistral base model.
Japanese LMs based on English models, such as Shisa-Gamma-7B-v1, are typically created through a two-step process: continued pretraining and instruction fine-tuning. The continued pretraining phase involves learning from a significantly larger dataset compared to standard fine-tuning. For instance, Shisa-Gamma-7B-v1 is based on a Japanese base model that underwent continued pretraining on 100B tokens of Japanese text. The resulting differences in weights between Shisa-Gamma-7B-v1 and the original Mistral-7B-v1 likely encapsulate more information than standard fine-tuning, making them more challenging to sparsify.
In line with the discussion in [66, Section 4.6], the sparsification of DARE tends to degrade performance when applied to such extensively fine-tuned models. Our evolutionary search has seemingly managed to address this issue by increasing the density for the Japanese LM.
日本 LLM 的主导密度表明其对任务解决的关键贡献。我们推测这可能部分归因于日本语言模型基于 Mistral 基础模型进行了更大量的微调。基于英语模型的日本语言模型(如 Shisa-Gamma-7B-v1)通常通过两阶段流程创建:持续预训练和指令微调。其中持续预训练阶段使用的数据量远超常规微调,例如 Shisa-Gamma-7B-v1 的日语基础模型曾接受过 1000 亿日语 token 的持续预训练。这使得 Shisa-Gamma-7B-v1 与原始 Mistral-7B-v1 的权重差异包含比标准微调更丰富的信息,导致稀疏化更为困难。如[66,第 4.6 节]所述,对此类深度微调模型应用 DARE 稀疏化往往会降低性能。我们的进化搜索通过提高日本语言模型的密度,似乎成功解决了这一问题。
The parameter settings derived from our PS merging experiments align well with outcomes from our DFS merging efforts. By incorporating the PS-merged model into our pool of source models and applying DFS merging across all potential pairings, we observed optimal performance with the combination of the PS-merged model and the Japanese language model (Model 6 in Table 1). This finding echoes the significant influence of the Japanese language model, as indicated by its notable presence in Figure 3, and reaffirms the substantial promise of evolutionary model merging.
PS 融合实验得出的参数设置与 DFS 融合结果高度吻合。将 PS 融合模型加入源模型池并对所有可能配对进行 DFS 融合后,我们发现 PS 融合模型与日语语言模型的组合(表 1 中的模型 6)表现最优。这一发现既呼应了图 3 中日语语言模型的显著影响力,也再次验证了进化模型融合的巨大潜力。

Figure 3: Evolved configurations for PS merging. Although the weights across the three source models are similar, the notably higher density assigned to the Japanese LLM highlights its critical role.
图 3:PS 合并的进化配置。尽管三个源模型的权重相似,但日本 LLM 显著的密度凸显了其关键作用。
Figure 4 displays the evolution of the inference path, where our approach consistently recognized the value of initial steps, incorporating every layer from the first model (our PS-merged model), except for the last decoding layer and the embedding layer. As the process advanced, the method refined the selection to a smaller, more effective set of layers and strategically alternated between layers from both contributing models. Significantly, the scaling parameters $W_{ij}$ emerged as crucial elements; our ablation studies revealed that eliminating them in the evolved model (e.g., by setting $W_{ij} = 1$) led to a performance decline exceeding 20 percent, highlighting their importance in the model's efficacy.
图 4 展示了推理路径的演变过程,我们的方法始终认识到初始步骤的价值,除了解码最后一层和嵌入层外,纳入了第一个模型(即我们的 PS 合并模型)的每一层。随着进程推进,该方法将选择精炼至更小、更有效的层集,并策略性地交替使用两个贡献模型的层。值得注意的是,缩放参数 $W_{ij}$ 成为关键要素——我们的消融研究表明,若在进化模型中移除这些参数(例如设为 $W_{ij} = 1$),会导致性能下降超过 20%,这凸显了它们对模型效能的重要性。

Figure 4: Evolved configurations for DFS merging of models A and B. The three sub-figures depict the evolution of the inference path on the MGSM-JA task. The y-axis denotes the layer index $\ell$ and the x-axis the path index $t$. Blue markers represent path steps that use layers from model A, and red markers indicate layers from B. The shade of each marker reflects the magnitude of the scaling factor $W_{ij}$. The evolutionary search included most layers of A early on, and then alternated between layers from both models. The results are from our 10B model (PS+DFS).
图 4:模型 A 与 B 的 DFS 合并进化配置。三幅子图描绘了 MGSM-JA 任务上推理路径的演变。y 轴表示层索引 $\ell$,x 轴对应路径索引 $t$。蓝色标记代表使用模型 A 层的路径步骤,红色标记表示来自 B 的层。标记颜色深浅反映缩放因子 $W_{ij}$ 的强度。进化搜索结果早期包含 A 的大部分层,随后在两组模型层间交替选择。结果来自我们的 100 亿参数模型(PS+DFS)。
3.2 Method Behavior Analysis
3.2 方法行为分析
In this section, we conduct a more in-depth analysis of our method’s performance. We compare it with multiple types of baselines, experimentally analyze the effect of manual model selection, verify the scalability of our proposed approach to larger models, and show results from ablation studies.
本节我们将更深入分析方法的性能表现。通过与多种基线对比,实验性评估人工模型选择的影响,验证所提方法在大模型上的可扩展性,并展示消融研究结果。
3.2.1 Comparison with Unoptimized Model Merging
3.2.1 与非优化模型合并的对比
Table 3 presents the performance of models created through unoptimized model merging. We utilized three methods: TIES-Merge [64], DARE-TIES [66], and Frankenmerging [22].
We used mergekit to establish the baselines.
Since mergekit does not include default hyperparameters, we adopted those from a widely recognized resource for learning about model merging and mergekit [35].
表 3 呈现了非优化模型合并的性能表现。我们采用三种方法:TIES-Merge[64]、DARE-TIES[66]和 Frankenmerging[22],使用 mergekit 建立基线。由于 mergekit 未包含默认超参数,我们采用了模型合并学习权威资源[35]中的参数设置。
Table 3: Performance comparison of the proposed methods and baselines. The table reports MGSM-JA accuracy and JP-LMEH average scores for our methods (PS, DFS, PS+DFS), unoptimized model merging techniques, fine-tuning approaches, and the distraction experiments with irrelevant models.
表 3:所提方法与基线的性能对比。结果显示 MGSM-JA 准确率和 JP-LMEH 平均分数对比,包括我们的方法(PS、DFS、PS+DFS)、非优化模型合并技术、微调方法以及无关模型干扰实验。
Method 方法 | Size 规模 | MGSM-JA (acc ) MGSM-JA(acc ) | JP-LMEH (avg ) JP-LMEH(avg ) |
---|---|---|---|
Ours (PS) 本方法(PS) | 7B | 52.0 | 70.5 |
Ours (DFS) 本方法(DFS) | 10B | 36.4 | 53.2 |
Ours (PS+DFS) 本方法(PS+DFS) | 10B | 55.2 | 66.2 |
Unoptimized model merging: 非优化模型合并:
TIES-Merge [64] | 7B | 4.4 | 63.7 |
DARE-TIES [66] | 7B | 35.2 | 66.3 |
Frankenmerging [22] | 9B | 0.0 | 16.1 |
Fine-tuning: 微调:
LoRA, Shisa Gamma 7B v1 | 7B | 20.8 | 66.4
LoRA, WizardMath 7B v1.1 LoRA,WizardMath 7B v1.1 | 7B | 43.2 | 55.9 |
LoRA, Abel 7B 002 LoRA,Abel 7B 002 | 7B | 40.0 | 52.8 |
Full, Shisa Gamma 7B v1 | 7B | 19.2 | 54.5
Full, WizardMath 7B v1.1 完整模型,WizardMath 7B v1.1 | 7B | 33.6 | 43.4 |
Full, Abel 7B 002 完整模型,Abel 7B 002 | 7B | 36.4 | 37.5 |
Distraction: 干扰项:
+0 irrelevant models 无关模型 | 7B | 50.0 | 65.9
+1 irrelevant model 无关模型 | 7B | 46.8 | 64.2
+2 irrelevant models 无关模型 | 7B | 46.8 | 64.1
+4 irrelevant models 无关模型 | 7B | 48.4 | 64.0
+8 irrelevant models 无关模型 | 7B | 40.8 | 65.8
For the PS baseline equivalents (TIES-Merge and DARE-TIES), we used shisa-gamma-7b-v1, WizardMath-7B-V1.1, and Abel-7B-002 as the source models, allowing direct comparison with our PS experiments. Similarly, for the DFS baseline equivalent (Frankenmerging), we used the same combination of source models as in our DFS experiments, enabling direct comparisons.
对于 PS 基线等效方法(TIES-Merge 和 DARE-TIES),我们使用 shisa-gamma-7b-v1、WizardMath-7B-V1.1 和 Abel-7B-002 作为源模型,以便与我们的 PS 实验直接比较。同样地,对于 DFS 基线等效方法(Frankenmerging),我们采用了与 DFS 实验中相同的源模型组合,从而实现直接对比。
The results indicate that TIES-Merge and Frankenmerging led to decreased performance in both MGSM-JA and JP-LMEH, suggesting unsuccessful merging. Notably, Frankenmerging resulted in a MGSM-JA score of 0, which is consistent with its general tendency to often produce models that fail to generate meaningful outputs. DARE-TIES performed relatively better among these baselines, with a slight improvement in MGSM-JA compared to the source models. However, our PS model significantly outperformed all these baselines, and the hybrid merging method (PS+DFS) gives the best test performance on the target task.
结果表明,TIES-Merge 和 Frankenmerging 导致 MGSM-JA 和 JP-LMEH 性能下降,表明合并未获成功。值得注意的是,Frankenmerging 的 MGSM-JA 得分为 0,这与其常生成无意义输出模型的普遍倾向一致。在这些基线方法中,DARE-TIES 表现相对较好,其 MGSM-JA 得分较源模型略有提升。然而,我们的 PS 模型显著优于所有这些基线方法,而混合合并方法(PS+DFS)在目标任务上取得了最佳测试性能。
3.2.2 Comparison with Fine-tuning
3.2.2 与微调方法的比较
Table 3 also compares our evolutionary model merging approach with the fine-tuning method. To ensure a fair comparison, we used the identical dataset. Specifically, we utilized the 1069 samples of Japanese-translated GSM8k that were employed in our PS and DFS optimization for fine-tuning the models. We tested learning rates of 1e-5, 5e-5, and 1e-4, choosing the optimal one. We set the number of epochs to 3 and tried both LoRA and full parameter fine-tuning. We experimented with all three source models: shisa-gamma-7b-v1, WizardMath-7B-V1.1, and Abel-7B-002.
表 3 还对比了我们的进化模型合并方法与微调方法。为确保公平比较,我们使用了相同的数据集。具体而言,我们采用 PS 和 DFS 优化中使用的 1069 个日语翻译 GSM8k 样本对模型进行微调。我们测试了 1e-5、5e-5 和 1e-4 三种学习率并选择最优值,设置训练轮数为 3,同时尝试了 LoRA 和全参数微调。我们对三个源模型(shisa-gamma-7b-v1、WizardMath-7B-V1.1 和 Abel-7B-002)均进行了实验。
The MGSM-JA scores show that while fine-tuning generally improved performance over the original models, the highest score achieved was 43.2. This is notably lower than our proposed method, which exceeded a score of 50. Additionally, unlike our approach, fine-tuning often led to significant decreases in JP-LMEH scores. This suggests that fine-tuning may be more data-hungry and prone to catastrophic forgetting issues.
MGSM-JA 分数显示,虽然微调通常能提升原始模型性能,但最高得分仅为 43.2,显著低于我们提出的超过 50 分的方法。此外,与我们的方法不同,微调常导致 JP-LMEH 分数大幅下降,这表明微调可能更依赖数据且容易产生灾难性遗忘问题。
Other potential fine-tuning baselines.
Another potential fine-tuning baseline would be combining all data used by all source models to fine-tune a base model. However, this approach is often impractical in general settings, and in our specific case, it is indeed impossible to implement for comparison. We elaborate on these points below. Our proposed approach’s strength lies in its ability to freely combine openly published models from platforms like HuggingFace. However, model publishers often only release the models themselves, not the training datasets or detailed training methods, as these are considered proprietary knowledge. For instance, the training data for WizardMath-7B-V1.1 and Abel-7B-002 is not publicly available. shisa-gamma-7b-v1 is a further fine-tuned version of japanese-stablelm-base-gamma-7b (which is Mistral-7B-v0.1 fine-tuned for Japanese), and the data used for training japanese-stablelm-base-gamma-7b is also not public. Moreover, the Japanese fine-tuning for japanese-stablelm-base-gamma-7b involved 100B tokens, indicating a substantial computational cost (such as more than 10,000 GPU hours). Therefore, a comparison with this approach is not feasible. It is also worth noting that such fine-tuning would require careful tuning of dataset mixing ratios, which is non-trivial and costly.
其他潜在的微调基线。另一种潜在的微调基线是将所有源模型使用的数据合并后对基础模型进行微调。然而,这种方法在通用场景中通常不切实际,而在我们的具体案例中,确实无法实现对比。我们在下文详述这些要点。我们提出方法的优势在于能自由组合 HuggingFace 等平台公开的模型,但模型发布者通常仅公开模型本身,不提供训练数据集或详细训练方法——这些被视为专有知识。例如 WizardMath-7B-V1.1 和 Abel-7B-002 的训练数据未公开,shisa-gamma-7b-v1 是 japanese-stablelm-base-gamma-7b(基于 Mistral-7B-v0.1 日语微调的版本)的进一步微调版,而 japanese-stablelm-base-gamma-7b 的训练数据同样未公开。此外,japanese-stablelm-base-gamma-7b 的日语微调消耗了 1000 亿 token,意味着巨大的计算成本(如超 1 万 GPU 小时)。因此与该方法的对比不可行。值得注意的是,此类微调还需谨慎调整数据集混合比例,这既复杂又昂贵。
3.2.3 Impact of Manual Model Selection
3.2.3 人工模型选择的影响
In our main experiments, we manually selected three source models. To assess the broader applicability of our approach, it is useful to investigate how sensitive the method is to the selection of source models. To this end, we conducted experiments to measure the distraction effect of including irrelevant models, which simulate the scenario where suboptimal models are inadvertently selected.
在主体实验中我们人工选择了三个源模型。为评估方法的普适性,有必要研究该方法对源模型选择的敏感性。为此我们通过引入无关模型测量干扰效应,模拟意外选择次优模型的情景。
In addition to our original seed models (shisa-gamma-7b-v1, WizardMath-7B-V1.1, Abel-7B-002), we added one, two, four, and eight irrelevant models and performed PS experiments to examine the impact on performance.
The irrelevant models, chosen for their lack of relation to Japanese or mathematics, were: leo-mistral-hessianai-7b (German), lince-mistral-7b-it-es (Spanish), Mistral-7B-v0.3-Chinese-Chat (Chinese), BioMistral-7B (biomedical domain), Claire-Mistral-7B-0.1 (French), komt-mistral-7b-v1 (Korean), Mistral-7B-v0.1-arabic (Arabic), and Loquace-7B-Mistral (Italian). All these models are openly available on HuggingFace.
为探究方法对源模型选择的敏感性,我们通过引入无关模型测量干扰效应。除原始种子模型(shisa-gamma-7b-v1、WizardMath-7B-V1.1、Abel-7B-002)外,我们分别实验引入 1/2/4/8 个无关模型,并通过 PS 实验观察性能影响。所选无关模型均与日语或数学无关,包括:leo-mistral-hessianai-7b(德语)、lince-mistral-7b-it-es(西班牙语)、Mistral-7B-v0.3-Chinese-Chat(中文)、BioMistral-7B(生物医学领域)、Claire-Mistral-7B-0.1(法语)、komt-mistral-7b-v1(韩语)、Mistral-7B-v0.1-arabic(阿拉伯语)和 Loquace-7B-Mistral(意大利语)。这些模型均可在 HuggingFace 公开获取。
We slightly modified the PS optimization by adding weight normalization, which became necessary due to the increased number of models (up to 11 in total). This modification explains why the "+0 irrelevant model" result does not exactly match our original PS result.
我们对 PS 优化稍作修改,增加了权重归一化处理——这是由于模型数量增加(最多达 11 个)的必要调整。该修改解释了"+0 无关模型"结果与原 PS 结果不完全一致的原因。
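The exact form of this normalization is not spelled out above, so the following is only a minimal sketch of one plausible implementation: the evolved, unconstrained coefficients are clipped to be non-negative and rescaled to sum to one, which keeps the merged parameters on a comparable scale no matter how many source models (here up to 11) are included. The function name and the clipping choice are illustrative.

```python
import numpy as np

def normalize_merge_weights(raw_weights: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map unconstrained evolved coefficients to non-negative weights summing to 1."""
    w = np.maximum(raw_weights, 0.0)   # clip negative coefficients (illustrative choice)
    return w / (w.sum() + eps)         # rescale so the weights sum to one

# Example: raw CMA-ES coefficients for 11 candidate source models.
raw = np.random.randn(11)
weights = normalize_merge_weights(raw)
print(weights, weights.sum())
```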
The results, shown in the "Distraction" part of Table 3, are generally favorable. As expected, using only the initial three models yielded the best results. However, the performance degradation when adding irrelevant models was limited, especially up to four additional models, and even with 8 additional models, the performance did not collapse. This demonstrates that our optimization method is relatively robust to manual source model selection. Due to the huge GPU memory requirement at inference time, the distraction experiments were not conducted for the DFS merging method. We instead present behavior analysis and ablation studies on the topic in the following sections.
结果如表 3 的"干扰"部分所示,总体上表现良好。正如预期,仅使用初始三个模型时效果最佳。但加入无关模型时的性能下降有限,特别是在添加不超过四个额外模型时,即便添加 8 个额外模型性能也未崩溃。这表明我们的优化方法对人工选择的源模型具有较强鲁棒性。由于推理时 GPU 显存需求过大,未对 DFS 合并方法进行干扰实验,相关行为分析和消融研究将在后续章节展开。
3.2.4 Scaling to Larger Models
3.2.4 大规模模型扩展
Finally, we demonstrate that our proposed methods also work with larger models. Concretely, we used fine-tuned versions of Llama-2-13b: ELYZA-japanese-Llama-2-13b-instruct as the Japanese general model and MetaMath-13B-V1.0 as the English math model. All other settings remained identical to the 7B experiments. Due to the limited availability of suitable 13B models, we could only select two source models.
最后我们验证了该方法在大模型上的适用性。具体采用 Llama-2-13b 的微调版本:日语通用模型 ELYZA-japanese-Llama-2-13b-instruct 和英语数学模型 MetaMath-13B-V1.0,其他设置与 7B 实验保持一致。因适用 13B 模型有限,仅能选择两个源模型。
The results are shown in Table 4. Similar to the 7B case, the MGSM-JA score significantly improved over both source models. This consistent behavior at the 13B size demonstrates the scalability of our evolutionary model merging approach.
结果如表 4 所示。与 7B 情况类似,MGSM-JA 分数相对两个源模型均有显著提升。这种在 13B 规模上的一致性验证了我们进化式模型合并方法的可扩展性。
Table 4: Scaling to 13B models and DFS merging ablations. Performance comparison and ablation results on the MGSM-JA and JP-LMEH tasks; gray rows are ablation studies.
表 4:13B 模型扩展及 DFS 合并消融研究。MGSM-JA 和 JP-LMEH 任务的性能对比与消融结果,灰色行为消融研究。
# | Model | Type 类型 | Size 规模 | MGSM-JA (acc) MGSM-JA(准确率) | JP-LMEH (acc) JP-LMEH(准确率) |
---|---|---|---|---|---|
0 | Llama-2-13b | EN general 英语通用 | 13B | 2.8 | 54.1 |
1 | ELYZA-japanese-Llama-2-13b-instruct | JA general 日语通用 | 13B | 13.2 | 60.2 |
2 | MetaMath-13B-V1.0 | EN math 英语数学 | 13B | 8.0 | 48.7 |
3 | Ours (PS) 本方法(PS) | 1 + 2 | 13B | 31.2 | 59.7 |
4 | Ours (DFS) 我们的方法(DFS) | 1 + 2 | 19B | 12.8 | 61.1 |
5 | Ours (DFS) 我们的方法(DFS) | 2 + 1 | 21B | 23.2 | 46.6 |
6 | Ours (PS+DFS w/o W) 我们的方法(PS+DFS 不含 W) | 3 + 1 | 19B | 26.4 | 58.1 |
7 | Ours (PS+DFS) 我们的方法(PS+DFS) | 3 + 1 | 21B | 34.0 | 60.4 |
The 7B models achieved higher MGSM-JA scores (52.0, 36.4, 55.2) than the 13B models (31.2, 23.2, 34.0) for (PS, DFS, PS+DFS). This difference is attributed to the source models rather than our proposed method. Mistral-7B-v0.1 is known to outperform Llama-2-13b, especially in basic mathematical abilities, despite its smaller size. This is reflected in the performance of math-specialized models: WizardMath-7B-V1.1 scores 83.2 on GSM8k (an English math benchmark), while MetaMath-13B-V1.0 scores 72.3. This difference explains the performance gap in Japanese math abilities between our merged 7B and 13B models.
在(PS、DFS、PS+DFS)任务中,7B 模型获得的 MGSM-JA 分数(52.0、36.4、55.2)高于 13B 模型(31.2、23.2、34.0)。这一差异源于基础模型本身而非我们的方法。已知 Mistral-7B-v0.1 尽管规模较小,但其性能(尤其是基础数学能力)优于 Llama-2-13b——数学专项模型的表现印证了这点:WizardMath-7B-V1.1 在 GSM8k(英文数学基准)上获得 83.2 分,而 MetaMath-13B-V1.0 得分为 72.3。这一差距解释了我们的 7B 与 13B 合并模型在日语数学能力上的表现差异。
On the other hand, our DFS and PS+DFS models (models #4 and #7) achieved higher JP-LMEH average scores than ELYZA-japanese-Llama-2-13b-instruct, the Japanese general model.
In general, the qualitative behavior on JP-LMEH is consistent between 7B and 13B models. Both sets of results show improvements in JSQuAD, JAQKET, and MGSM scores, with slight decreases in JNLI and MARC compared to the original Japanese model, indicating similar trends in task-specific score changes (see Table 6).
另一方面,我们的 DFS 和 PS+DFS 模型(模型#4 与#7)在 JP-LMEH 平均分上超越了日语通用模型 ELYZA-japanese-Llama-2-13b-instruct。总体而言,7B 与 13B 模型在 JP-LMEH 上的定性表现具有一致性。两组结果均显示 JSQuAD、JAQKET 和 MGSM 分数提升,而 JNLI 与 MARC 分数相较原始日语模型略有下降,这表明任务特定分数变化趋势相似(见表 6)。
3.2.5 Analysis on DFS Merging
3.2.5 DFS 合并分析
We included the results from two ablation studies for DFS merging in Table 4, and we wish to gain insights into how and why DFS merging works.
我们在表 4 中列入了两项关于 DFS 合并的消融实验结果,旨在探究 DFS 机制的工作原理及其有效性。
Comparing models #4 and #5, we show that the order of the source models in DFS merging plays a critical role. Specifically, if we put the layers from the English math model in front of those from the Japanese general model in the indicator array I (model #4), the accuracy drops by more than 10 points compared with the case where the layer order is reversed (model #5). This “sub-optimal” behavior is a trade-off between flexibility and search efficiency. In our experiments, we initialize the indicator array I with zeros, but set the values corresponding to the layers of model A in the first repetition to 3σ, where σ is the initial standard deviation of CMA-ES exploration (recall that in DFS, layers from model A come before those from model B in each repetition). In this way, the merged model behaves just like model A at the beginning of the DFS process, and is allowed to gradually add or remove layers if necessary. Compared with the alternative of initializing I with only zeros, this setting effectively avoids the cases where poor stacking of the layers causes the LLM to output complete nonsense and waste explorations, at the cost of losing the flexibility to automatically swap the order of models in I.
通过对比模型#4 与#5,我们发现源模型在 DFS 合并中的层顺序至关重要。具体而言,若在指示数组 中将英语数学模型的层置于日语通用模型之前(模型#4),其准确率会比层序相反的情况(模型#5)降低 10 分以上。这种"次优"表现是灵活性与搜索效率之间的权衡。实验中,我们将指示数组 初始化为零值,但在首次重复时将模型 A 对应层的值设为 (其中 为 CMA-ES 探索的初始标准差)。如此,合并模型在 DFS 初期行为与模型 A 完全一致,后续可逐步增减层数。相较于全零初始化方案 ,该设置有效避免了因层堆叠不当导致 LLM 输出乱码而浪费搜索资源的情况,但代价是丧失了自动调换 中模型顺序的灵活性。
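To make the initialization described above concrete, here is a small numpy sketch of how the CMA-ES mean vector for the indicator array I could be built: every entry starts at zero except those for model A's layers in the first repetition, which are pushed to 3σ. The layer counts, number of repetitions, and function name are placeholders for illustration.

```python
import numpy as np

def init_indicator_mean(num_layers_a: int, num_layers_b: int,
                        repetitions: int, sigma: float) -> np.ndarray:
    """Initial CMA-ES mean for the layer-inclusion indicator array I.

    In each repetition, model A's layers precede model B's layers. Only the
    entries for model A's layers in the first repetition are set to 3*sigma,
    so the merged model initially behaves exactly like model A and the search
    can then gradually add or remove layers.
    """
    per_repetition = num_layers_a + num_layers_b
    mean = np.zeros(repetitions * per_repetition)
    mean[:num_layers_a] = 3.0 * sigma
    return mean

# Example: two 40-layer 13B models, 3 repetitions, initial CMA-ES step size 0.1.
mean = init_indicator_mean(40, 40, repetitions=3, sigma=0.1)
print(mean[:5], mean.shape)
```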
In our second ablation study, the difference between models #6 and #7 highlights the importance of including the scaling matrix W. Without scaling, the performance of model #6 is even worse than the PS-merged model, one of its source models. While sufficient in this setting, we wish to point out that this simple scaling treatment does not work if any of the source models has been fine-tuned for a significantly longer time. We suspect that prolonged fine-tuning might lead to increased incompatibility for data traversing through layers from different models, possibly due to permutations in the weight matrices. For example, suppose we order the 4096 dimensions of the hidden states by their absolute values: data from layer 1 in model A might look like (1, 2, …, 4096); in model B it might appear as (1.1, 1.9, …, 4097) when only slightly fine-tuned, but as (4001, 133, 3, …, 989) after extended fine-tuning.
第二项消融研究通过模型#6 与#7 的对比揭示了缩放矩阵 的重要性。未引入缩放时,模型#6 性能甚至逊于其源模型之一的 PS 合并模型。尽管当前设置中缩放机制已足够有效,但需指出:若任一源模型经过超长时微调,这种简单缩放将失效。我们推测长期微调可能导致跨模型层间数据兼容性下降,这或许源于权重矩阵的排列变化。例如,假设按隐藏状态 4096 维度的绝对值排序,模型 A 第 1 层数据可能呈现(1,2,…4096),而经深度微调的模型 B 中可能变为(4001,133,3,…989)而非轻微微调时的(1.1,1.9,…4097)。
While it is hard to thoroughly understand why DFS merging improves performance, we analyzed the merged model (model #5 in Table 4) to gain insights. Figure 5 shows the final hops and scales configuration of the merged model. It is important to note that we forced the scaling factor W_ij = 1 whenever the i-th and j-th layers come from the same model. This prevents the possibility of introducing extra modeling power via W; in this manner, we are certain that all of the performance boost comes from layer stacking and permutation.
It is interesting and surprising to notice that the DFS-merged model decides to skip layer #30 of MetaMath-13B-V1.0 at the beginning of the inference path. Based on this finding, we removed layer #30 from MetaMath-13B-V1.0, and without any further modifications its MGSM-JA accuracy increased from 8.0% to 10%. This implies that DFS merging works partly because it is able to identify redundant or harmful operations (with respect to a specific task) embedded in certain layers, and performance increases once these layers are removed.
虽然难以彻底理解 DFS 合并为何能提升性能,但我们通过对合并模型(表 4 中的模型#5)的分析获得了一些洞见。图 5 展示了该合并模型的最终跳跃路径与缩放系数配置。值得注意的是,我们强制规定当 和 层来自同一模型时,缩放因子设为 。这是为了避免通过 引入额外建模能力的可能性,从而确保所有性能提升都源于层堆叠与排列。有趣且令人惊讶的是,DFS 合并模型决定在推理路径起始处跳过 MetaMath-13B-V1.0 的第 30 层。基于这一发现,我们移除了 MetaMath-13B-V1.0 的第 30 层,未作其他修改时性能竟提升了 10%。这表明 DFS 合并的有效性在于它能识别特定任务中某些层内冗余/有害的操作,移除这些层即可提升性能。
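As a rough illustration of this constraint, the sketch below walks a hidden state along an evolved inference path and applies a learned scale only when two consecutive layers come from different source models; within the same model the scale is fixed to 1, so any gain must come from the layer stacking itself. This is a simplification: the scales are looked up per pair of adjacent models rather than per pair of layer indices as in the scaling matrix W, and each layer is stubbed as a plain callable.

```python
import torch

def run_inference_path(hidden, path, cross_model_scales):
    """Apply layers along an evolved DFS inference path.

    hidden             : current hidden-state tensor
    path               : list of (model_id, layer_fn) pairs in evolved order
    cross_model_scales : dict mapping (prev_model_id, next_model_id) -> learned scale
    """
    prev_model = None
    for model_id, layer_fn in path:
        if prev_model is not None and prev_model != model_id:
            # Crossing between source models: rescale to reduce distribution mismatch.
            hidden = hidden * cross_model_scales[(prev_model, model_id)]
        # Staying within the same model: the scale is forced to 1 (no rescaling).
        hidden = layer_fn(hidden)
        prev_model = model_id
    return hidden

# Toy usage with identity "layers" from two models A and B.
path = [("A", torch.nn.Identity()), ("A", torch.nn.Identity()),
        ("B", torch.nn.Identity()), ("A", torch.nn.Identity())]
scales = {("A", "B"): 0.9, ("B", "A"): 1.1}
out = run_inference_path(torch.randn(1, 8), path, scales)
print(out.shape)
```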

Figure 5: Evolved configurations of the DFS-merged 13B model. The style and settings of this figure are identical to those of Figure 4. Surprisingly, DFS decides to skip layer #30 in the original model (model #2 in Table 4). If we simply remove this layer from model #2 without any other modifications, its performance on MGSM-JA increases to 10%. The configuration shown is that of model #5 in Table 4.
图 5:DFS 合并 13B 模型的演化配置。本图样式与设置与图 4 完全相同。令人惊讶的是,DFS 选择跳过原模型(表 4 中模型#2)的第 30 层。若我们仅从模型#2 中移除该层且不作其他修改,其在 MGSM-JA 上的性能将提升至 10%。此配置来自表 4 中的模型#5。
Furthermore, we analyzed the problems that our DFS-merged model answered correctly while the math source model did not, and identified two main scenarios where DFS helped improve performance; see Figure 6 for illustrations. In the first scenario, the English math model was able to reason correctly, but it output the reasoning steps or answers in English, violating the requirement that the response must be written in Japanese. In this scenario, DFS merging managed to shift the distribution of output tokens toward Japanese vocabulary, fixing the problem. In the second scenario, the English math model was actually reasoning in Japanese, but due to errors in understanding the Japanese problem or in the reasoning itself, it was not able to give correct answers. DFS merging, by appending layers from the Japanese general model, was able to largely avoid this problem. Unlike the “subtraction” described previously, these scenarios suggest that DFS also works because it is capable of “addition”: finding what is missing in one of the source models and compensating for it by adding layers from other models to the inference path.
此外,我们分析了 DFS 合并模型答对而数学源模型答错的问题,发现 DFS 主要通过两种场景提升性能(示例见图 6)。第一种场景中,英文数学模型本可正确推理,但其输出步骤或答案使用英语,违反了必须用日语回答的要求。DFS 合并通过将输出词分布向日语词汇偏移解决了该问题。第二种场景中,英文数学模型虽用日语推理,但由于对日语问题的理解错误或推理失误未能给出正确答案。DFS 合并通过追加日语通用模型的层,大幅规避了这个问题。与先前"减法"操作不同,这些场景表明 DFS 还能通过"加法"发挥作用——发现源模型缺失的能力,并通过从其他模型添加推理路径中的层来补偿。

Figure 6: Why DFS helps improve performance. The DFS-merged model improves its performance on MGSM-JA mainly in two ways: (1) shifting the output token distribution toward Japanese (Scenario 1), and (2) helping the LLM better understand the problems (Scenario 2). We show the raw outputs from models #2 and #5 in Table 4, with redundant lines removed for better visualization.
图 6:DFS 提升性能的原因。DFS 合并模型主要通过以下两种方式提升 MGSM-JA 任务表现:(1) 将输出词元分布向日语倾斜(情景 1);(2) 帮助 LLM 更好地理解问题(情景 2)。图中展示了表 4 中模型#2 和#5 的原始输出,但为便于可视化删减了冗余行。
3.3 Evolving Japanese VLM 3.3 日语视觉语言模型的进化
3.3.1 Multi-modality Extension
3.3.1 多模态扩展
We now extend our method to multi-modal models and evolve a Japanese VLM that is aware of culturally-specific content. VLMs have recently shown remarkable progress by applying the powerful instruction-following capabilities of pre-trained LLMs. The architecture of a VLM generally consists of three components: (1) a vision encoder to extract image features; (2) an LLM to generate text (for the purpose of describing an image); and (3) a projection network to map image features into the LLM’s embedding space [36, 16, 40, 38, 12]. Crucially, the LLM component is initialized with a powerful pre-trained LLM for its text generation capabilities. During training, the projection network and optionally the LLM are trained on various vision-language datasets, while the vision encoder is kept fixed.
我们将方法扩展至多模态领域,进化出具有文化特定内容理解能力的日语视觉语言模型(VLM)。近年来,通过结合预训练 LLMs 强大的指令跟随能力,VLMs 取得了显著进展。典型 VLM 架构包含三个组件:(1) 提取图像特征的视觉编码器;(2) 生成文本(用于描述图像)的 LLM;(3) 将图像特征映射到 LLM 嵌入空间的投影网络[36,16,40,38,12]。关键之处在于,LLM 组件会初始化具有强大文本生成能力的预训练模型。训练过程中,投影网络和 LLM(可选)会在多模态数据集上进行微调,而视觉编码器保持固定。
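To make this three-component architecture concrete, here is a minimal, self-contained sketch of a LLaVA-style forward pass: image features from a stand-in vision encoder are projected into the LLM's embedding space and prepended to the text token embeddings as visual soft prompts. All module sizes and names are illustrative placeholders, not the actual LLaVA-1.6 implementation.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative VLM skeleton: vision encoder + projector + language model."""

    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)      # stand-in for a ViT
        self.projector = nn.Sequential(                               # maps into LLM space
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.tok_embed = nn.Embedding(vocab_size, llm_dim)            # part of the LLM
        self.llm = nn.TransformerEncoder(                             # stand-in for the LLM
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, input_ids):
        img_feat = self.vision_encoder(image.flatten(1))              # (B, vision_dim)
        visual_tokens = self.projector(img_feat).unsqueeze(1)         # (B, 1, llm_dim)
        text_tokens = self.tok_embed(input_ids)                       # (B, T, llm_dim)
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)     # soft prompt + text
        return self.lm_head(self.llm(sequence))                       # (B, T + 1, vocab)

vlm = TinyVLM()
logits = vlm(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```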
3.3.2 Experimental Setup 3.3.2 实验设置
Source Models 源模型
The LLM component inside a VLM can be regarded as a standalone LLM, with the extra capability of understanding visual soft prompts.
From this perspective, by fixing the vision encoder and the projection network and only focusing on the LLM component, it is straightforward to apply the methodologies detailed in Section 2 to produce a new LLM with expanded capabilities.
VLM 中的 LLM 组件可视为具备视觉软提示理解能力的独立 LLM。基于此视角,通过固定视觉编码器和投影网络并仅聚焦 LLM 组件,可直接应用第 2 章所述方法生成具备扩展能力的新 LLM。
In this experiment, we merge a Japanese LLM and the LLM component in a VLM in the parameter space. We select shisa-gamma-7b-v1 [10] as the Japanese LLM and LLaVA-1.6-Mistral-7B [39] as the VLM. Both models are fine-tunes of the Mistral-7B-v0.1 [29] base model.
本实验中,我们在参数空间合并日语 LLM 与 VLM 的 LLM 组件。选择 shisa-gamma-7b-v1[10]作为日语 LLM,LLaVA-1.6-Mistral-7B[39]作为 VLM。二者均基于 Mistral-7B-v0.1[29]微调而来。
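A minimal sketch of this recipe is shown below, under the assumption (purely for illustration) that the VLM checkpoint stores its LLM weights under a "language_model." key prefix. Only those tensors are mixed with the corresponding Japanese LLM tensors, while the vision encoder and projection network are copied through unchanged; plain linear interpolation stands in here for the TIES-Merging-with-DARE recipe actually used.

```python
import torch

def merge_llm_component(vlm_state, ja_llm_state, weights, prefix="language_model."):
    """Merge only the LLM weights inside a VLM state dict.

    vlm_state    : state dict of the source VLM (vision encoder + projector + LLM)
    ja_llm_state : state dict of the Japanese LLM (same architecture as the VLM's LLM)
    weights      : evolved per-parameter mixing weights in [0, 1]
    prefix       : assumed key prefix of the LLM component inside the VLM checkpoint
    """
    merged = {}
    for name, tensor in vlm_state.items():
        if name.startswith(prefix) and name[len(prefix):] in ja_llm_state:
            w = weights.get(name, 0.5)
            merged[name] = (1.0 - w) * tensor + w * ja_llm_state[name[len(prefix):]]
        else:
            merged[name] = tensor.clone()   # vision encoder / projector stay fixed
    return merged

# Toy usage with two single-tensor "models".
vlm = {"vision_tower.weight": torch.ones(2), "language_model.weight": torch.zeros(2)}
ja_llm = {"weight": torch.ones(2)}
print(merge_llm_component(vlm, ja_llm, weights={"language_model.weight": 0.25}))
```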
Dataset 数据集
To the best of our knowledge, publicly accessible Japanese VLM datasets are scarce. In response, we created a new open Japanese VLM benchmark and assessed our VLM on a widely recognized Japanese VQA dataset. Our new benchmark dataset consists of:
据我们所知,公开可用的日语 VLM 数据集稀缺。为此,我们创建了新的日语 VLM 基准测试,并在广泛认可的日语 VQA 数据集上评估模型。新基准数据集包含:
- JA-VG-VQA-500: A 500-sample test set extracted from the Japanese Visual Genome VQA dataset [54].
  JA-VG-VQA-500:从日语 Visual Genome VQA 数据集[54]提取的 500 条测试样本
- JA-VLM-Bench-In-the-Wild: A Japanese version of LLaVA-Bench-In-the-Wild [40]. We compiled a rich collection of 42 images, accompanied by a total of 50 questions, featuring a variety of Japanese cultural elements and objects found in Japan. The QAs were crafted with the assistance of GPT-4V [2] and underwent a human-in-the-loop filtering process to eliminate nonsensical outcomes. Compared to the JA-VG-VQA-500 dataset, our set poses more complex challenges, demanding more nuanced and detailed responses.
  JA-VLM-Bench-In-the-Wild:LLaVA-Bench-In-the-Wild[40]的日语版本。我们收集了 42 张包含日本文化元素和本土物体的图像,配套 50 个问题。问答对在 GPT-4V[2]辅助下生成,并经过人工筛选剔除无效结果。相比 JA-VG-VQA-500,该数据集需要更细致复杂的回答。
We used another subset of the Japanese Visual Genome VQA dataset during the evolutionary search. This subset does not overlap with examples in the JA-VG-VQA-500 dataset, to avoid leakage in the optimization process.
进化搜索阶段使用了日语 Visual Genome VQA 数据集的另一子集,该子集与 JA-VG-VQA-500 无重叠样本以避免优化过程泄露。
The images in the JA-VLM-Bench-In-the-Wild dataset, which predominantly represent Japanese content, were carefully selected by native Japanese speakers to minimize the risk of insensitive or biased representations. All images were sourced from Unsplash and are published under the Unsplash license, which ensures that there are no ethical or legal issues with their use. Unsplash contributors agree to obtain necessary permissions from related individuals in their photos before uploading, which addresses potential concerns regarding the use of images containing recognizable humans. Additionally, the captions were generated using GPT-4-V and were meticulously reviewed by human annotators to ensure accuracy and cultural sensitivity. This process aims to create a dataset that is both ethically sourced and culturally appropriate.
JA-VLM-Bench-In-the-Wild 数据集中的图像主要代表日本内容,这些图像由日语母语者精心挑选,以最大程度地减少敏感或有偏见的表现风险。所有图像均来自 Unsplash,并按照 Unsplash 许可发布,确保其使用不存在伦理或法律问题。Unsplash 的贡献者同意在上传前获得照片中相关个人的必要许可,这解决了使用包含可识别人物图像的潜在问题。此外,标题使用 GPT-4-V 生成,并由人工标注者仔细审核,以确保准确性和文化敏感性。这一过程旨在创建一个既符合伦理来源又具有文化适应性的数据集。
Evaluation 评估
We consider two baselines in our experiments: LLaVA-1.6-Mistral-7B [39], one of our source models, and Japanese Stable VLM [55], a Japanese VLM trained from scratch on Japanese datasets.
我们在实验中设置了两个基线模型:作为源模型之一的 LLaVA-1.6-Mistral-7B[39],以及完全基于日本数据集训练的日语视觉语言模型 Japanese Stable VLM[55]。
All models adopt the same generation configurations, with deterministic decoding. We compute ROUGE-L after applying a Japanese language detector that replaces non-Japanese responses with empty text, so non-Japanese responses receive a score of zero. To be consistent with our LLM experiments in Section 3.1, we also employed fasttext [31, 30] for this language detection task. However, we made an exception for cases where the ground-truth answer itself contains non-Japanese but commonly seen words in Japanese texts (e.g., a widely recognized acronym such as “UFO”). In these instances, non-Japanese responses from models are not converted to empty texts.
所有模型均采用相同的确定性解码生成配置。我们通过日语检测器计算 ROUGE-L 分数,将非日文响应替换为空文本(非日文响应得分为零)。为保持与 3.1 节 LLM 实验的一致性,语言检测同样使用 fasttext[31,30]。但当地面真值答案本身包含日语文本中常见的非日文词汇(如"UFO"等广泛认可的缩写词)时,模型生成的非日文响应将保留原样。
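The scoring rule above can be summarized in a few lines of code. The sketch below assumes the fastText language-identification model lid.176.bin is available locally and that a tokenizer for Japanese text is supplied by the caller (Japanese tokenization itself is out of scope here); everything else, including the helper names, is illustrative.

```python
import fasttext

lang_id = fasttext.load_model("lid.176.bin")  # fastText language-identification model

def is_japanese(text: str) -> bool:
    labels, _ = lang_id.predict(text.replace("\n", " "))
    return labels[0] == "__label__ja"

def rouge_l(pred_tokens, ref_tokens) -> float:
    """F1 over the longest common subsequence (standard ROUGE-L)."""
    m, n = len(pred_tokens), len(ref_tokens)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred_tokens[i] == ref_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

def score(response, reference, tokenize, ref_has_common_loanword=False):
    # Non-Japanese responses are replaced with empty text (score 0), unless the
    # reference itself contains a commonly seen non-Japanese word such as "UFO".
    if not is_japanese(response) and not ref_has_common_loanword:
        return 0.0
    return rouge_l(tokenize(response), tokenize(reference))
```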
Optimization 优化方法
We use the identical settings as the earlier LLM merging experiments in Section 3.1. Concretely, we use TIES-Merging with DARE for merging the source models in the parameter space. For merging in the data flow space, we treat LLaVA-1.6-Mistral-7B as our model A and shisa-gamma-7b-v1 as model B. For PS+DFS, our PS-merged model is model A and shisa-gamma-7b-v1 is model B.
采用与 3.1 节 LLM 参数空间(PS)融合实验完全相同的设置:使用 TIES-Merging 结合 DARE 方法进行源模型的参数空间融合。在数据流空间(DFS)融合中,将 LlaVa 1.6 Mistral 7B 作为模型 A,shisa-gamma-7b-v1 作为模型 B;对于 PS+DFS 混合融合,PS 融合后的模型作为模型 A,shisa-gamma-7b-v1 仍为模型 B。
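Since the recipe names two specific techniques, the following rough, self-contained sketch shows what "TIES-Merging with DARE" amounts to for a single parameter tensor: DARE randomly drops entries of each task vector and rescales the survivors, TIES then elects a per-entry sign from the summed deltas and averages only the deltas that agree with it, and the per-model coefficients stand in for the quantities the evolutionary search tunes. This is an illustrative reimplementation, not the MergeKit code we actually used, and the drop rate and coefficients are placeholders.

```python
import torch

def dare(delta: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """DARE: randomly drop task-vector entries, rescale survivors by 1/(1 - p)."""
    mask = (torch.rand_like(delta) > drop_rate).float()
    return delta * mask / (1.0 - drop_rate)

def ties_merge_with_dare(base: torch.Tensor, finetuned: list, coeffs: list,
                         drop_rate: float = 0.5) -> torch.Tensor:
    """TIES-style merge of several fine-tuned tensors onto a shared base tensor."""
    deltas = [c * dare(ft - base, drop_rate) for ft, c in zip(finetuned, coeffs)]
    stacked = torch.stack(deltas)                        # (num_models, ...)
    elected_sign = torch.sign(stacked.sum(dim=0))        # per-entry sign election
    agree = (torch.sign(stacked) == elected_sign).float()
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1.0)
    return base + merged_delta

# Toy usage: one base tensor and two "fine-tuned" variants with evolved coefficients.
base = torch.zeros(4)
finetuned = [base + torch.randn(4) for _ in range(2)]
print(ties_merge_with_dare(base, finetuned, coeffs=[0.6, 0.4]))
```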
3.3.3 Experimental Results
3.3.3 实验结果
Table 5 compares the performance of our VLM with the baselines. Please note that the Japanese Stable VLM cannot be evaluated on JA-VG-VQA-500 because it was trained on this dataset.
表 5 对比了我们的视觉语言模型与基线模型的性能。需注意 Japanese Stable VLM 无法在 JA-VG-VQA-500 数据集上评估,因其训练数据包含该数据集。
Our merged VLMs’ enhanced performance on the JA-VG-VQA-500 benchmark indicates their proficiency in Japanese, highlighting the successful integration of the source Japanese LLM with the LLM component of the original VLM through evolutionary merging. Consistent with the findings in the previous discussions, simple merging without evolutionary search does not reach the performance of our approach; see the last three rows in Table 5. Furthermore, our models’ superior results on JA-VLM-Bench-In-the-Wild compared to both baselines exhibit their adeptness at navigating culturally-specific content.
我们融合后的视觉语言模型在 JA-VG-VQA-500 基准上的优异表现,证实了通过进化式融合成功将日语 LLM 整合至原视觉语言模型的 LLM 组件中。与先前讨论一致,未经进化搜索的简单融合方法(见表 5 最后三行)性能显著弱于我们的方法。此外,我们的模型在 JA-VLM-Bench-In-the-Wild 上超越两个基线模型的表现,展现出处理文化特定内容的卓越能力。
Besides the quantitative results in Table 5, we qualitatively compare our VLM with the baseline models in Appendix C. Our evolved model is able to handle Japanese culture-specific content remarkably well, generally producing more detailed responses with correct information.
除表 5 的量化结果外,附录 C 提供了与基线模型的定性对比。进化后的模型能出色处理日本文化特定内容,通常能生成信息更准确、细节更丰富的响应。
Table 5: Performance comparison of the VLMs. LLaVA 1.6 Mistral 7B is the source VLM and Japanese Stable VLM is an open-sourced Japanese VLM. JA-VG-VQA-500 measures general Japanese VQA ability, while JA-VLM-Bench-In-the-Wild evaluates the ability to handle complex VQA about Japanese culture-specific content.
表 5:视觉语言模型性能对比。LLaVA 1.6 Mistral 7B 是源视觉语言模型,Japanese Stable VLM 是开源日语视觉语言模型。JA-VG-VQA-500 评估日语通用视觉问答能力,JA-VLM-Bench-In-the-Wild 则测试模型在日本文化语境下的复杂视觉问答处理能力。
Model 模型 | Size 模型规模 | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) |
---|---|---|---|
LLaVA 1.6 Mistral 7B | 8B | 14.3 | 41.1 |
Japanese Stable VLM | 8B | - | 40.5 |
Ours (PS) 我们的方法(PS) | 8B | 19.7 | 51.2 |
Ours (DFS) 我们的方法(DFS) | 12B | 16.8 | 46.5 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 11B | 20.4 | 47.6 |
TIES [64] | 8B | 16.0 | 46.5 |
DARE-TIES [66] | 8B | 9.4 | 36.0 |
Passthrough [22] 直通式[22] | 9B | 7.3 | 26.7 |
4 Discussion 4 讨论
In this report, we propose a general method that uses evolutionary techniques to efficiently discover the best ways to combine different models from the vast ocean of different open-source models with diverse capabilities. By working with the vast collective intelligence of existing open models, our method is able to automatically create new foundation models with desired capabilities specified by the user. We find that our approach is able to automatically discover novel ways to merge different models from vastly different domains (e.g., non-English language and Math, or non-English language and Vision), in non-trivial ways that might be difficult for human experts to discover themselves.
本报告中,我们提出了一种利用进化技术从海量不同能力的开源模型中高效发现最佳组合方式的通用方法。通过整合现有开放模型的群体智慧,我们的方法能根据用户指定的能力需求,自动创建具有目标特性的新基础模型。我们发现该方法能以非平凡方式自动发现跨领域模型(如非英语语言与数学,或非英语语言与视觉)的创新融合方案,这些方案即便是领域专家也难以手动设计。
To test our approach, we apply our method to automatically create a Japanese LLM capable of Math reasoning and a Japanese VLM aware of culturally-specific content. Surprisingly, we find that both models achieve state-of-the-art results on several LLM and vision benchmarks without being explicitly optimized for them, attaining top performance on a vast array of other Japanese LLM benchmarks and even exceeding the performance of some previous SOTA 70B-parameter Japanese LLMs.
为验证方法有效性,我们将其应用于自动创建具备数学推理能力的日语 LLM,以及具有文化特定内容识别能力的日语视觉语言模型(VLM)。令人惊讶的是,这两个模型在多项 LLM 和视觉基准测试中均达到最先进水平,且并未针对这些基准进行专门优化,同时在大量其他日语 LLM 基准测试中取得顶尖性能,甚至超越了某些先前 70B 参数的日语 SOTA 模型。
With these promising initial results, we believe we are just scratching the surface of the full capabilities of evolutionary model merging, and that this is the inception of a long-term effort to apply evolutionary principles to foundation model development.
基于这些初步成果,我们相信进化式模型融合的潜力才刚刚被发掘,这标志着将进化原理应用于基础模型开发的长期探索的开端。
Currently, we are already achieving promising results in applying evolutionary model merging to image diffusion models, enabling the creation of high performance cross-domain image generation models by merging existing building blocks in novel ways discovered by evolution.
目前,我们已在图像扩散模型中应用进化式模型融合技术,通过进化发现的新颖方式合并现有模块,成功创建了高性能跨领域图像生成模型。
The method currently requires the user to select a set of source models to use as ingredients for evolutionary search. We believe it is also possible to leverage evolution to search for candidate source models from a vast population of existing models as well. In addition to model selection, we are also exploring using evolution to produce swarms of diverse foundation models each with its own niche and behaviors. This holds the potential of enabling the emergence of a collective intelligence consisting of a swarm of models capable of self-improvement by continuously producing new complementary internal models of the world through interaction.
当前方法需要用户预先选择一组源模型作为进化搜索的素材。我们认为同样可以运用进化算法从海量现有模型中自动筛选候选源模型。除模型选择外,我们还在探索利用进化算法生成具有不同专长和行为模式的多样化基础模型集群。这将有望催生由模型群构成的集体智能,通过持续交互产生互补的世界内部模型,实现自我进化。
Related to our work is an experiment called Automerger [34], released at around the same time as this work. This interesting experiment works by selecting two random models from the top 20 models on the Open LLM Leaderboard [27] and randomly applying SLERP [61] or DARE-TIES [66, 64] to create new models. Over time, some of these models will perform well, or even better, on the benchmark tasks that define this leaderboard, and become part of the leaderboard. We predict this approach will lead to combinations of merged models that overfit to the benchmark tasks defined on the leaderboard. The author acknowledged that the idea behind this project was less about creating better models and more about getting more metrics to help derive a more principled approach to model merging.
与我们的工作相关的是一个名为 Automerger[34]的实验,该实验大约与本研究同期发布。这个有趣的实验从 Open LLM Leaderboard[27]排名前 20 的模型中随机选择两个模型,并随机应用 SLERP[61]或 DARE-TIES[66,64]来创建新模型。随着时间的推移,其中一些模型会在定义该排行榜的基准任务上表现良好,甚至更优,从而成为排行榜的一部分。我们预测这种方法将导致合并模型的组合对排行榜定义的基准任务过拟合。作者承认,该项目背后的理念与其说是为了创建更好的模型,不如说是为了获取更多指标,以帮助推导出更具原则性的模型合并方法。
Our work takes an orthogonal approach of optimizing for tasks outside of the domain specified by the original leaderboard [27], rather than being confined by it. As we have shown, surprisingly, stepping away from optimizing for a particular benchmark occasionally results in even greater generalization to numerous other benchmark tasks that we had not intended to optimize for, and such emergent generalization might be the key to unlocking the next great advancements in AI.
我们的工作采取了一种正交的方法,即针对原始排行榜[27]指定领域之外的任务进行优化,而不是受其限制。正如我们所展示的,令人惊讶的是,脱离对特定基准的优化偶尔会带来对我们未曾意图优化的众多其他基准任务更强的泛化能力,而这种突现的泛化可能是解锁人工智能下一重大进步的关键。
The ability to evolve new models with new emergent capabilities from a large variety of existing, diverse models has important implications. With the rising costs and resource requirements of training foundation models, large institutions or governments may leverage the rich variety of foundation models in the open-source ecosystem and consider the cheaper evolutionary approach for quickly developing proof-of-concept prototype models, before committing substantial capital or tapping into national resources to develop entirely custom models from scratch, if that is even needed at all.
从大量现有、多样化且具备不同能力的模型中演化出具有新突现能力的新模型,这一能力具有重要意义。随着训练基础模型的成本和资源需求不断上升,通过利用丰富的开源生态系统中多样化的基础模型,大型机构或政府可以考虑采用这种更经济的演化方法快速开发概念验证原型模型,然后再投入大量资金或动用国家资源从头开发完全定制的模型——如果这确实有必要的话。
Further Applications and Impact. After the release of the preprint version of this paper, researchers have explored evolutionary model merging in different domains, highlighting the method’s versatility and effectiveness.
A notable example is EvoSDXL [3], which applied evolutionary model merging to diffusion image generation models.
This demonstrates that our method works well not just for LLMs and VLMs, but for other types of models as well.
Moreover, what makes EvoSDXL particularly interesting is its success in merging SDXL-Lightning [37] with other standard SDXL fine-tunes. SDXL-Lightning is a specialized variant of SDXL that employs an adversarial loss during training, enabling rapid image generation in just a few steps, compared to the 50 or 100 steps typically required by standard diffusion models.
The evolutionary model merging technique effectively combined this unique model with conventional SDXL fine-tunes, despite the different protocols used in their development. This success illustrates that our method is capable of integrating models created through varying protocols, combining their strengths to create more robust and powerful models.
Additionally, other unique models such as EvoVLM-JP-v2 [5] and EvoUkiyoe [4] have also been developed using evolutionary model merging, further demonstrating the method’s potential and adaptability.
Moreover, after the publication of the preprint, evolutionary model merging was implemented in two well-known open-source software packages, MergeKit [22] and Optuna Hub [9]. The technique has thus become widely available, is being used in practice, and further possibilities are being explored.
进一步应用与影响。本文预印版发布后,研究人员已在不同领域探索进化模型融合方法,凸显了该技术的通用性和有效性。一个典型范例是 EvoSDXL[3],它将进化模型融合应用于扩散图像生成模型。这证明我们的方法不仅适用于 LLMs 和视觉语言模型,还能适配其他类型模型。更引人注目的是,EvoSDXL 成功将 SDXL-Lightning[37]与其他标准 SDXL 微调模型相融合。SDXL-Lightning 作为 SDXL 的专用变体,在训练中采用对抗性损失函数,仅需几步即可快速生成图像,而标准扩散模型通常需要 50 至 100 步。进化模型融合技术有效整合了这一特殊模型与传统 SDXL 微调模型,尽管两者的开发协议存在差异。这一成功案例表明,我们的方法能够整合不同协议创建的模型,融合其优势以构建更强大稳健的模型。此外,通过进化模型融合还开发了 EvoVLM-JP-v2[5]、EvoUkiyoe[4]等独特模型,进一步验证了该方法的潜力与适应性。预印版发表后,进化模型融合技术已被集成至 MergeKit[22]和 Optuna Hub[9]两大著名开源软件包,实现广泛普及和实际应用,相关探索仍在持续深化。
Limitations. We acknowledge that although our evolutionary model merging effectively integrates diverse expertise from the source models, it also inherits their limitations. For instance, we encountered instances where the merged models produced responses that lacked logical coherence. Additionally, this study does not encompass instruction fine-tuning or alignment, raising the potential for the models to yield outputs that may be factually flawed.
局限性。我们承认,虽然进化模型融合能有效整合源模型的多样化专长,但也会继承其固有缺陷。例如,我们发现某些融合模型生成的响应存在逻辑连贯性问题。此外,本研究未涉及指令微调或对齐环节,可能导致模型输出存在事实性错误。
Author Contributions. Takuya Akiba initiated the “Evolutionary Optimization of Model Merging Recipes” project, wrote the project design document, and initiated the parameter space model merging experiments, laying the groundwork for the methodology. Makoto Shing expanded the parameter space model merging to encompass vision-language models and diffusion models. Yujin Tang directed the efforts in data flow space model merging by incorporating ideas and methods inspired from the neural architecture search and morphology evolution literature, establishing foundational methods in this domain and in hybrid merging strategies. Qi Sun contributed to the implementation of our parameter space model merging framework and assisted in model evaluation. David Ha provided overarching guidance for the research project, offering technical insight, advice, feedback and writing.
作者贡献。Takuya Akiba 发起"进化优化模型融合配方"项目,撰写项目设计文档并启动参数空间模型融合实验,奠定方法论基础。Makoto Shing 将参数空间模型融合扩展至视觉语言模型与扩散模型领域。Yujin Tang 通过借鉴神经架构搜索和形态进化领域的思路与方法,主导数据流空间模型融合研究,在该领域及混合融合策略中建立基础方法。Qi Sun 参与实现参数空间模型融合框架并协助模型评估工作。David Ha 为研究项目提供全局指导,包括技术见解、建议、反馈和文稿撰写。
Ethical and Societal Impact. Evolutionary model merging offers significant positive societal impacts by enabling the creation of small yet highly capable models at lower costs. This approach democratizes access to advanced AI capabilities, potentially reducing the environmental footprint of AI development and deployment. By efficiently combining existing models, it can lead to more accessible and versatile AI solutions, particularly benefiting regions and languages with limited resources. However, as with other model development techniques, this approach may present certain considerations. The combination of diverse models could potentially lead to unexpected behaviors or biases, and the complexity of merged models might affect their interpretability. While these challenges are common in AI development, they underscore the importance of continued research and evaluation. It’s worth noting that the models and techniques presented in this work serve primarily as a proof of concept. For applications in mission-critical scenarios or models intended for wide public use, further verification and refinement of the methodology may be necessary. This ongoing improvement process is crucial for realizing the full potential of evolutionary model merging while ensuring responsible and ethical deployment.
伦理与社会影响。进化式模型融合通过以更低成本创建小型但高能力模型,为社会带来显著积极影响。该方法 democratizes 先进 AI 技术的获取途径,可能减少 AI 开发与部署的环境足迹。通过高效整合现有模型,可催生更易获取且多功能的 AI 解决方案,尤其惠及资源有限的地区和语言群体。然而与其他模型开发技术类似,该方法也存在需考量的因素:多样化模型的组合可能引发意外行为或偏见,合并模型的复杂性可能影响其可解释性。虽然这些挑战在 AI 发展中普遍存在,但凸显了持续研究与评估的重要性。值得注意的是,本研究提出的模型与技术主要作为概念验证。若应用于关键任务场景或面向公众广泛使用的模型,可能需要进一步验证和方法论优化。这种持续改进过程对于充分释放进化式模型融合潜力、同时确保负责任且符合伦理的部署至关重要。
Data Availability Statement
数据可用性声明
All datasets used in this paper are publicly available. Detailed sources are provided below:
本文使用的所有数据集均为公开资源,具体来源如下:
- LLM Experiments:
  LLM 实验:
  - Data for Optimization: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319.
    优化数据:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319
  - Data for Tests: Available on GitHub at https://github.com/openai/grade-school-math.
    测试数据:存放于 GitHub 仓库 https://github.com/openai/grade-school-math
- VLM Experiments:
  VLM 实验:
  - VLM Benchmark Datasets:
    VLM 基准数据集:
    - JA-VG-VQA-500: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500.
      JA-VG-VQA-500:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500
    - JA-VLM-Bench-In-the-Wild: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild.
      JA-VLM-Bench-In-the-Wild:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild
  - VLM Training Data for Optimization: This subset of the Japanese Visual Genome VQA dataset is also accessible on the JA-VG-VQA-500 page under the train split.
    优化用 VLM 训练数据:该日语 Visual Genome VQA 数据集子集可通过 JA-VG-VQA-500 页面的 train 分割获取
References
- AI [2023] Open AI. 2023. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf
- AI [2024a] Sakana AI. 2024a. EvoSDXL-JP-v1. https://sakana.ai/evosdxl-jp/.
- AI [2024b] Sakana AI. 2024b. EvoUkiyoe. https://sakana.ai/evo-ukiyoe/.
- AI [2024c] Sakana AI. 2024c. EvoVLM-JP-v2. https://sakana.ai/evovlm-jp/.
- AI [2024d] Stability AI. 2024d. Japanese Stable LM Beta. https://ja.stability.ai/blog/japanese-stable-lm-beta.
- AI [2024e] Stability AI. 2024e. JP Language Model Evaluation Harness. https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable.
- Aiello et al. [2023] Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. 2023. Jointly training large autoregressive multimodal models. arXiv preprint arXiv:2309.15564 (2023).
- Akiba et al. [2019] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2623–2631. https://doi.org/10.1145/3292500.3330701
- augmxnt [2023] augmxnt. 2023. shisa-gamma-7b. HuggingFace. https://hf.co/augmxnt/shisa-gamma-7b-v1
- AUTOMATIC1111 [2022] AUTOMATIC1111. 2022. Stable Diffusion WebUI. https://github.com/AUTOMATIC1111/stable-diffusion-webui.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV]
- Chern et al. [2023] Ethan Chern, Haoyang Zou, Xuefeng Li, Jiewen Hu, Kehua Feng, Junlong Li, and Pengfei Liu. 2023. Generative AI for Math: Abel. https://github.com/GAIR-NLP/abel.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. CoRR abs/2110.14168 (2021). arXiv:2110.14168 https://arxiv.org/abs/2110.14168
- Daheim et al. [2024] Nico Daheim, Thomas Möllenhoff, Edoardo Ponti, Iryna Gurevych, and Mohammad Emtiyaz Khan. 2024. Model Merging by Uncertainty-Based Gradient Matching. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=D7KJmfEDQP
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500 [cs.CV]
- Deb et al. [2002] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation 6, 2 (2002), 182–197.
- Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M Roy. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008 (2017).
- Gaier and Ha [2019] Adam Gaier and David Ha. 2019. Weight agnostic neural networks. Advances in neural information processing systems 32 (2019).
- Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.10256836
- Geva et al. [2022] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680 (2022).
- Goddard [2024] Charles O. Goddard. 2024. mergekit. https://github.com/arcee-ai/mergekit
- Ha et al. [2016] David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106 (2016).
- Hansen [2006] Nikolaus Hansen. 2006. The CMA evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms (2006), 75–102.
- Hochreiter and Schmidhuber [1994] Sepp Hochreiter and Jürgen Schmidhuber. 1994. Simplifying neural nets by discovering flat minima. Advances in neural information processing systems 7 (1994).
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Flat minima. Neural computation 9, 1 (1997), 1–42.
- HuggingFace [2023] HuggingFace. 2023. Open LLM Leaderboard. HuggingFace. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- Ilharco et al. [2022] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089 (2022).
- Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
- Joulin et al. [2016b] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016b. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
- Joulin et al. [2016a] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016a. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).
- Kaddour et al. [2022] Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt J Kusner. 2022. When do flat minima optimizers work? Advances in Neural Information Processing Systems 35 (2022), 16577–16595.
- Keskar et al. [2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2017. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations. https://openreview.net/forum?id=H1oyRlYgg
- Labonne [2024a] Maxime Labonne. 2024a. Automerger Experiment. Tweet Thread (2024). https://twitter.com/maximelabonne/status/1767124527551549860
- Labonne [2024b] Maxime Labonne. 2024b. Merge Large Language Models with mergekit. Hugging Face Blog (2024). https://huggingface.co/blog/mlabonne/merge-models
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597 [cs.CV]
- Lin et al. [2024] Shanchuan Lin, Anran Wang, and Xiao Yang. 2024. SDXL-Lightning: Progressive Adversarial Diffusion Distillation. CoRR abs/2402.13929 (2024). https://doi.org/10.48550/ARXIV.2402.13929 arXiv:2402.13929
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 [cs.CV]
- Liu et al. [2024] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
- Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual Instruction Tuning. arXiv:2304.08485 [cs.CV]
- Luo et al. [2023] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. CoRR abs/2308.09583 (2023). https://doi.org/10.48550/ARXIV.2308.09583 arXiv:2308.09583
- Matena and Raffel [2022] Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems 35 (2022), 17703–17716.
- Meng et al. [2022] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.
- nostalgebraist [2021] nostalgebraist. 2021. Interpreting GPT: The Logit Lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-03-08.
- Petzka et al. [2021] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. 2021. Relative Flatness and Generalization. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.). https://openreview.net/forum?id=sygvo7ctb_
- Raffel [2021] Colin Raffel. 2021. A call to build models like we build open-source software. https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html.
- Raffel [2023] Colin Raffel. 2023. Building machine learning models like open source software. Commun. ACM 66, 2 (2023), 38–40.
- Raymond [1999] Eric Raymond. 1999. The cathedral and the bazaar. Knowledge, Technology & Policy 12, 3 (1999), 23–49.
- Real et al. [2019] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, Vol. 33. 4780–4789.
- rinna [2024] rinna. 2024. LM Benchmark. https://rinnakk.github.io/research/benchmarks/lm/index.html.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
- Schmidhuber [1992] Jürgen Schmidhuber. 1992. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4, 1 (1992), 131–139.
- Shi et al. [2023] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=fR3wGCk-IXp
- Shimizu et al. [2018] Nobuyuki Shimizu, Na Rong, and Takashi Miyazaki. 2018. Visual Question Answering Dataset for Bilingual Image Understanding: A Study of Cross-Lingual Transfer Using Attention Maps. In Proceedings of the 27th International Conference on Computational Linguistics (Santa Fe, New Mexico, USA). Association for Computational Linguistics, 1918–1928. http://aclweb.org/anthology/C18-1163
- Shing and Akiba [2023] Makoto Shing and Takuya Akiba. 2023. Japanese Stable VLM. https://huggingface.co/stabilityai/japanese-stable-vlm
- So et al. [2019] David So, Quoc Le, and Chen Liang. 2019. The evolved transformer. In International conference on machine learning. PMLR, 5877–5886.
- Stanley and Miikkulainen [2002] Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary computation 10, 2 (2002), 99–127.
- Sun et al. [2024] Qi Sun, Marc Pickett, Aakash Kumar Nain, and Llion Jones. 2024. Transformer Layers as Painters. arXiv preprint arXiv:2407.09298 (2024).
- Sung et al. [2023] Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. 2023. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933 (2023).
- Tang et al. [2022] Yujin Tang, Yingtao Tian, and David Ha. 2022. EvoJAX: Hardware-Accelerated Neuroevolution. arXiv preprint arXiv:2202.05008 (2022).
- White [2016] Tom White. 2016. Sampling generative networks. arXiv preprint arXiv:1609.04468 (2016).
- Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning. PMLR, 23965–23998.
- Yadav et al. [2023a] Prateek Yadav, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023a. Compeft: Compression for communicating parameter efficient updates via sparsification and quantization. arXiv preprint arXiv:2311.13171 (2023).
- Yadav et al. [2023b] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. 2023b. TIES-Merging: Resolving Interference When Merging Models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (Eds.). http://papers.nips.cc/paper_files/paper/2023/hash/1644c9af28ab7916874f6fd6228a9bcf-Abstract-Conference.html
- Young et al. [2024] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652 (2024).
- Yu et al. [2024] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. arXiv:2311.03099 [cs.CL]
- Zoph and Le [2016] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
SUPPLEMENTARY INFORMATION
补充信息
Appendix A Evaluation Details
附录 A 评估细节
In Table 1, GPT-3.5 is gpt-3.5-turbo-0613, GPT-4 is gpt-4-0125-preview, Llama 2 70B is Llama-2-70b-chat, Japanese Stable LM 70B is japanese-stablelm-instruct-beta-70b and Swallow 70B is Swallow-70b-instruct-hf.
表 1 中,GPT-3.5 指 gpt-3.5-turbo-0613,GPT-4 指 gpt-4-0125-preview,Llama 2 70B 指 Llama-2-70b-chat,Japanese Stable LM 70B 指 japanese-stablelm-instruct-beta-70b,Swallow 70B 指 Swallow-70b-instruct-hf。
For the evaluation using the Japanese LM Evaluation Harness [7], we utilized Stability AI Japan’s fork of lm-eval-harness [20] and configured it according to their convention. This configuration is widely used and compatible with the results on their report [6] and Rinna leaderboards [50], thus allowing direct comparison of scores with a large number of Japanese LLMs.
在日本语言模型评估工具[7]的评测中,我们采用了 Stability AI Japan 分叉的 lm-eval-harness[20]并按其规范进行配置。该配置被广泛使用,且与其报告[6]及 Rinna 排行榜[50]中的结果兼容,从而能够直接与大量日语 LLMs 的评分进行横向对比。
Table 6 is the full version of Table 2, allowing for comparisons between our models and a wider range of models. Some of the numbers are from the Stability AI Japan report and the Rinna leaderboard. Despite having only 7B to 10B parameters, our models achieve higher scores than all existing <70B-parameter models and exhibit scores on par with the SOTA 70B models.
表 6 是表 2 的完整版本,便于将我们的模型与更广泛的模型进行对比。部分数据来自 Stability AI Japan 报告和 Rinna 排行榜。尽管参数量仅为 70 亿至 100 亿,我们的模型得分超越了所有现存<700 亿参数量的模型,并与当前最先进的 700 亿参数量模型表现相当。
Table 6: Breakdown of JP-LMEH scores for Japanese language ability (the full version of Table 2). JP-LMEH (Japanese Language Model Evaluation Harness) is a benchmark suite consisting of nine tasks, and the average score (Avg column) is used as an indicator of overall Japanese language ability.
表 6:日语能力 JP-LMEH 评分细目(表 2 完整版)。JP-LMEH(日语大模型评估套件)由 9 项任务构成基准测试,平均分(Avg 列)作为整体日语能力的衡量指标。
JP Language Model Evaluation Harness 日语大模型评估套件 |
Model | Size 大小 | JComQA | JNLI | MARC | JSQuAD | JAQKET | XLSum | XWino | MGSM | JCoLA | Avg 平均分 |
7B source models: 70 亿参数源模型: | |||||||||||
Mistral-7B-v0.1 | 7B | 74.0 | 44.2 | 63.3 | 85.7 | 69.8 | 22.8 | 69.9 | 20.4 | 51.1 | 55.7 |
shisa-gamma-7b-v1 | 7B | 91.2 | 72.1 | 94.6 | 73.9 | 68.0 | 25.9 | 80.5 | 29.6 | 58.7 | 66.1 |
WizardMath-7B-V1.1 | 7B | 74.7 | 42.7 | 90.4 | 84.6 | 68.5 | 22.3 | 69.8 | 38.8 | 48.9 | 60.1 |
Abel-7B-002 | 7B | 70.3 | 51.8 | 62.3 | 83.8 | 69.0 | 22.5 | 68.2 | 28.0 | 52.7 | 56.5 |
7B merged models: 70 亿参数融合模型: | |||||||||||
Ours (PS) | 7B | 89.1 | 65.7 | 95.4 | 89.5 | 77.7 | 25.5 | 81.2 | 50.0 | 60.5 | 70.5 |
Ours (DFS) 我们的方法(DFS) | 10B | 67.7 | 58.2 | 53.5 | 66.8 | 54.3 | 17.3 | 65.6 | 30.0 | 65.6 | 53.2 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 10B | 88.2 | 50.3 | 91.5 | 78.6 | 77.8 | 23.2 | 73.0 | 40.0 | 73.0 | 66.2 |
Ours (PS-A; see Appendix B) 我们的方法(PS-A;见附录 B) | 7B | 87.0 | 62.3 | 91.9 | 90.4 | 77.4 | 23.0 | 78.9 | 46.4 | 63.6 | 69.0 |
13B source models: 13B 源模型: | |||||||||||
Llama-2-13b-hf | 13B | 78.3 | 43.2 | 50.3 | 89.0 | 75.8 | 25.7 | 63.4 | 10.4 | 50.6 | 54.1 |
MetaMath-13B-V1.0 | 13B | 65.6 | 35.5 | 51.3 | 78.5 | 56.7 | 22.4 | 61.8 | 17.2 | 49.3 | 48.7 |
ELYZA-japanese-Llama-2-13b-instruct | 13B | 84.6 | 58.3 | 83.5 | 83.8 | 63.3 | 25.0 | 74.3 | 18.0 | 50.5 | 60.2 |
13B merged models: 13B 合并模型: | |||||||||||
Ours (PS) 我们的方法(PS) | 13B | 82.7 | 49.8 | 83.1 | 88.3 | 68.0 | 24.4 | 62.3 | 24.8 | 54.3 | 59.7 |
Ours (DFS) 我们的方法(DFS) | 21B | 62.1 | 36.3 | 55.6 | 76.4 | 56.7 | 15.9 | 60.8 | 6.4 | 49.6 | 46.6 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 21B | 82.2 | 48.9 | 88.7 | 86.7 | 68.9 | 18.1 | 62.0 | 30.8 | 57.1 | 60.4 |
Other models: 其他模型: | |||||||||||
Swallow-70b-instruct-hf | 70B | 95.3 | 57.2 | 91.7 | 94.1 | 93.9 | 23.1 | 83.3 | 45.2 | 59.5 | 71.5 |
Swallow-70b-hf | 70B | 94.8 | 55.8 | 87.8 | 93.5 | 93.8 | 23.2 | 83.7 | 47.2 | 60.7 | 71.2 |
japanese-stablelm-base-beta-70b | 70B | 93.7 | 45.2 | 91.0 | 94.2 | 92.8 | 25.4 | 83.6 | 41.2 | 63.8 | 70.1 |
nekomata-14b-instruction | 14B | 93.7 | 57.5 | 90.0 | 93.6 | 89.0 | 25.2 | 77.4 | 35.2 | 67.3 | 69.9 |
japanese-stablelm-instruct-beta-70b | 70B | 91.2 | 50.4 | 92.9 | 87.1 | 88.4 | 24.3 | 82.0 | 37.2 | 61.7 | 68.3 |
nekomata-14b | 14B | 93.1 | 41.7 | 84.1 | 94.2 | 90.8 | 21.3 | 78.6 | 37.6 | 65.0 | 67.4 |
youri-7b-chat | 7B | 91.8 | 70.3 | 96.7 | 79.6 | 83.7 | 24.2 | 80.9 | 25.2 | 53.8 | 67.4 |
Llama-2-70b-hf | 70B | 89.3 | 53.0 | 70.4 | 93.2 | 89.5 | 23.8 | 79.8 | 40.0 | 65.4 | 67.2 |
nekomata-7b-instruction | 7B | 88.0 | 74.7 | 92.3 | 91.3 | 82.5 | 21.1 | 76.8 | 19.2 | 53.8 | 66.6 |
youri-7b-instruction | 7B | 88.8 | 63.6 | 93.8 | 92.2 | 83.9 | 24.7 | 78.9 | 17.2 | 54.0 | 66.3 |
Qwen-14B | 14B | 89.5 | 65.2 | 76.7 | 93.0 | 79.5 | 18.4 | 71.6 | 37.2 | 61.0 | 65.8 |
Swallow-MX-8x7b-NVE-v0.1 | 47B | 92.0 | 52.4 | 59.8 | 93.1 | 91.9 | 26.1 | 72.9 | 44.8 | 59.2 | 65.8 |
youri-7b-chat-gptq | 7B | 89.5 | 68.0 | 96.0 | 88.6 | 79.7 | 15.6 | 79.1 | 21.6 | 53.8 | 65.8 |
nekomata-7b-instruction-gguf | 7B | 87.9 | 68.4 | 89.6 | 89.9 | 79.8 | 20.8 | 77.3 | 19.2 | 55.3 | 65.3 |
nekomata-14b-instruction-gguf | 14B | 93.3 | 57.8 | 51.7 | 91.3 | 86.2 | 23.6 | 77.4 | 35.2 | 68.7 | 65.0 |
Llama-2-70b-chat-hf | 70B | 80.2 | 53.4 | 94.4 | 91.6 | 80.1 | 21.8 | 73.6 | 30.4 | 54.6 | 64.5 |
youri-7b-instruction-gptq | 7B | 88.1 | 62.0 | 93.5 | 90.6 | 74.4 | 15.3 | 78.2 | 15.2 | 53.3 | 63.4 |
japanese-stablelm-base-gamma-7b | 7B | 89.5 | 37.6 | 79.2 | 90.6 | 89.5 | 22.9 | 82.9 | 20.8 | 55.6 | 63.2 |
Swallow-13b-instruct-hf | 13B | 89.0 | 43.1 | 74.1 | 92.8 | 92.5 | 20.6 | 70.4 | 24.0 | 61.9 | 63.2 |
japanese-stablelm-instruct-gamma-7b | 7B | 91.4 | 40.7 | 72.9 | 89.9 | 88.2 | 22.8 | 82.2 | 17.2 | 56.4 | 62.4 |
Swallow-13b-hf | 13B | 86.0 | 44.8 | 74.5 | 92.2 | 92.2 | 21.6 | 71.4 | 19.2 | 59.5 | 62.4 |
nekomata-14b-gguf | 14B | 93.0 | 41.1 | 50.0 | 93.0 | 88.3 | 22.2 | 80.5 | 32.4 | 60.2 | 62.3 |
Swallow-MS-7b-v0.1 | 7B | 88.3 | 35.8 | 83.1 | 90.3 | 89.4 | 20.7 | 74.6 | 22.4 | 55.5 | 62.2 |
Swallow-7b-instruct-hf | 7B | 86.8 | 36.3 | 94.2 | 90.0 | 89.4 | 17.4 | 76.8 | 10.4 | 50.0 | 61.2 |
llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 13B | 83.2 | 74.5 | 74.9 | 94.0 | 85.1 | 9.6 | 71.3 | 3.2 | 50.4 | 60.7 |
Swallow-7b-hf | 7B | 80.2 | 37.9 | 93.1 | 89.2 | 90.1 | 18.3 | 76.6 | 10.4 | 50.0 | 60.7 |
nekomata-7b | 7B | 82.9 | 38.6 | 67.0 | 89.8 | 82.5 | 16.9 | 74.5 | 16.0 | 60.0 | 58.7 |
japanese-stablelm-instruct-beta-7b | 7B | 81.9 | 39.9 | 88.3 | 84.1 | 78.2 | 20.8 | 75.7 | 4.8 | 51.7 | 58.4 |
japanese-stablelm-base-beta-7b | 7B | 76.8 | 38.0 | 86.0 | 87.1 | 79.5 | 21.9 | 76.6 | 7.6 | 49.2 | 58.1 |
Qwen-7B | 7B | 76.0 | 46.9 | 83.3 | 89.7 | 69.2 | 13.8 | 60.6 | 22.0 | 57.3 | 57.6 |
youri-7b-gptq | 7B | 78.9 | 40.5 | 78.2 | 85.7 | 78.1 | 16.3 | 78.3 | 4.8 | 54.0 | 57.2 |
youri-7b | 7B | 82.8 | 35.9 | 90.1 | 57.2 | 82.5 | 20.5 | 78.0 | 6.8 | 58.5 | 56.9 |
ELYZA-japanese-Llama-2-7b-instruct | 7B | 79.2 | 38.8 | 89.9 | 83.2 | 68.7 | 16.8 | 71.1 | 11.2 | 50.5 | 56.6 |
weblab-10b-instruction-sft | 10B | 63.6 | 42.7 | 86.4 | 87.8 | 73.9 | 23.1 | 72.4 | 2.8 | 50.5 | 55.9 |
ELYZA-japanese-Llama-2-7b | 7B | 75.2 | 36.1 | 87.3 | 86.4 | 68.0 | 16.3 | 72.0 | 5.2 | 53.1 | 55.5 |
nekomata-7b-gguf | 7B | 82.3 | 33.7 | 50.0 | 86.8 | 74.9 | 19.9 | 78.3 | 13.6 | 55.8 | 55.0 |
japanese-stablelm-instruct-ja_vocab-beta-7b | 7B | 79.9 | 39.2 | 79.5 | 82.0 | 78.9 | 9.9 | 68.9 | 6.0 | 49.2 | 54.8 |
japanese-stablelm-base-ja_vocab-beta-7b | 7B | 69.9 | 38.8 | 80.0 | 83.8 | 80.1 | 7.5 | 72.5 | 3.6 | 50.0 | 54.0 |
calm2-7b-chat | 7B | 70.3 | 33.7 | 87.7 | 82.6 | 77.8 | 5.4 | 70.9 | 5.6 | 51.9 | 54.0 |
stockmark-13b | 13B | 70.7 | 36.2 | 56.5 | 87.2 | 87.3 | 7.7 | 73.4 | 9.6 | 50.0 | 53.2 |
Llama-2-7b-hf | 7B | 65.9 | 35.1 | 83.2 | 82.7 | 70.6 | 19.0 | 65.2 | 6.8 | 47.1 | 52.9 |
calm2-7b | 7B | 58.2 | 33.3 | 81.2 | 81.8 | 81.9 | 2.4 | 73.0 | 5.6 | 53.3 | 52.3 |
plamo-13b | 13B | 54.5 | 33.8 | 85.0 | 78.6 | 74.5 | 12.9 | 70.5 | 2.4 | 54.5 | 51.9 |
weblab-10b | 10B | 44.0 | 30.7 | 64.0 | 83.3 | 74.6 | 20.3 | 71.4 | 2.0 | 50.8 | 49.0 |
llm-jp-13b-v1.0 | 13B | 39.6 | 35.0 | 67.4 | 79.3 | 78.0 | 10.9 | 69.5 | 2.0 | 54.4 | 48.5 |
plamo-13b-instruct-nc | 13B | 45.8 | 35.3 | 91.6 | 75.2 | 63.5 | 5.8 | 65.2 | 1.2 | 50.6 | 48.3 |
plamo-13b-instruct | 13B | 42.5 | 34.9 | 90.0 | 77.3 | 67.0 | 5.4 | 62.9 | 2.0 | 50.0 | 48.0 |
Appendix B Evolving for License Specific Open-Source Models
附录 B 针对特定开源许可证模型的演化方案
In the main section, our EvoLLM-JP results were evolved using models found on HuggingFace. However, some of the models used, in particular WizardMath-7B-V1.1 [41], have been released under a Non-Commercial, Research-only Microsoft License, which is not truly open-source. Therefore, our release of EvoLLM-JP is also under a Non-Commercial, Research-only License, to be consistent with the WizardMath-7B-V1.1 model.
在主体部分,我们的 EvoLLM-JP 成果是通过 HuggingFace 上的模型演化而来。但其中部分模型(特别是 WizardMath-7B-V1.1[41])采用的是非商业、仅供研究的微软许可证,并非真正的开源协议。因此,为与 WizardMath-7B-V1.1 模型保持一致,我们发布的 EvoLLM-JP 同样采用非商业、仅供研究的许可证。
As researchers who benefited from the open-source community, we would like the models that we release to also be under an open-source license. In the spirit of open-source, and to showcase the applicability of our method to tackle even challenging issues like model licenses, we ran a similar experiment where we incorporated only models that have been released under a true open-source license, such as MIT or Apache 2.0, and produced a similarly performing model called EvoLLM-JP-A, which we will release under Apache 2.0. This model will also be released on our GitHub: https://github.com/SakanaAI/.
作为开源社区的受益者,我们更希望发布的模型采用开源许可证。本着开源精神,并为了证明我们的方法能解决包括模型许可证在内的复杂问题,我们进行了类似实验——仅合并采用 MIT 或 Apache 2.0 等真正开源协议的模型,最终产生了性能相近的 EvoLLM-JP-A 模型,该模型将通过 Apache 2.0 协议发布,相关代码将发布于 GitHub:https://github.com/SakanaAI/。
Specifically, our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel-7B-002, all of which are under MIT or Apache 2.0 License.
The MGSM-JA score measured using the protocol described in Section 3.1 is 52.4, and the Japanese Language Model Evaluation Harness score is 69.0.
We have included results of this Apache 2.0-licensed model for comparison in Table 6, which provides a more comprehensive comparison than Table 2 in the main text.
具体而言,EvoLLM-JP-A 由 shisa-gamma-7b-v1、Arithmo2-Mistral-7B 和 Abel-7B-002 合并而成(三者均采用 MIT 或 Apache 2.0 许可证)。根据 3.1 节所述协议测量,其 MGSM-JA 得分为 52.4,日语模型评估基准得分为 69.0。我们已将这款 Apache 2.0 许可模型的对比结果纳入表 6,该表提供了比正文表 2 更全面的性能比较。
Appendix C Case Study 附录 C 案例研究
Table 7 provides an example of responses to a mathematical question by existing models and our model. By merging a Japanese language model, the resulting model not only improves its capability to understand and use Japanese in reading and writing but also expands its knowledge about Japan. This example requires both mathematical reasoning and Japanese-specific knowledge, specifically that Setsubun is the day before the beginning of spring (Risshun). Notably, only our merged model provides the correct answer. Even when the question is translated into English, the English math model WizardMath-7B-V1.1 fails to answer correctly, demonstrating that our merged model is superior to a combination of machine translation and an English math model.
表 7 展示了现有模型与我们的模型对数学问题的响应示例。通过合并日语模型,我们不仅提升了日文读写能力,还扩展了关于日本的文化知识。该案例需要同时运用数学推理和日本特定知识(需知"节分"是立春前日)。值得注意的是,只有我们的合并模型给出了正确答案。即使将问题翻译成英文,英文数学模型 WizardMath-7B-V1.1 仍无法正确回答,这证明我们的合并模型优于"机器翻译+英文数学模型"的组合方案。
It is worth noting that in Table 7, we also provide the answer from the general-purpose math model WizardMath-7B-V1.1 with the question translated into English. However, even with the translated question, WizardMath-7B-V1.1 fails to provide the correct answer. This highlights the fact that simply translating the question is not sufficient to solve the problem, as it also requires an understanding of the Japanese cultural context. Even if one were to consider a combination of machine translation and an English math model as an alternative approach, it would still fall short in capturing the nuances and context-specific knowledge required to answer the question correctly. This demonstrates the unique value of our merged model, which effectively combines mathematical reasoning capabilities with Japanese language understanding and cultural knowledge.
值得注意的是,在表 7 中我们还提供了通用数学模型 WizardMath-7B-V1.1 对英文翻译版本问题的回答结果。然而即便将问题翻译为英文,WizardMath-7B-V1.1 仍未能给出正确答案。这凸显了一个事实:仅翻译问题并不足以解决问题,因为理解日本文化语境同样至关重要。即便考虑将机器翻译与英文数学模型结合的替代方案,这种方法仍然无法捕捉到正确回答问题所需的细微差别和特定语境知识。这充分证明了我们融合模型的独特价值——它有效地将数学推理能力与日语理解及文化知识相结合。
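To make the expected reasoning explicit, the ground-truth answer of 9 follows directly from the fact that Setsubun is the day before Risshun; a minimal check of the arithmetic:

```python
# Ground-truth arithmetic for the question in Table 7.
# Setsubun always falls on the day before Risshun.
last_year_risshun = (2, 3)   # February 3
this_year_risshun = (2, 4)   # February 4

last_year_setsubun = (last_year_risshun[0], last_year_risshun[1] - 1)  # February 2
this_year_setsubun = (this_year_risshun[0], this_year_risshun[1] - 1)  # February 3

answer = sum(last_year_setsubun) + sum(this_year_setsubun)
print(answer)  # 2 + 2 + 2 + 3 = 9
```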
On the other hand, Table 8 highlights an instance where the LLMs failed to solve a problem correctly, using a question from the MGSM test set. The problem involved calculating the internal volume of boxes. When presented in Japanese, all models encountered difficulties in correctly accounting for the wall thickness. EvoLLM-JP-v1-7B attempted to adjust for it but made an arithmetic error, leading to an incorrect volume of 351 cubic inches. shisa-gamma-7b-v1 ignored the wall thickness entirely, calculating the external dimensions only, and thus arrived at 360 cubic inches. WizardMath-7B-V1.1 recognized the need for an adjustment but applied it incorrectly, resulting in 288 cubic inches. Notably, when the original English problem was presented to WizardMath-7B-V1.1, it produced the correct answer of 72 cubic inches.
This suggests that there may still be some loss in transferring the model’s high mathematical reasoning abilities from English to Japanese.
另一方面,表 8 展示了 LLMs 在使用 MGSM 测试集问题时未能正确解决的案例。该问题涉及计算盒子的内部容积。当日语呈现问题时,所有模型在正确计算箱壁厚度时都遇到了困难。EvoLLM-JP-v1-7B 尝试进行调整但犯下算术错误,导致得出 351 立方英寸的错误结果;shisa-gamma-7b-v1 完全忽略箱壁厚度,仅计算外部尺寸从而得到 360 立方英寸;WizardMath-7B-V1.1 虽然意识到需要调整却应用错误,最终输出 288 立方英寸。尤为值得注意的是,当原始英文问题呈现给 WizardMath-7B-V1.1 时,该模型正确输出了 72 立方英寸的答案。这表明模型从英语到日语转移其高水平数学推理能力时仍可能存在某些损耗。
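For reference, the correct calculation for the Table 8 problem subtracts twice the wall thickness from each outer dimension before multiplying; a minimal check of the ground-truth answer:

```python
# Ground-truth arithmetic for the question in Table 8.
# Each box is 5 x 6 x 4 inches on the outside with 1-inch-thick walls,
# so every inner dimension loses 2 inches (one wall on each side).
outer_dims = (5, 6, 4)
wall_thickness = 1
num_boxes = 3

inner_dims = [d - 2 * wall_thickness for d in outer_dims]      # [3, 4, 2]
inner_volume = inner_dims[0] * inner_dims[1] * inner_dims[2]   # 24 cubic inches
total_inner_volume = num_boxes * inner_volume
print(total_inner_volume)  # 72 cubic inches
```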
Additionally, Table 9 provides the case studies for our VLM. We observe that our VLM successfully obtained the knowledge, cultural understanding, and fluency in Japanese expression of the Japanese LLM. Compared to the Japanese VLM (JSVLM), our VLM generally describes photos more precisely, as in the 2nd example of a deer photo. We also found that the original VLM (LLaVA-1.6-Mistral-7B) hallucinates heavily in Japanese: for example, it answers that the color of the traffic light is green, while our VLM answers correctly.
此外,表 9 提供了我们 VLM 的案例分析。我们观察到,我们的 VLM 成功获得了日语 LLM 的知识、文化理解以及日语表达的流畅性。相较于日本 VLM(JSVLM),我们的 VLM 通常能更精确地描述图像,例如第二个梅花鹿照片案例。我们还发现原始 VLM(LLaVA-1.6-Mistral-7B)在日语环境下存在大量幻觉现象,例如错误回答交通信号灯颜色为绿色,而我们的 VLM 则能正确应答。
The first example features a photo of Koi-nobori, a cultural tradition specific to Japan. LLaVA-1.6-Mistral-7B, lacking sufficient knowledge of Japanese culture, failed to answer Koi-nobori correctly. In contrast, both Japanese-Stable-VLM and EvoVLM-JP, possessing knowledge of Japanese culture, successfully identified the Koi-nobori. Notably, EvoVLM-JP’s response demonstrates greater fluency in its explanation.
第一个案例展示了日本特有的鲤鱼旗文化传统照片。缺乏足够日本文化知识的 LLaVA-1.6-Mistral-7B 未能正确识别鲤鱼旗。相比之下,具备日本文化知识的 Japanese-Stable-VLM 和 EvoVLM-JP 都成功识别出了鲤鱼旗。特别值得注意的是,EvoVLM-JP 的回答在解释时展现出更流畅的语言表达。
The third example involves a question about the color of traffic lights. In Japan, the green traffic light is called "blue" because historically, the word "blue" referred to a range of colors, including blue and green, and also because the initial traffic lights used in Japan had a bluish-green hue, leading to the continuation of this terminology today. EvoVLM-JP is the only model that correctly identifies the traffic light color as "blue" in Japanese, demonstrating its accurate understanding of this cultural and linguistic nuance. LLaVA-1.6-Mistral-7B consistently but incorrectly labels the light as "green" in Japanese, revealing its lack of specific cultural knowledge. Japanese-Stable-VLM shows inconsistent responses, alternating between "green," "green and white," and "white," indicating confusion. These results highlight EvoVLM-JP’s superior grasp of Japanese language conventions and cultural context in this specific scenario.
第三个例子涉及交通灯颜色的问题。在日本,绿色交通灯被称为"蓝色",因为历史上"蓝色"一词涵盖包括蓝色和绿色在内的多种颜色,且日本最初使用的交通灯带有蓝绿色调,这一术语因此延续至今。EvoVLM-JP 是唯一能正确识别日语中交通灯颜色为"蓝色"的模型,展现了其对这种文化和语言细微差异的精准理解。LLaVA-1.6-Mistral-7B 始终错误地将灯光标记为日语中的"绿色",显示出其缺乏特定文化知识。Japanese-Stable-VLM 则表现出不一致的回答,在"绿色"、"绿白色"和"白色"之间摇摆,表明其存在混淆。这些结果凸显了 EvoVLM-JP 在此特定场景下对日本语言习惯和文化背景的卓越把握。
Table 7: Case study of EvoLLM-JP-v1-7B. This study presents responses by LLMs to a question that requires both Japan-specific knowledge and mathematical reasoning. 表 7:EvoLLM-JP-v1-7B 案例研究。本研究展示了 LLMs 对需要日本特定知识和数学推理能力的问题的回答。
Problem 问题 | |
JA Instruction 日语指令 |
去年の立春は2月3日でした。今年の立春は2月4日です。去年の節分の月、去年の節分の日、今年の節分の月、今年の節分の日を足し合わせると、いくつになりますか? |
EN Instruction 英语指令 | Last year, Risshun was on February 3rd. This year, it is on February 4th. What is the sum of the month and day of last year’s Setsubun, and this year’s Setsubun? |
Ground Truth Answer 标准答案 | 9 |
Note 注释 |
Setsubun, the eve of the first day of spring, always falls on the day before Risshun, which marks the start of spring according to the lunar calendar. The date of Risshun can vary annually. Setsubun(立春前夜)总是落在 Risshun(标志着农历春季开始)的前一天。Risshun 的日期每年可能有所不同。 |
Answers 回答 | |
EvoLLM-JP-v1-7B JA Instruction 日语指令 |
去年の立春は2月3日で、今年の立春は2月4日です。
去年の節分は立春の前日、つまり2月2日でした。
今年の節分は立春の前日、つまり2月3日です。
去年の節分の月は2月、去年の節分の日は2日で、合計は2 + 2 = 4です。
今年の節分の月は2月、今年の節分の日は3日で、合計は2 + 3 = 5です。
足し合わせると、4 + 5 = 9になります。 #### 9 The answer is: 9. 答案是:9。 |
shisa-gamma-7b JA Instruction JA 指令 |
{CJK}UTF8ipxmまず、去年の立春の日付を計算してみましょう。 首先,计算去年立春的日期。 去年の立春の日付 = 2021年の2月3日 去年立春的日期 = 2021 年 2 月 3 日 次に、去年の節分の日付を計算してみましょう。 接着,计算去年节分的日期。 去年の節分の日付 = 2021年の2月2日 去年节分的日期 = 2021 年 2 月 2 日 次に、今年の立春の日付を計算してみましょう。 然后,计算今年立春的日期。 今年の立春の日付 = 2022年の2月4日 今年立春的日期 = 2022 年 2 月 4 日 次に、今年の節分の日付を計算してみましょう。 接着,计算今年节分的日期。 今年の節分の日付 = 2022年の2月1日 今年节分的日期 = 2022 年 2 月 1 日 最後に、去年の節分の月と日、今年の節分の月と日を足し合わせてみましょう。 最后,将去年节分的月日与今年节分的月日相加。 去年の節分の月と日 = 2021年の2月2日 今年の節分の月と日 = 2022年の2月1日 去年节分的月日 = 2021 年 2 月 2 日今年节分的月日 = 2022 年 2 月 1 日 これで、足し合わせることができます。 这样,就可以进行相加了。 足し合わせた月と日 = 2021年の2月2日 + 2022年の2月1日 = 2023年の2月3日 相加后的月日 = 2021 年 2 月 2 日 + 2022 年 2 月 1 日 = 2023 年 2 月 3 日 つまり、去年の節分の月と日、今年の節分の月と日を足し合わせると、2023年の2月3日になります。 也就是说,将去年节分的月日与今年节分的月日相加,得到的是 2023 年 2 月 3 日。 |
WizardMath-7B-V1.1 JA Instruction JA 指令 |
{CJK}UTF8ipxmStep 1: Identify the dates for the given events. 第一步:确定给定事件的日期。 - Last year’s Setsubun (February 3) - 去年的节分(2 月 3 日) - This year’s Setsubun (February 4) - 今年的节分(2 月 4 日) Step 2: Determine the dates for the corresponding events in the previous and next years. 第二步:确定前一年和后一年对应事件的日期。 - Last year’s Setsubun (February 3) falls on a Monday. - 去年的节分(2 月 3 日)是星期一。 - This year’s Setsubun (February 4) falls on a Tuesday. - 今年的节分(2 月 4 日)是星期二。 Step 3: Calculate the number of days between the corresponding events. 第三步:计算对应事件之间的天数。 - The number of days between last year’s and this year’s Setsubun is 1 day (Tuesday - Monday = 1 day). - 去年和今年节分之间的天数为 1 天(星期二 - 星期一 = 1 天)。 Step 4: Determine the dates for the corresponding events in the lunar calendar. 第四步:确定农历中对应事件的日期。 - Last year’s Setsubun was on the 1st day of the 2nd month. - 去年的节分是在农历二月初一。 - This year’s Setsubun will be on the 2nd day of the 2nd month. - 今年的节分将在农历二月初二。 Step 5: Calculate the number of days between the corresponding events in the lunar calendar. 第五步:计算农历中对应事件之间的天数。 - The number of days between last year’s and this year’s Setsubun in the lunar calendar is 1 day (2nd day - 1st day = 1 day). - 农历中去年和今年节分之间的天数为 1 天(初二 - 初一 = 1 天)。 Step 6: Add the number of days between the corresponding events in both calendars. 第六步:将两种日历中对应事件之间的天数相加。 - The total number of days between last year’s and this year’s Setsubun is 1 (from the solar calendar) + 1 (from the lunar calendar) = 2 days. - 去年和今年节分之间的总天数为 1(公历)+ 1(农历)= 2 天。 Step 7: Write the final answer. 第七步:写出最终答案。 - The sum of the dates for the corresponding events is 2. 对应事件的日期之和为 2。 The answer is: 2. 答案是:2。 |
WizardMath-7B-V1.1 EN Instruction 英文指令 |
{CJK}UTF8ipxmWe are given that last year, Risshun (the beginning of spring according to the lunisolar Japanese calendar) was on February 3rd, and this year it is on February 4th. We need to find the sum of the month and day of last year’s Setsubun, and this year’s Setsubun. {CJK}UTF8ipxm 已知去年日本的立春(根据阴阳历)是在 2 月 3 日,今年则在 2 月 4 日。我们需要计算去年和今年节分的月份与日期之和。 Step 2: Identify the month and day of last year’s Setsubun. Last year, Risshun was on February 3rd. So, the month is February (2) and the day is 3. 步骤 2:确定去年节分的月份和日期。去年立春是 2 月 3 日,因此月份为 2 月(2),日期为 3。 Step 3: Identify the month and day of this year’s Setsubun. This year, Risshun is on February 4th. So, the month is February (2) and the day is 4. 步骤 3:确定今年节分的月份和日期。今年立春是 2 月 4 日,因此月份为 2 月(2),日期为 4。 Step 4: Calculate the sum of the month and day of last year’s and this year’s Setsubun. The sum of the month and day of last year’s Setsubun is 2 + 3 = 5. The sum of the month and day of this year’s Setsubun is 2 + 4 = 6. 步骤 4:计算去年和今年节分的月份与日期之和。去年节分的和为 2 + 3 = 5,今年节分的和为 2 + 4 = 6。 Step 5: Add the two sums together. The total sum of the month and day of last year’s and this year’s Setsubun is 5 + 6 = 11. 步骤 5:将两个和相加。去年和今年节分的总和为 5 + 6 = 11。 So, the sum of the month and day of last year’s Setsubun, and this year’s Setsubun is 11. 因此,去年和今年节分的月份与日期之和为 11。 The answer is: 11. 答案是:11。 |
Table 8: Case study of EvoLLM-JP-v1-7B. This study presents an example of a Japanese math problem that the LLMs failed to solve. 表 8:EvoLLM-JP-v1-7B 的案例研究。本研究展示了一个 LLMs 无法解决的日本数学问题案例。
Problem 问题 | |
JA Instruction 日文指令 |
ジョンは3箱持っています。それぞれの箱は5インチ×6インチ×4インチです。壁は1インチの暑さです。3箱全部の内容量はどれくらいですか? 约翰有三个盒子,每个盒子的尺寸为 5 英寸×6 英寸×4 英寸,壁厚为 1 英寸。三个盒子的总容量是多少? |
EN Instruction 英文指令 |
John has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the total inner volume of all 3 boxes? 约翰有 3 个箱子。每个箱子的尺寸为 5 英寸×6 英寸×4 英寸。箱壁厚度为 1 英寸。3 个箱子的总内部容积是多少? |
Ground Truth Answer 真实答案 | 72 |
Answers 答案 | |
EvoLLM-JP-v1-7B JA Instruction 日文指令 |
{CJK}UTF8ipxm1箱の内容量は、長さ×幅×高さ = 5インチ×6インチ×4インチ = 120立方インチです。
ジョンは3箱持っているので、3箱の内容量は、3×120立方インチ = 360立方インチです。
しかし、壁の厚みは1インチなので、実際の内容量は、360立方インチ - 3箱×1インチ×3面 = 360立方インチ - 9立方インチ = 351立方インチです。 {CJK}UTF8ipxm 单箱内容积为长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸。由于约翰有 3 个箱子,3 箱总内容积为 3×120 立方英寸=360 立方英寸。但箱壁厚度为 1 英寸,因此实际内容积需扣除箱体部分,即 360 立方英寸-3 箱×1 英寸×3 面=360 立方英寸-9 立方英寸=351 立方英寸。 #### 351 指示: 351 指示:351 |
shisa-gamma-7b JA Instruction 日文指令 |
{CJK}UTF8ipxmジョンは3箱持っています。それぞれの箱は5インチ×6インチ×4インチです。壁は1インチの暑さです。3箱全部の内容量はどれくらいですか? [/INST] 3つの箱の体積の合計を求めるには、それぞれの箱の体積を計算してからそれらを合計する必要があります。 {CJK}UTF8ipxm 约翰有 3 个箱子。每个箱子的尺寸为 5 英寸×6 英寸×4 英寸。箱壁厚度为 1 英寸。3 个箱子的总内容积是多少?[/INST] 计算三个箱子的总体积需要先分别计算每个箱子的体积再进行加总。 箱1の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 1 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 箱2の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 2 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 箱3の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 3 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 これで、3つの箱の体積を合計することができます。 现将三个箱子的体积进行合计 合計体積 = 箱1の体積 + 箱2の体積 + 箱3の体積 = 120立方インチ + 120立方インチ + 120立方インチ = 360立方インチ 总体积=箱 1 体积+箱 2 体积+箱 3 体积=120 立方英寸+120 立方英寸+120 立方英寸=360 立方英寸 つまり、3つの箱の合計体積は360立方インチです。 也就是说,三个箱子的总体积为 360 立方英寸。 |
WizardMath-7B-V1.1 JA Instruction 日语指令 |
{CJK}UTF8ipxmLet’s think step by step. {CJK}UTF8ipxm 让我们一步步思考。 Step 1: Calculate the volume of one box. The volume of a box is calculated by multiplying its length, width, and height. In this case, the dimensions of the box are 5 inches by 6 inches by 4 inches. So, the volume of one box is 5 inches * 6 inches * 4 inches = 120 cubic inches. 第一步:计算一个箱子的体积。箱子的体积通过长、宽、高相乘得出。本例中箱子尺寸为 5 英寸×6 英寸×4 英寸,因此单个箱子体积为 5 英寸×6 英寸×4 英寸=120 立方英寸。 Step 2: Calculate the total volume of all boxes. John has 3 boxes, so the total volume of all boxes is 3 boxes * 120 cubic inches/box = 360 cubic inches. 第二步:计算所有箱子的总体积。约翰有 3 个箱子,因此总体积为 3 箱×120 立方英寸/箱=360 立方英寸。 Step 3: Adjust for the thickness of the walls. The walls of the boxes add an extra inch to each dimension, so we need to subtract the volume of the walls from the total volume of the boxes. The volume of the walls of one box is 1 inch * 6 inches * 4 inches = 24 cubic inches. Since there are 3 boxes, the total volume of the walls is 3 boxes * 24 cubic inches/box = 72 cubic inches. 第三步:调整箱壁厚度的影响。箱壁使每个维度增加 1 英寸,需从总体积中扣除箱壁体积。单个箱子的箱壁体积为 1 英寸×6 英寸×4 英寸=24 立方英寸。3 个箱子的箱壁总体积为 3 箱×24 立方英寸/箱=72 立方英寸。 Step 4: Subtract the volume of the walls from the total volume of the boxes. The adjusted total volume of the boxes is 360 cubic inches - 72 cubic inches = 288 cubic inches. 第四步:从箱子总体积中减去箱壁体积。调整后的箱子总体积为 360 立方英寸-72 立方英寸=288 立方英寸。 So, the total volume of the contents of the 3 boxes is 288 cubic inches. 因此,3 个箱子的内部内容物体积为 288 立方英寸。 The answer is: 288 cubic inches. 答案为:288 立方英寸。 |
WizardMath-7B-V1.1 EN Instruction 英语指令 |
{CJK}UTF8ipxmJohn has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the total inner volume of all 3 boxes? {CJK}UTF8ipxm 约翰有 3 个箱子,每个尺寸为 5 英寸×6 英寸×4 英寸,箱壁厚 1 英寸。问:3 个箱子的内部总容积是多少? ### Response: Let’s think step by step.First, we need to find the inner dimensions of one box. Since the walls are 1 inch thick, we need to subtract 2 inches from each dimension (1 inch from each side of the wall). ### 回答步骤:首先计算单个箱子的内部尺寸。由于箱壁厚 1 英寸,需从每个维度减去 2 英寸(两侧箱壁各 1 英寸)。 The original dimensions of the box are 5 inches by 6 inches by 4 inches. So, the inner dimensions will be: 原始箱子尺寸为 5 英寸×6 英寸×4 英寸,因此内部尺寸为: - Length: 5 inches - 2 inches = 3 inches - Width: 6 inches - 2 inches = 4 inches - Height: 4 inches - 2 inches = 2 inches - 长度:5 英寸-2 英寸=3 英寸 - 宽度:6 英寸-2 英寸=4 英寸 - 高度:4 英寸-2 英寸=2 英寸 Now, we can calculate the inner volume of one box by multiplying the inner dimensions: 现在,我们可以通过将内部尺寸相乘来计算一个盒子的内部容积: Inner Volume = Length × Width × Height Inner Volume = 3 inches × 4 inches × 2 inches Inner Volume = 24 cubic inches 内部容积 = 长度 × 宽度 × 高度 内部容积 = 3 英寸 × 4 英寸 × 2 英寸 内部容积 = 24 立方英寸 Since John has 3 boxes, we need to find the total inner volume of all 3 boxes: 由于 John 有 3 个盒子,我们需要计算这 3 个盒子的总内部容积: Total Inner Volume = Inner Volume of one box × Number of boxes Total Inner Volume = 24 cubic inches × 3 Total Inner Volume = 72 cubic inches 总内部容积 = 单个盒子的内部容积 × 盒子数量 总内部容积 = 24 立方英寸 × 3 总内部容积 = 72 立方英寸 So, the total inner volume of all 3 boxes is 72 cubic inches. 因此,所有 3 个盒子的总内部容积为 72 立方英寸。 The answer is: 72. 答案是:72。 |
Table 9: Case study of EvoVLM-JP. This study presents responses by VLMs to questions that require Japan-specific knowledge and visual question answering. 表 9:EvoVLM-JP 的案例研究。本研究展示了视觉语言模型在需要日本特定知识和视觉问答能力的问题上的回答。
Example 1 示例 1 | |
[Image: photo of Koi-nobori carp streamers] |
|
User 用户 |
この画像に映っている魚の形をした飾りは、日本では何と呼ばれていますか?
(What is the fish-shaped decoration in this image called in Japan?) (这张图片中鱼形装饰在日本被称为什么?) |
LLaVA-1.6-Mistral-7B |
この画像に映っている魚の形をした飾りは、日本では「花魚」と呼ばれています。
(The fish-shaped decoration in this image is called "hana-gyo" in Japan.) (这张图片中的鱼形装饰在日本被称为“花鱼”。) |
Japanese Stable VLM | こいのぼり (Koi-nobori) |
EvoVLM-JP |
この画像に映っている魚の形をした飾りは、日本では「鯉のぼり」と呼ばれています。 这张图片中鱼形装饰物在日本被称为"鲤帜"。 (The fish-shaped decoration in this image is called "Koi-nobori" in Japan.) |
Example 2 示例 2 | |
[Image: photo of a small deer standing on fallen autumn leaves] |
|
User 用户 |
写真を説明してください。 请描述这张照片。 (Please explain this photo.) |
LLaVA-1.6-Mistral-7B |
{CJK}UTF8ipxmこの写真では、自然の中で小さな動物が歩いています。その動物は、犬のような外貌を持っていますが、実際には犬ではありません。この動物は、野生動物として知られていますが、具体的な種類はわかりません。 背景には、秋の季節になっていることがわかります。多くの樹木には、黄色い葉が落ちています。このような環境で、動物が歩いていることを見ることができます。 この写真では、動物が自然の中で活動していることを表現しています。同時に、秋の季節になっていることを伝えることもできます。 {CJK}UTF8ipxm 照片中一只小型动物正在自然环境中行走。该动物外形似犬,但实际并非家犬。作为野生动物其具体种类难以辨别。背景可见时值秋季,众多树木黄叶飘落。在此秋景中可见动物漫步的身影。这张照片既展现了野生动物在自然中的活动状态,也传递出浓浓的秋日气息。 (In this photo, a small animal is walking in nature. The animal has a canine-like appearance but is not actually a dog. This creature is known as a wild animal, but the specific species is unknown. In the background, it is evident that it is autumn. Many trees have yellow leaves falling. In such an environment, the animal can be seen walking. The photo depicts the animal being active in nature while also conveying that it is autumn.) |
Japanese Stable VLM | 紅葉と鹿の写真です。 这是红叶与鹿的照片。 (It’s a photo of autumn leaves and deer.) |
EvoVLM-JP |
{CJK}UTF8ipxmこの写真は、一匹の小さな鹿が落ち葉の上に立っている様子を捉えています。鹿は警戒しているようで、周囲を注意深く見回しています。背景には紅葉した木々が見え、秋の雰囲気が感じられます。 {CJK}UTF8ipxm 照片捕捉到一只小鹿站在落叶上的画面。鹿显得十分警觉,正在谨慎地环视四周。背景中可见红叶满枝的树木,秋意盎然。 (This photo captures a small deer standing on fallen leaves. The deer appears to be alert, carefully scanning its surroundings. In the background, you can see trees with autumn foliage, creating an atmosphere of autumn.) |
Example 3 示例 3 | |
[Image: photo of a traffic light] |
|
User 用户 |
この信号機の色は何色ですか?
(What color is this traffic light?) (这个交通灯是什么颜色的?) |
LLaVA-1.6-Mistral-7B |
(A) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。)
(B) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。)
(C) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。) |
Japanese Stable VLM |
(A) 緑色です。 (It is green. 是绿色。)
(B) 緑と白です。 (It is green and white. 是绿色和白色。)
(C) 白色 (White. 白色) |
EvoVLM-JP |
(A) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。)
(B) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。)
(C) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。) |
Example 4 示例 4 | |
[Image: photo of the Atomic Bomb Dome in Hiroshima] |
|
User 用户 |
この建物について教えてください。 请告诉我关于这座建筑的信息。 (Please explain this building.) |
LLaVA-1.6-Mistral-7B |
{CJK}UTF8ipxmこの建物は、日本で位置しています。具体的には、東京都渋谷区にあります。この建物は、1945年8月15日に、第二次世界大戦中に日本軍によって攻撃され、完全に破壊されました。その後、1964年に再構築されました。 この建物は、「平和の塔」(Peace Tower)と呼ばれています。その名前は、この建物が再構築される際に決められた名前です。この建物は、平和を祈るために作られたものであり、国内外から多くの人々が訪れる場所です。 {CJK}UTF8ipxm 该建筑位于日本,具体位置在东京都涩谷区。1945 年 8 月 15 日第二次世界大战期间,该建筑遭到日军攻击完全损毁,后于 1964 年重建。这座建筑被称为"和平塔"(Peace Tower),其名称是在重建时确定的。该建筑是为祈愿和平而建,现已成为国内外众多访客前来瞻仰的场所。 (This building is located in Japan, specifically in Shibuya Ward, Tokyo. It was completely destroyed by the Japanese military during World War II on August 15, 1945. Subsequently, it was reconstructed in 1964. This building is called the "Peace Tower," a name decided upon during its reconstruction. It was created as a symbol of peace, and it is a place visited by many people from both within and outside the country to pray for peace.) (该建筑位于日本东京涩谷区,1945 年 8 月 15 日二战期间被日军彻底摧毁,1964 年重建后命名为"和平塔",作为和平象征吸引着国内外众多访客前来祈祷。) |
Japanese Stable VLM | 原爆ドーム 原爆圆顶馆 (Atomic Bomb Dome) |
EvoVLM-JP |
{CJK}UTF8ipxmこの建物は、日本の広島市にある「原爆ドーム」です。 1945年8月6日、広島市に原子爆弾が投下されました。 この原子爆弾は、広島市の中心部にある「広島県産業奨励館」に直撃しました。 この奨励館は、現在は「原爆ドーム」と呼ばれています。 この原爆ドームは、原子爆弾の被害を受けた建物の中で、唯一残っている建物です。 この原爆ドームは、1996年に世界遺産に登録されました。 {CJK}UTF8ipxm 这是位于日本广岛市的"原爆圆顶馆"。1945 年 8 月 6 日广岛遭遇原子弹轰炸时,这颗原子弹直接击中了位于市中心的"广岛县产业奖励馆"。该建筑现称"原爆圆顶馆",是原子弹爆炸中唯一存留的建筑遗址,1996 年被列为世界文化遗产。 (This building is the "Atomic Bomb Dome" located in Hiroshima City, Japan. On August 6, 1945, an atomic bomb was dropped on Hiroshima City. The bomb directly hit the "Hiroshima Prefectural Industrial Promotion Hall," located in the central part of Hiroshima City. This promotion hall is now known as the "Atomic Bomb Dome." It is the only building that remains standing among those damaged by the atomic bomb. The Atomic Bomb Dome was registered as a UNESCO World Heritage Site in 1996.) 这座建筑是位于日本广岛市的"原爆圆顶"。1945 年 8 月 6 日,广岛市遭受原子弹轰炸。炸弹直接命中了位于广岛市中心部位的"广岛县产业奖励馆",这座奖励馆如今被称为"原爆圆顶",是原子弹爆炸后受损建筑中唯一屹立不倒的遗存。1996 年,原爆圆顶被联合国教科文组织列入世界遗产名录。 |