Evolutionary Optimization of Model Merging Recipes
《模型融合配方的进化优化》
Abstract 摘要
Large language models (LLMs) have become increasingly capable, but their development often requires substantial computational resources.
While model merging has emerged as a cost-effective and promising approach for creating new models by combining existing ones, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
大型语言模型(LLMs)的能力日益强大,但其开发通常需要大量计算资源。尽管模型融合作为一种高性价比且有前景的方法,可通过组合现有模型来创建新模型,但目前仍依赖人类直觉和领域知识,限制了其潜力。本文提出的进化方法通过自动发现多样化开源模型的有效组合,克服了这一限制,无需大量额外训练数据或算力即可利用其集体智能。我们的方法在参数空间和数据流空间同时运作,实现了超越单一模型权重层面的优化。该方法甚至支持跨领域融合,例如生成具备数学推理能力的日语 LLM。令人惊讶的是,我们的日语数学 LLM 在多个权威日语 LLM 基准测试中实现了最先进性能,即使面对参数量显著更大的模型也保持优势,尽管并未针对这些任务进行专门训练。此外,通过本方法生成的文化感知型日语视觉语言模型(VLM)在描述日本文化特定内容时表现出色,超越了此前所有日语 VLM。这项工作不仅为开源社区贡献了新的最先进模型,更开创了自动化模型组合的新范式,为基础模型开发探索替代性高效路径铺平了道路。
1 Introduction 1 引言
Model merging [22, 35], a recent development in the large language model (LLM) community, presents a novel paradigm shift. By strategically combining multiple LLMs into a single architecture, this exciting development has captured the attention of researchers due to its key advantage: it requires no additional training, making it an incredibly cost-effective approach for developing new models. This accessibility has fueled a surge in interest and experimentation with model merging. The Open LLM Leaderboard [27] is now dominated by merged models, showcasing its potential for democratizing foundation model development.
模型融合[22,35]作为大型语言模型(LLM)领域的最新进展,展现了一种范式转变。通过将多个 LLM 策略性地整合为单一架构,这一突破性进展因其核心优势——无需额外训练即可开发新模型的超高性价比——引起了研究者的广泛关注。这种易用性推动了模型融合研究的爆发式增长。Open LLM 排行榜[27]目前已被融合模型主导,充分证明其在 democratizing 基础模型开发方面的潜力。
However, model merging is considered by many to be a form of black art or alchemy, relying on the model maker’s intuition and instincts about model selection and merging recipes to create and refine a new model that performs well for a particular task. Furthermore, the model maker is often required to have some domain knowledge for the various different benchmark tasks.
Given the large diversity of open models and benchmarks in the community, human intuition can only go so far, and we believe a more systematic approach for discovering new model combinations will take things much further.
然而,模型合并被许多人视为一种"黑魔法"或炼金术,它依赖于模型制作者对模型选择和合并配方的直觉与本能,以创建并优化出在特定任务中表现优异的新模型。此外,模型制作者通常还需要具备各种不同基准任务的相关领域知识。鉴于社区中开源模型和基准测试的多样性,人类直觉的作用有限,我们相信采用更系统化的方法来发现新的模型组合将取得更大进展。
We believe evolutionary algorithms will be able to discover more effective model merging solutions, and thus provide a path for automating the creation of more capable models.
As a step towards this direction, in this work, we show that evolution can be employed to discover novel and unintuitive ways to merge multiple models to produce new models with a new combined ability.
In this work, we present a methodology that leverages evolutionary algorithms to facilitate the merging of foundation models. Our approach is distinguished by its ability to navigate both parameter space (weights) and the data flow space (inference path), proposing a framework that integrates these two dimensions.
我们相信进化算法能够发现更有效的模型合并方案,从而为自动化创建更强大模型提供路径。作为朝此方向迈出的一步,本研究证明进化算法可用于发现新颖且反直觉的模型合并方法,将多个不同模型合并为具有全新综合能力的新模型。我们提出了一种利用进化算法促进基础模型合并的方法论,其独特之处在于能同时探索参数空间(权重)与数据流空间(推理路径),提出了整合这两个维度的框架。
This work makes several key contributions to the field of foundation model development:
本研究为基础模型开发领域做出以下关键贡献:
1. Automated Model Composition: We introduce Evolutionary Model Merge, a general evolutionary method to automatically discover effective combinations of selected open-source models for creating new foundation models with user-specified capabilities. This approach harnesses the collective intelligence of existing open models, enabling the creation of powerful models without the need for extensive training data or compute.
1. 自动化模型组合:我们提出"进化模型合并"方法,这种通用进化算法能自动发现选定开源模型的有效组合,以创建具有用户指定能力的新基础模型。该方法利用现有开源模型的集体智慧,无需大量训练数据或算力即可创建强大模型。
2. Cross-Domain Merging: We demonstrate that our method can discover novel ways to merge models from disparate domains (e.g., non-English language and Math, non-English language and Vision), potentially exceeding the capabilities achievable through conventional human design strategies.
2. 跨领域合并:我们证明该方法能发现合并不同领域模型(如非英语语言与数学、非英语语言与视觉)的新颖方式,其能力可能超越传统人工设计策略。
3. State-of-the-Art Performance: We showcase the effectiveness of our method by automatically generating a Japanese LLM with Math reasoning capability and a Japanese Vision-Language Model (VLM). Notably, both models achieve state-of-the-art performance on various benchmarks, even without explicit optimization for those tasks.
3. 顶尖性能表现:通过自动生成具有数学推理能力的日语 LLM 和日语视觉语言模型(VLM),我们验证了方法的有效性。值得注意的是,这两个模型在多项基准测试中均达到顶尖水平,且未针对这些任务进行显式优化。
4. High Efficiency and Surprising Generalizability: We observe that our 7B parameter LLM surpasses the performance of some previous 70B parameter Japanese LLMs on benchmark datasets, highlighting the high efficiency and surprising generalization capability of our approach. We believe this model can serve as a strong general-purpose Japanese LLM.
4. 高效性与惊人泛化能力:我们观察到 7B 参数的 LLM 在基准数据集上超越此前部分 70B 参数日语 LLM 的表现,凸显了该方法的高效性和惊人泛化能力。我们相信该模型可成为强大的通用日语 LLM。
5. Culturally-Aware VLM: The generated Japanese VLM achieves top results when tested on a domestically-sourced dataset of Japanese image-description pairs, demonstrating its ability to handle Japanese culture-specific content.
5. 文化感知型 VLM:生成的日语 VLM 在日本本土采集的图文配对数据集测试中表现优异,彰显其处理日本文化特定内容的能力。
We are committed to open science and are excited to open-source our EvoLLM-JP and EvoVLM-JP, two state-of-the-art Japanese foundation models, to the community, enabling further research and development in the field. Our work challenges the conventional paradigm of expensive model development by demonstrating that our evolutionary-based method can produce competitive models without relying on gradient-based training. This paves the way for exploring alternative, potentially more efficient, approaches to foundation model development.
我们秉持开放科学精神,欣然向社区开源 EvoLLM-JP 和 EvoVLM-JP 这两个顶尖日语基础模型,推动该领域的持续研发。本研究通过证明基于进化的方法无需依赖梯度训练即可产出具有竞争力的模型,挑战了传统高成本模型开发范式,为探索更高效的基础模型开发方法开辟了新路径。
1.1 Background and Related Work
1.1 研究背景与相关工作
1.1.1 Overview of Model Merging.
1.1.1 模型融合概述
Model merging offers a novel approach to leverage the strengths of multiple pre-trained models. It allows us to combine task-specific models, each potentially fine-tuned for a particular downstream task, into a single unified model. This technique stands in contrast to traditional transfer learning, where a pre-trained model is further fine-tuned for a new task. While transfer learning offers advantages like improved performance and faster convergence, the resulting models are typically limited to single tasks. Model merging, on the other hand, strives to create a versatile and comprehensive model by combining the knowledge from multiple pre-trained models, potentially yielding a model capable of handling various tasks simultaneously.
模型融合开创性地整合多个预训练模型的优势,能将针对不同下游任务微调的专项模型统一为单一复合模型。与传统迁移学习(即对预训练模型进行新任务微调)相比,迁移学习虽具性能提升和快速收敛等优势,但所得模型通常局限于单一任务。模型融合则通过整合多个预训练模型的知识,致力于构建多功能综合模型,有望实现多任务并行处理能力。
While fine-tuning an existing model involves training an existing model further on a specialized dataset, model merging methods work by combining existing models without any additional gradient-based training, making it very cost-effective in terms of compute requirements. Many in the machine learning community, including researchers, hackers, and hobbyists alike, are using both fine-tuning and model-merging to create the best models for their purposes. In fact, this collaborative way of developing continually improved models has parallels with the traditional open-source software development approach [48] where the trend is for the community to develop foundational machine learning models like open source software [46, 47].
现有模型微调需在专业数据集上继续训练,而模型融合方法无需额外梯度训练即可组合现有模型,显著降低算力成本。机器学习社区的研究者、开发者和爱好者正同时采用微调与融合技术来打造最佳定制模型。这种持续优化模型的协作模式与传统的开源软件开发模式[48]异曲同工,当前趋势正是由社区像开发开源软件那样共同构建基础机器学习模型[46,47]。
A simple method of merging multiple models is to average the weights of multiple models fine-tuned from the same base initial model. This model soup approach [62] demonstrated significant improvements on relatively large image processing and image classification models. The work pointed to similarity between weighted model averaging leading to a flat local minima. Theoretical [18, 32, 45, 33] and empirical studies dating back from the 1990s [25, 26] demonstrated that flatter local optima generalize better to out-of-distribution shifts. More recent work [42, 15] examined the theoretical properties and practical implications of weight averaging applied to larger generative models.
合并多个模型的一种简单方法是对同一基础初始模型微调得到的多个模型的权重进行平均。这种模型汤(model soup)方法[62]在相对较大的图像处理和图像分类模型上展现出显著改进。该研究指出,权重平均带来的相似性会导致平坦的局部极小值。早至 20 世纪 90 年代的理论研究[18,32,45,33]和实证研究[25,26]表明,更平坦的局部最优解对分布偏移具有更好的泛化能力。近期工作[42,15]则探讨了权重平均应用于更大规模生成模型的理论特性和实际意义。
Linear weight averaging is performed as follows. Let $\theta_1, \theta_2 \in \mathbb{R}^d$ represent the weight vectors of two distinct models, where $d$ is the dimension of the weight space. The merged model's weights, denoted as $\theta_{\mathrm{merged}}$, are computed using the following equation: $\theta_{\mathrm{merged}} = \alpha \theta_1 + (1 - \alpha)\theta_2$.
Here, $\alpha \in [0, 1]$ is a weighting parameter that determines the relative contribution of each model to the final merged model. By adjusting $\alpha$, practitioners can control the balance between the two original models' characteristics in the resulting merged model.
线性权重平均按如下方式执行:设 $\theta_1, \theta_2 \in \mathbb{R}^d$ 表示两个不同模型的权重向量,其中 $d$ 为权重空间的维度。合并后的模型权重记为 $\theta_{\mathrm{merged}}$,通过以下公式计算:$\theta_{\mathrm{merged}} = \alpha \theta_1 + (1 - \alpha)\theta_2$。此处 $\alpha \in [0, 1]$ 是权重参数,用于确定每个模型对最终合并模型的相对贡献。通过调整 $\alpha$,实践者可以控制两个原始模型特性在合并结果中的平衡比例。
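To make the recipe concrete, below is a minimal sketch of linear weight averaging over two PyTorch state dicts. The helper name and the use of plain state dicts are illustrative choices of ours, not part of the original works; it assumes the two models share an identical architecture and key set.

```python
import torch

def linear_merge(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two models' weights: theta = alpha * theta_a + (1 - alpha) * theta_b."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]  # assumes identical architectures / key sets
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged

# Usage sketch: merged = linear_merge(model_a.state_dict(), model_b.state_dict(), alpha=0.3)
```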
Linear weight averaging not only works well for image processing and classification models, but is also effective for image generation models such as latent diffusion models. The release of Stable Diffusion [51] led to a variety of specialist fine-tunes of the model for a variety of different styles, from photorealism, to anime, to realistic painting styles. Popular open-source WebUIs [11] enabled practitioners to experiment with merging different models using linear or spherical linear interpolation (SLERP) [61] of weight parameters of distinct fine-tuned Stable Diffusion models. This has led to the discovery of merged models that combine strengths of specialized models into a single model.
For some time, the most popular Stable Diffusion models were neither the original base models nor the fine-tuned versions, but rather the merged models created by enthusiasts.
线性权重平均不仅适用于图像处理和分类模型,对潜在扩散模型等图像生成模型同样有效。Stable Diffusion[51]的发布催生了针对不同风格(从照片写实到动漫再到写实绘画风格)的多种专项微调模型。流行的开源 WebUI[11]使得从业者能够通过对不同微调 Stable Diffusion 模型的权重参数进行线性或球面线性插值(SLERP)[61]来尝试模型合并。这催生了许多将专项模型优势整合于一体的合并模型。有段时间最受欢迎的 Stable Diffusion 模型既非原始基础模型也非微调版本,而是爱好者创建的合并模型。
1.1.2 Merging Language Models
1.1.2 语言模型合并
Simple weight interpolation works well for merging image generation models that operate in a real-valued latent and output space. To our surprise, it also works for merging language models under the right conditions, though with performance issues [64]. Recently, new methods have been proposed to address merging language models specifically. One method, Task Arithmetic [28], involves building task vectors by subtracting pre-trained from fine-tuned model weights, enabling manipulation through arithmetic operations to steer the merged model's behavior.
虽然简单权重插值法适用于操作实数潜在空间和输出空间的图像生成模型合并,但令人惊讶的是,在适当条件下该方法仍可用于语言模型合并,不过存在性能问题[64]。近期已有专门针对语言模型合并的新方法提出。其中任务算术(Task Arithmetic)[28]通过从微调模型权重中减去预训练权重来构建任务向量,从而通过算术运算操控合并模型的行为。
Task Arithmetic operates as follows.
Let $\theta_{\mathrm{base}}$ denote the weights of a pre-trained base model.
For each task $t$, let $\theta_t$ represent the weights of a model fine-tuned for that specific task.
The task vector $\tau_t$ for each task $t$ is defined as the difference between the fine-tuned model weights and the base model weights: $\tau_t = \theta_t - \theta_{\mathrm{base}}$.
The weights of the new, merged model $\theta_{\mathrm{merged}}$ are then computed by adding a weighted sum of these task vectors to the base model weights: $\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \lambda \sum_{t} \tau_t$.
Here, $\lambda$ is a scaling parameter that determines the contribution of each task vector to the final merged model.
任务算术的工作原理如下。设 $\theta_{\mathrm{base}}$ 表示预训练基础模型的权重。对于每个任务 $t$,令 $\theta_t$ 代表针对该任务微调后的模型权重。每个任务 $t$ 的任务向量 $\tau_t$ 定义为微调模型权重与基础模型权重之差:$\tau_t = \theta_t - \theta_{\mathrm{base}}$。新合并模型 $\theta_{\mathrm{merged}}$ 的权重则通过将这些任务向量的加权和加到基础模型权重上来计算:$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \lambda \sum_{t} \tau_t$。此处 $\lambda$ 是缩放参数,用于决定各任务向量对最终合并模型的贡献程度。
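As an illustration, a minimal sketch of Task Arithmetic over state dicts follows; the helper names are hypothetical and a single scaling factor λ is shared across tasks, matching the formula above.

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    """tau_t = theta_t - theta_base, computed per parameter tensor."""
    return {k: finetuned[k] - base[k] for k in base}

def task_arithmetic_merge(base: dict, task_vectors: list, lam: float = 1.0) -> dict:
    """theta_merged = theta_base + lambda * sum_t tau_t."""
    merged = {k: v.clone() for k, v in base.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged
```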
A key problem of weight interpolation-based methods is that they neglect parameter interference, leading to performance degradation. A recent work [64] identifies two key sources of interference: redundant parameter values and conflicting parameter signs across models, and proposes a method that addresses these issues to achieve improved merging performance. The proposed TIES-Merging method addresses information loss in existing merging methods by incorporating three steps: resetting minimal parameter changes, resolving sign conflicts, and merging only aligned parameters. The trimming of redundant parameter changes in TIES-Merging not only resulted in higher performance of merged models, but also achieved better compression for the fine-tuned parameter updates. This compression of fine-tuned weight parameter updates is further investigated in ComPEFT [63].
基于权重插值方法的一个关键问题是忽视了参数干扰,导致性能下降。近期研究[64]指出了两种主要干扰源:模型间冗余的参数值和相互冲突的参数符号,并提出了一种解决这些问题以提升合并性能的方法。所提出的 TIES-Merging 方法通过三个步骤解决现有合并方法中的信息丢失问题:重置最小参数变化、解决符号冲突以及仅合并对齐参数。TIES-Merging 中对冗余参数变化的修剪不仅提高了合并模型的性能,还实现了对微调参数更新更好的压缩效果。这种对微调权重参数更新的压缩在 ComPEFT[63]中得到了进一步研究。
The three steps of TIES-Merging are implemented as follows. In the first step, minimal parameter changes are reset by retaining only the most influential components of each task vector and setting the rest to zero. This effectively removes redundant or insignificant modifications. The second step resolves sign conflicts across models by determining an aggregate sign for each parameter based on the overall direction of significant changes across all tasks. This ensures that conflicting updates do not cancel each other out. In the final step, only the aligned parameters are merged by selectively averaging the values of parameters that agree with the elected sign from the previous step, while excluding those that disagree. This selective merging process ensures that the final model incorporates only consistent and significant parameter changes from each task-specific model, thereby preserving task-relevant information while minimizing interference.
TIES-Merging 的三个步骤实现如下:第一步通过仅保留各任务向量 中最具影响力的组件并将其余部分归零,重置最小参数变更,有效剔除了冗余或无关紧要的修改。第二步通过综合所有任务中显著变化的整体方向,为每个参数确定聚合符号,从而解决模型间的符号冲突,确保相互冲突的更新不会彼此抵消。最后一步仅合并对齐参数,选择性平均那些与上一步选定符号相符的参数值,同时排除不一致的参数。这种选择性合并流程确保最终模型仅整合来自各任务专用模型的一致且重要的参数变更,从而在最小化干扰的同时保留任务相关信息。
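A simplified sketch of the three TIES-Merging steps (trim, elect sign, merge aligned parameters) on flattened task vectors is shown below. The top-k keep ratio and helper name are illustrative assumptions rather than the reference implementation.

```python
import torch

def ties_merge(task_vectors: list[torch.Tensor], keep_ratio: float = 0.2) -> torch.Tensor:
    """task_vectors: list of flattened tau_t tensors of equal shape; returns the merged task vector."""
    trimmed = []
    for tau in task_vectors:
        k = max(1, int(keep_ratio * tau.numel()))
        threshold = tau.abs().flatten().topk(k).values.min()
        # Step 1: trim -- keep only the most influential entries, zero out the rest.
        trimmed.append(torch.where(tau.abs() >= threshold, tau, torch.zeros_like(tau)))

    stacked = torch.stack(trimmed)
    # Step 2: elect an aggregate sign per parameter from the overall direction of the changes.
    elected_sign = torch.sign(stacked.sum(dim=0))

    # Step 3: merge only the entries that agree with the elected sign, averaging over agreeing models.
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged_tau = (stacked * agree).sum(dim=0) / counts
    return merged_tau  # added back to theta_base, optionally scaled by lambda
```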
Another recent work [66] proposes the DARE method, which goes further by zeroing out small differences between the fine-tuned model and the original base model, while amplifying the remaining differences. In practice, DARE [66] is often used together with Task Arithmetic [28] or TIES-Merging [64].
Specifically, DARE operates as follows. It applies a random mask $m \sim \mathrm{Bernoulli}(p)$ to the task vector $\tau_t$, where $p$ is the drop rate, and then rescales the remaining parameters:
另一项近期研究[66]提出了 DARE 方法,该方法通过将微调模型与原始基础模型之间的微小差异归零并放大其余差异,实现了更进一步的优化。实践中,DARE[66]常与任务算术[28]或 TIES-Merging[64]结合使用。具体而言,DARE 的操作流程如下:对任务向量 $\tau_t$ 施加随机掩码 $m \sim \mathrm{Bernoulli}(p)$(其中 $p$ 为丢弃率),然后对剩余参数进行重新缩放:
$\hat{\tau}_t = \dfrac{(1 - m) \odot \tau_t}{1 - p}$ (1)
The resulting sparsified task vector $\hat{\tau}_t$ is then used to merge models. This task vector can be utilized in the aforementioned linear task arithmetic (referred to as DARE-Linear) or TIES (referred to as DARE-TIES) methods.
得到的稀疏化任务向量 $\hat{\tau}_t$ 随后被用于模型合并。该任务向量可用于前文所述的线性任务算术(称为 DARE-Linear)或 TIES 方法(称为 DARE-TIES)。
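The drop-and-rescale step of DARE can be sketched as follows; this is a minimal illustration operating on a single flattened task vector, not the library implementation.

```python
import torch

def dare_sparsify(tau: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Randomly zero out a fraction p of the task vector and rescale the survivors by 1 / (1 - p)."""
    keep = torch.bernoulli(torch.full_like(tau, 1.0 - drop_rate))  # 1 = keep (prob 1 - p), 0 = drop
    return tau * keep / (1.0 - drop_rate)

# The sparsified vector can then be fed into Task Arithmetic (DARE-Linear) or TIES (DARE-TIES).
```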
Model merging is gaining momentum in ML research, and has been applied in areas such as multimodal model development [8, 59] and merging experts [8]. It is also becoming popular in both the broader image generation and LLM communities, where due to the low barrier to entry, many researchers and engineers alike are continually merging existing models to create new customized models.
模型合并在机器学习研究中势头正盛,已被应用于多模态模型开发[8,59]和专家合并[8]等领域。在图像生成和 LLM 社区中,由于入门门槛较低,越来越多的研究人员和工程师不断通过合并现有模型来创建新的定制化模型。
In the image generation domain, model merging techniques became widely accessible to practitioners especially after their implementation in open-source toolkits [11].
Similarly, language model merging really gained popularity after the implementation of mergekit [22, 35]. This toolkit provides all of the popular recipes for merging language models. In addition to simple linear and spherical interpolation, advanced recipes such as Task Arithmetic, TIES-Merging, and DARE are also implemented, offering users the ability to experiment with combining these methods to work with fine-tunes of popular base models such as Mistral [29]. With this toolkit, a large number of capable merged models have been developed by the community, and the top of the Open LLM Leaderboard [27] is gradually dominated by merged models produced by the community of language model enthusiasts.
在图像生成领域,模型合并技术尤其在开源工具包实现后,对从业者变得广泛可用[11]。类似地,语言模型合并在 mergekit 工具包[22,35]实现后才真正流行起来。该工具包提供了所有流行的语言模型合并配方。除了简单的线性和球面插值外,还实现了任务算术、TIES 合并、DARE 等高级配方,使用户能够尝试组合这些方法来处理流行基础模型(如 Mistral[29])的微调版本。借助该工具包,社区开发了大量性能优异的合并模型,Open LLM 排行榜[27]上的顶尖模型逐渐被语言模型爱好者社区生产的合并模型所主导。
Mergekit also introduced an additional method, called Frankenmerging, which is not based on weight merging, but instead lets the user experiment with stacking different layers from multiple models to sequentially create a new model. This method has the advantage of not tying the user to one particular family of models with a fixed architecture for merging (e.g. Mistral-based models), but potentially creating new architectures from merging entirely different models. It is noted that discovering new Frankenmerging techniques remains a challenge for the community, and requires much more trial and error to discover new recipes. To date, almost everyone uses a similar Frankenmerging merge recipe, and there has been very little trial and error to improve upon it. The field remains highly under-explored, and we believe this is where evolution can help.
Mergekit 还引入了一种称为 Frankenmerging 的额外方法,该方法不基于权重合并,而是让用户尝试堆叠来自多个模型的不同层来顺序创建新模型。这种方法的优势在于不将用户束缚于具有固定架构的特定模型系列(例如基于 Mistral 的模型),而可能通过合并完全不同的模型来创建新架构。值得注意的是,发现新的 Frankenmerging 技术对社区来说仍然是一个挑战,需要更多的试错来发现该技术的新配方。迄今为止,几乎每个人都使用类似的 Frankenmerging 合并配方,而对其进行改进的尝试非常有限。该领域仍处于高度未开发状态,我们相信这正是进化算法可以发挥作用的地方。
1.1.3 Connection to Evolutionary Neural Architecture Search
1.1.3 与进化神经架构搜索的关联
Model merging holds significant potential and democratizes the model-building process, making it accessible to a broad range of participants. However, it heavily relies on intuition and domain knowledge. Human intuition, however, has its limits. With the growing diversity of open models and tasks, we need a more systematic approach. We believe evolutionary algorithms, inspired by natural selection, can unlock more effective merging solutions. These algorithms can explore a vast space of possibilities, discovering novel and counter-intuitive combinations that traditional methods and human intuition might miss.
模型合并具有显著潜力,能够民主化模型构建过程,使其对广泛参与者开放。然而,该方法高度依赖直觉和领域知识,而人类直觉存在固有局限。随着开源模型和任务的多样性日益增长,我们需要更系统化的方法。受自然选择启发的进化算法,我们相信能够发掘更有效的合并方案。这类算法可探索广阔的潜在空间,发现传统方法和人类直觉可能忽略的新颖且反直觉的组合。
In deep learning, techniques such as Neural Architecture Search (NAS) [67] employed evolutionary techniques to discover new architectures [49, 56] that might be non-intuitive for human designers to discover. However, NAS-methods consumed significant computational resources given how each candidate model architecture needed to be trained. Our approach aims to save compute resources by discovering structures that take advantage of existing capabilities of existing Transformer blocks that had already consumed considerable resources to train in the first place.
在深度学习领域,神经架构搜索(NAS)[67]等技术曾运用进化方法发现人类设计者难以凭直觉构想的新架构[49, 56]。但由于每个候选模型架构都需要训练,NAS 方法消耗了大量计算资源。我们的方法通过利用已消耗大量资源预训练的 Transformer 模块的现有能力来发现结构,旨在节省计算资源。
Pre-trained transformer blocks are akin to neural network components one can choose to mix and match using an evolutionary architecture search algorithm. However, unlike NAS, we do not need to train the model, and can evaluate the candidates right away. In a sense, our work resonates with the goals of earlier works of morphology search such as NEAT [57] and more recent work such as Weight Agnostic Neural Networks [19] which evolved neural network structures with certain task-specific inductive biases, without the need to train the weight parameters using gradient descent. Similarly, this was how NAS methods such as SMASH [57] avoided costly inner-loop training through the application of a Hypernetwork [23] to estimate the weights of architectural candidates.
预训练的 Transformer 模块类似于可通过进化架构搜索算法混合搭配的神经网络组件。但与 NAS 不同,我们无需训练模型即可立即评估候选方案。从某种意义上,我们的工作与早期形态搜索研究(如 NEAT[57])和近期成果(如权重无关神经网络[19])目标一致——后者通过进化具有特定任务归纳偏好的神经网络结构,无需梯度下降训练权重参数。类似地,这也是 SMASH[57]等 NAS 方法通过超网络[23]估算架构候选权重来避免昂贵内循环训练的原理。
Recent work in language models tried to duplicate the layers of a pre-trained Transformer many times [59] as a starting point to pre-train a much larger network, to enable efficient training of much larger language models [65]. Similarly, in our approach, we are interested in applying automated search methods to discover the effective reuse of existing pre-trained layers.
语言模型的最新研究尝试通过多次复制预训练 Transformer 的层结构[59]作为预训练更大规模网络的起点,以实现高效训练超大规模语言模型[65]。类似地,我们的方法旨在应用自动化搜索技术,探索现有预训练层的高效复用方案。
In this work, we apply evolution to not only automate the optimization of model merging recipes in weight space, confined to a single architecture, but also explore applying evolution to optimize the stacking of layers from different models—a more involved approach that has the potential to create entirely novel neural architectures from existing building blocks. Unlike weight merging optimization which confine space of models to be merged together to fine-tuned descendents of the same parent base model, layer stacking optimization has no such constraints. Our work systematically explores the application of evolution applied in both parameter and layer space.
本研究不仅将进化算法应用于自动化优化单一架构内权重空间的模型融合配方,还探索了利用进化算法优化跨模型层堆叠——这种更复杂的方法有望用现有模块构建出全新的神经网络架构。与权重融合优化(要求合并模型必须源自同一基础模型的微调变体)不同,层堆叠优化不受此类限制。我们的工作系统性地探索了进化算法在参数空间和层空间的双重应用。
2 Methods 2 方法论
Our goal is to create a unified framework capable of automatically generating a merged model from a selection of foundation models, ensuring that the performance of this merged model surpasses that of any individual in the collection. Central to our approach is the application of evolutionary algorithms, which we employ to refine the intricacies involved in model merging. To systematically address this challenge, we first dissect the merging process into two distinct, orthogonal configuration spaces, analyzing their individual impacts. Building on this analysis, we then introduce a cohesive framework that seamlessly integrates these spaces. Figure 1 provides a schematic representation of our approach.
我们的目标是建立一个统一框架,能够从基础模型集合中自动生成融合模型,并确保其性能超越集合中的任何单个模型。该方法的核心理念是应用进化算法来优化模型融合的复杂过程。为系统解决这一挑战,我们首先将融合过程分解为两个正交的配置空间并分析其独立影响,继而提出整合这两个空间的协同框架。图 1 展示了我们方法的示意图解。

Figure 1: Overview of Evolutionary Model Merge. Our approach encompasses (1) evolving the weights for mixing parameters at each layer in the parameter space (PS); (2) evolving layer permutations in the data flow space (DFS); and (3) an integration of both strategies for merging in both PS and DFS. Note that merging in the PS is not simply copying and stitching individual layer parameters, but mixes the weights; this merging is analogous to blending colors as illustrated here (e.g., red and blue becoming purple). The questions are translated into English for the reader's convenience; the models operate on Japanese text.
图 1:进化式模型合并概览。我们的方法包含(1)在参数空间(PS)中进化各层混合参数的权重;(2)在数据流空间(DFS)中进化层排列顺序;(3)结合两种策略在 PS 与 DFS 中同时进行合并。需注意 PS 中的合并并非简单复制拼接层参数,而是对权重进行混合——这种合并类似于此处演示的颜色混合(如红色与蓝色混合为紫色)。请注意我们已将问题翻译为英文以便读者理解,模型实际处理的是日语文本。
2.1 Merging in the Parameter Space
2.1 参数空间中的模型融合
Model merging in the parameter space (PS) aims to integrate the weights of multiple foundational models into a unified entity with the same neural network architecture, yet outperforming the individual models. While various strategies for combining model parameters exist [42, 62], our approach leverages task vectors analysis to understand each model’s strengths, based on the specific tasks they are optimized for or excel in [28]. Specifically, we enhance TIES-Merging with DARE [64, 66], allowing for more granular, layer-wise merging (In this paper, by “layer” we mean the input/output embedding layers or a transformer block). We establish merging configuration parameters for sparsification and weight mixing at each layer, including input and output embeddings. These configurations are then optimized using an evolutionary algorithm, such as CMA-ES [24], for selected tasks, guided by critical task-specific metrics (e.g., accuracy for MGSM, ROUGE score for VQA).
参数空间(PS)中的模型融合旨在将多个基础模型的权重整合为具有相同神经网络架构的统一实体,同时性能超越单个模型。尽管存在多种模型参数组合策略[42, 62],我们的方法基于任务向量分析来理解每个模型在其优化或擅长的特定任务中的优势[28]。具体而言,我们通过 DARE 方法[64, 66]增强 TIES-Merging 技术,实现更细粒度的分层融合(本文中"层"指输入/输出嵌入层或 Transformer 模块)。我们为每层(包括输入输出嵌入层)建立稀疏化与权重混合的融合配置参数,并采用 CMA-ES 等进化算法[24]针对选定任务优化这些配置,优化过程由关键任务指标引导(如 MGSM 任务采用准确率,VQA 任务采用 ROUGE 分数)。
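Conceptually, the PS search can be pictured as the loop below: an evolutionary optimizer proposes per-layer DARE-TIES configuration parameters (e.g., densities and merge weights), a candidate merged model is built, and the task metric is returned as fitness. Everything here, including the use of the `cma` package and the `build_merged_model` / `evaluate_on_task` helpers, is an illustrative assumption rather than our exact pipeline.

```python
import numpy as np
import cma  # pycma, assumed here as the CMA-ES implementation

NUM_LAYERS = 32          # e.g., transformer blocks (plus embeddings) per source model
NUM_MODELS = 3           # source models to merge
PARAMS_PER_LAYER = 2     # e.g., a DARE density and a TIES merge weight per model and layer

def fitness(genome: np.ndarray) -> float:
    # Interpret the flat genome as a per-layer, per-model merging configuration in [0, 1].
    config = genome.reshape(NUM_LAYERS, NUM_MODELS, PARAMS_PER_LAYER)
    merged = build_merged_model(source_models, config)  # hypothetical helper: apply DARE-TIES per layer
    return -evaluate_on_task(merged, search_set)        # hypothetical helper; negated because CMA-ES minimizes

dim = NUM_LAYERS * NUM_MODELS * PARAMS_PER_LAYER
es = cma.CMAEvolutionStrategy(dim * [0.5], 1.0 / 6.0)   # start at 0.5 with sigma = 1/6
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [fitness(np.asarray(c)) for c in candidates])
best_config = np.asarray(es.result.xbest).reshape(NUM_LAYERS, NUM_MODELS, PARAMS_PER_LAYER)
```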
2.2 Merging in the Data Flow Space
2.2 数据流空间融合
Recent analysis and discoveries imply that knowledge is stored distributedly in language models [21, 44, 43], suggesting simple yet novel model merging possibilities in the data flow space (DFS). Unlike merging in PS, model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the $i$-th layer in model $A$, a token may be directed to the $j$-th layer in model $B$.
最新分析与发现表明,知识分布式存储在语言模型中[21, 44, 43],这为数据流空间(DFS)中实现简单而新颖的模型融合提供了可能。与 PS 融合不同,DFS 中的模型融合保持各层原始权重不变,转而优化 token 在神经网络中流动时的推理路径。例如,模型 $A$ 的第 $i$ 层处理后,token 可能被导向模型 $B$ 的第 $j$ 层。
In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of $N$ models and a budget $T$, our method searches for a sequence of layer indices $L_{i,j}^{(t)}$ that delineates the path all the tokens should follow for a specific task. Here $L_{i,j}$ denotes the $j$-th layer in the $i$-th model, with $t \in [1, T]$ marking the step in the inference path.
在此领域的初步探索中,我们仅研究串行连接与非自适应配置,将更灵活的模型融合方案留待未来工作。具体而言,给定 $N$ 个模型和预算 $T$,我们的方法搜索能描述特定任务中所有 token 应遵循路径的层索引序列 $L_{i,j}^{(t)}$。其中 $L_{i,j}$ 表示第 $i$ 个模型的第 $j$ 层,$t \in [1, T]$ 标记推理路径的步骤。
One can quickly imagine how large the search space is. Assuming the total number of layers across all models is $M$, the size of the search space is $(M+1)^T$; here the extra one indicates the inclusion of a pass-through layer. Even with a modest setting of $M = 64$ (e.g., 2 models of 32 layers each) and $T = 64$, this setting translates to an astronomically large search space, a challenge even for a capable evolutionary search algorithm. Luckily, our preliminary studies indicated that certain layer arrangements, particularly repetitive or permuted sequences from earlier in the model, can adversely affect performance. Based on this, we modify our settings to include an indicator array $I$ of size $T = M \times r$ in the evolutionary search space, here $r$ is the number of repetitions.
搜索空间的庞大程度可见一斑。假设所有模型总层数为 $M$,搜索空间规模达 $(M+1)^T$(额外加 1 表示包含直通层)。即使采用保守设置 $M = 64$(如 2 个 32 层模型)和 $T = 64$,其对应的搜索空间仍呈天文数字,这对强大的进化搜索算法也是挑战。值得庆幸的是,初步研究表明某些层排列(特别是模型中早期重复或置换的序列)会损害性能。据此,我们在进化搜索空间中引入大小为 $T = M \times r$ 的指示数组 $I$($r$ 为重复次数)来优化设置。
Conceptually, we lay out all the layers in sequential order (i.e., all layers in the $i$-th model followed by those in the $(i+1)$-th model) and repeat them $r$ times; the indicator array then manages the inclusion/exclusion of layers. If $I_i > 0$ we include the layer corresponding to index $i$ in the slots in the merged model, otherwise we exclude it. Consequently, our search space is reduced to $2^T$, which is still large, but tractable for evolutionary search.
从概念上讲,我们将所有层按顺序排列(即先排列第 $i$ 个模型的所有层,再排列第 $i+1$ 个模型的层),并重复 $r$ 次,指示数组则控制各层的包含/排除状态。若 $I_i > 0$,我们会在合并模型的对应槽位中包含索引 $i$ 指向的层,否则将其排除。由此,搜索空间被缩减至 $2^T$,虽然规模仍然庞大,但已处于进化算法可处理范围内。
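To illustrate the reduced search space, the sketch below decodes an indicator array $I$ of length $T = M \times r$ into an inference path over the concatenated, $r$-times-repeated layer list; the function name and the real-valued genome encoding are illustrative assumptions.

```python
import numpy as np

def decode_path(indicator: np.ndarray, layers_per_model: list[int], r: int) -> list[tuple[int, int]]:
    """Map an indicator array of size T = M * r to a sequence of (model_idx, layer_idx) hops."""
    # Lay out all layers sequentially: model 0's layers, then model 1's, ..., repeated r times.
    slots = [(m, l) for m, n in enumerate(layers_per_model) for l in range(n)] * r
    assert len(indicator) == len(slots)
    # Include slot i only when its indicator entry is positive; excluded slots are skipped (pass-through).
    return [slot for slot, gene in zip(slots, indicator) if gene > 0]

# Example: two 32-layer models repeated r = 3 times gives T = 192 inclusion decisions (2^192 paths),
# versus (M + 1)^T for freely choosing any of the M layers (or a pass-through) at every step.
path = decode_path(np.random.randn(192), [32, 32], r=3)
```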
In our search, we only optimize the data inference path inside the merged model and keep parameters in the models intact. In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies [58] show that swapping a pair of neighboring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically we find that appropriately scaling an input that wishes to go from layer $i$ to $j$ by $W_{ij}$ helps alleviate the problem. Here, $W \in \mathbb{R}^{M \times M}$ is a matrix that is also optimized by the evolutionary search together with the indicator array $I$.
在搜索过程中,我们仅优化合并模型内部的数据推理路径,保持各模型参数不变。在此设定下,某一层可能接收到与原始模型训练分布不同的输入,从而导致异常输出。例如,我们前期研究[58]表明:交换语言模型中相邻两层的顺序会导致性能下降。尽管需要更多理论研究来建模这种分布偏移,但实证发现通过 $W_{ij}$ 对从第 $i$ 层流向第 $j$ 层的输入进行适当缩放可缓解该问题。其中缩放矩阵 $W \in \mathbb{R}^{M \times M}$ 将与指示数组 $I$ 一同通过进化算法优化。
The size of $W$ grows quadratically with $M$, for scenarios involving a large number of layers. An alternative approach to contain the search space size involves parameterizing $W$ with a neural network [52, 23]. We can instead evolve a feed-forward network to output the scaling weights conditioned on the layer and step indices: $W_{ij} = \pi_{\theta}(i, j, t)$, where $\theta$ denotes the parameters to be evolved, whose size does not change when $M$ grows.
当涉及大量层级时,$W$ 的规模会随 $M$ 呈二次方增长。为控制搜索空间规模,可采用神经网络参数化方法[52,23]处理 $W$。具体而言,可进化一个前馈网络来生成基于层索引和步进索引的条件缩放权重:$W_{ij} = \pi_{\theta}(i, j, t)$,其中待进化参数 $\theta$ 的规模不会随 $M$ 增长而变化。
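A minimal sketch of the two parameterizations of the scaling weights follows: a dense matrix $W$ (quadratic in $M$) versus a small feed-forward network $\pi_{\theta}(i, j, t)$ whose parameter count stays fixed as $M$ grows. The network architecture and normalization of the indices are assumptions for illustration.

```python
import torch
import torch.nn as nn

M = 64  # total number of layers across the source models

# Option 1: a dense matrix of scaling factors, one per (source layer i, target layer j) pair.
# Its size grows as O(M^2), so it is practical to evolve directly only for small M.
W = torch.ones(M, M)

# Option 2: a small feed-forward network pi_theta(i, j, t) with a fixed number of parameters.
class ScalerNet(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, i: int, j: int, t: int) -> torch.Tensor:
        # Normalize the layer indices so the inputs stay in a small range.
        x = torch.tensor([[i / M, j / M, float(t)]])
        return self.net(x).squeeze()

scaler = ScalerNet()
w_ij = scaler(i=3, j=40, t=1)  # scaling applied to hidden states hopping from layer 3 to layer 40
```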
2.3 Merging in Both Spaces
2.3 双空间合并
Model merging in the PS and in the DFS are orthogonal approaches, however, it is straightforward to combine these disentangled methods and further boost the performance of a merged model. As we show in the rightmost illustration in Figure 1 and in Section 3.1, it is possible to first apply PS merging to a collection of models, and then put back this merged model in the collection and apply DFS merging from this enlarged collection.
在参数空间(PS)和数据流空间(DFS)进行模型合并虽是正交方法,但将这两种解耦方法相结合可进一步提升合并模型的性能。如图 1 最右侧示意图及第 3.1 节所示,我们可以先对模型集合应用 PS 合并,再将合并后的模型放回原集合,对这个扩充后的集合实施 DFS 合并。
This can be extremely helpful when one considers model merging with multiple objectives, wherein PS merging can be applied first to produce several merged models, each of which targets one of the multiple objectives of interest, and then DFS merging is applied with multi-objective genetic algorithms such as NSGA-II [17] to further expand the final model's performance in relevant metrics.
该方法对多目标模型合并尤为有效:首先通过 PS 合并生成多个分别针对不同目标任务的合并模型,继而采用 NSGA-II[17]等多目标遗传算法进行 DFS 合并,从而在相关指标上全面提升最终模型的性能。
3 Results 3 实验结果
Most merged models in the community optimize for a narrow set of tasks defined in The Open LLM Leaderboard [27]. Our motivation is for evolutionary search to discover novel ways to merge different models from vastly different domains (e.g., non-English language and Math, or non-English language and Vision) which might be difficult for human experts to discover effective merging solutions themselves. Furthermore, effectively merging models from very different domains can lead to models of wider real-world applicability and enable us to develop models beyond the large population of models that are optimized for the narrow range of tasks defined by a leaderboard.
当前社区中大多数合并模型仅针对《Open LLM Leaderboard》[27]定义的有限任务集进行优化。我们希望通过进化搜索发现跨领域模型(如非英语语言与数学,或非英语语言与视觉领域)的创新合并方案——这些方案往往超出人类专家手动设计的有效性边界。更重要的是,有效合并差异显著的领域模型能增强实际应用泛化能力,推动模型开发突破排行榜定义的任务局限。
We demonstrate our Evolutionary Model Merge approach described in Section 2 by evolving a Japanese LLM capable of Math reasoning, and a Japanese VLM proficient in handling culturally-specific content.
Specifically, in Section 3.1, we apply evolution to merge a Japanese LLM with an English Math LLM to build a Japanese Math LLM, and in Section 3.3, we apply evolution to merge a Japanese LLM with an English VLM to create a Japanese VLM.
我们通过两个案例验证第 2 章所述的进化模型合并方法:构建具备数学推理能力的日语 LLM,以及擅长处理文化特定内容的日语视觉语言模型(VLM)。具体而言,在第 3.1 节中进化合并日语 LLM 与英语数学 LLM 生成日语数学 LLM;在第 3.3 节中进化合并日语 LLM 与英语 VLM 构建日语 VLM。
3.1 Evolving Japanese Math LLM
3.1 日语数学 LLM 的进化
3.1.1 Experimental Setup 3.1.1 实验设置
Source Models 源模型
To develop a model capable of solving math problems in Japanese, we apply evolutionary model merge on a set of source models containing a Japanese LLM and Math LLMs: shisa-gamma-7b-v1 [10] (Japanese LLM), WizardMath-7B-V1.1 [41] and Abel-7B-002 [13]. All these models are fine-tuned from Mistral-7B-v0.1 [29].
为开发能够解决日语数学问题的模型,我们在包含日语 LLM 和数学 LLM 的一组源模型上应用了进化模型融合:shisa-gamma-7b-v1 [10](日语 LLM)、WizardMath-7B-V1.1 [41] 和 Abel-7B-002 [13]。这些模型均基于 Mistral-7B-v0.1 [29]进行微调。
Dataset 数据集
For testing, we used the MGSM dataset [53], a multilingual translation of a subset of the GSM8k dataset [14]. The Japanese test set of MGSM, consisting of 250 samples, was used for the final evaluation. Specifically, MGSM contains translations of the first 250 samples (ID 0-249) from the GSM8k test set. We used a different dataset for evolutionary search to avoid overfitting the test set. Specifically, we translated the remaining 1069 samples (out of 1319 examples) of the GSM8k test set that were not included in the MGSM test set into Japanese. These correspond to samples with IDs 250-1318 in the original GSM8k test set, ensuring no overlap with the MGSM Japanese test set. One may think that it is more natural to translate the GSM8k training set. However, in our preliminary efforts, this approach did not work well. Since open-source math models were trained on the GSM8k training set, we were unable to perform accurate evaluations.
测试阶段采用 MGSM 数据集[53](GSM8k 数据集[14]的多语言翻译子集),其中包含 250 个样本的日语测试集用于最终评估。具体而言,MGSM 包含 GSM8k 测试集前 250 个样本(ID 0-249)的翻译版本。为避免过拟合测试集,我们使用不同数据集进行进化搜索:将 GSM8k 测试集中未纳入 MGSM 测试集的剩余 1069 个样本(共 1319 例)翻译为日语。这些样本对应原始 GSM8k 测试集中 ID 250-1318 的样本,确保与 MGSM 日语测试集无重叠。虽然翻译 GSM8k 训练集看似更合理,但初步尝试表明该方法效果不佳。由于开源数学模型均在 GSM8k 训练集上训练,我们无法进行准确评估。
Evaluation 评估
We evaluated the ability to generate Japanese answers to Japanese math problems.
Therefore, we considered an answer correct if it met the following criteria: (1) the concluding numerical value must be correct, and (2) the reasoning text should be written in Japanese.
我们评估了模型生成日语数学问题日语解答的能力。因此,只要满足以下标准即判定答案正确:(1) 最终数值必须正确;(2) 推理解释文本需使用日语书写。
We treated the last numerical value appearing in the output as the answer. We needed to adopt this heuristic because we are merging multiple models that were trained in different formats, which made it difficult to correct the output format. This method appeared to extract the answers correctly in almost all cases. Additionally, to determine the language of the output, we utilized fasttext [31, 30]. We used greedy sampling for generation and calculated the zero-shot pass@1 accuracy.
我们将输出中最后出现的数值视为答案。由于合并了多种训练格式不同的模型,难以统一校正输出格式,故采用此启发式方法。该方案在绝大多数情况下能正确提取答案。此外,我们使用 fasttext[31,30]检测输出文本语言,采用贪婪采样生成答案,并计算零样本 pass@1 准确率。
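Our scoring heuristic can be sketched as follows; the regex and the fastText model filename are illustrative (the off-the-shelf lid.176 language-identification model is one commonly used option), and the prompting/generation code is omitted.

```python
import re
import fasttext  # pip install fasttext

lang_id = fasttext.load_model("lid.176.bin")  # assumed path to an off-the-shelf language-ID model

def last_number(text: str) -> str | None:
    """Treat the last numerical value appearing in the generation as the model's answer."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def is_correct(generation: str, gold: str) -> bool:
    pred = last_number(generation)
    labels, _ = lang_id.predict(generation.replace("\n", " "))
    in_japanese = labels[0] == "__label__ja"  # criterion 2: reasoning text written in Japanese
    return pred is not None and float(pred) == float(gold) and in_japanese  # criterion 1: value matches
```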
Optimization 优化方法
For optimization in PS, we used the CMA-ES [24] algorithm implemented in Optuna [9] with default hyper-parameters. Specifically, we set all initial parameter values to 0.5, sigma to 1/6, and the population size to $4 + \lfloor 3 \ln(N_{\mathrm{param}}) \rfloor$, where $N_{\mathrm{param}}$ is the number of parameters to optimize.
The fitness value is defined as the accuracy for all 1069 training samples. Please note that this set is disjoint from MGSM’s test set.
The optimization was conducted for 1000 trials, and the best trial with respect to the training accuracy was chosen as the final model.
We decided to employ TIES-Merging [64] with DARE [66] through preliminary experiments, and optimized its parameters.
参数空间优化采用 Optuna[9]实现的 CMA-ES[24]算法,保留默认超参数。具体设置:所有参数初始值为 0.5,sigma=1/6,种群规模为 $4 + \lfloor 3 \ln(N_{\mathrm{param}}) \rfloor$($N_{\mathrm{param}}$ 表示待优化参数数量)。适应度定义为 1069 个训练样本的整体准确率(注:该数据集与 MGSM 测试集无交集)。经过 1000 次试验后,选择训练准确率最高的方案作为最终模型。根据前期实验,我们确定采用结合 DARE[66]的 TIES-Merging[64]方法并优化其参数。
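The PS optimization loop in Optuna can be sketched roughly as below; the sampler arguments mirror the settings described above (initial values 0.5, sigma 1/6, CMA-ES default population size), while `build_and_score` is a hypothetical stand-in for merging with the sampled DARE-TIES parameters and computing training-set accuracy.

```python
import optuna

N_PARAMS = 6  # e.g., two DARE-TIES parameters (density, weight) per source model

def objective(trial: optuna.Trial) -> float:
    params = [trial.suggest_float(f"x{i}", 0.0, 1.0) for i in range(N_PARAMS)]
    return build_and_score(params)  # hypothetical: merge with these parameters, return accuracy

sampler = optuna.samplers.CmaEsSampler(
    x0={f"x{i}": 0.5 for i in range(N_PARAMS)},  # start all parameters at 0.5
    sigma0=1.0 / 6.0,
)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=1000)
best = study.best_trial  # the configuration used for the final merged model
```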
In our DFS merging experiments, $M = 64$, $r = 3$, and consequently, $T = M \times r = 192$.
We kept the last 200 examples in the training data as our validation set and optimize on the rest of the data with a batch size of 200. We report the performance of the snapshot that achieved the highest accuracy in the validation set, and the test set is strictly isolated from the optimization process.
We adopted CMA-ES in EvoJAX [60]; it optimized $I$ and $W$ for a total of 100 generations with a population size of 128, and we used the default hyper-parameters.
We limited our DFS merging to two models $A$ and $B$ to ensure that the final model remains modest in size and can be run on a single GPU, but in principle, the methodology can scale to merging multiple models.
During the merging, model $A$'s tokenizer and input/output embeddings are utilized. Furthermore, to maintain compatibility with the embedding layers, we mandate that the initial and final transformer layers of model $A$ define the start and the end of the inference path. We initialized the indicator array $I$ so that all layers in model $A$ are more likely to be included as initial hops in the inference path to shorten the search time.
DFS 合并实验中,$M = 64$,$r = 3$,因而 $T = M \times r = 192$。我们将训练数据最后 200 例作为验证集,其余数据以 200 为批次进行优化,最终报告验证集准确率最高的快照性能(测试集严格隔离于优化过程)。采用 EvoJAX[60]的 CMA-ES 算法,默认超参数下优化 $I$ 和 $W$ 共 100 代(种群规模 128)。为保证最终模型体积适中(可单 GPU 运行),DFS 合并仅针对 $A$ 和 $B$ 两个模型,但该方法理论上支持多模型合并。合并过程中使用模型 $A$ 的分词器及输入/输出嵌入层,并要求模型 $A$ 的初始与最终变换层分别定义推理路径的起点和终点以保持嵌入层兼容性。初始化指示数组 $I$ 时,优先包含模型 $A$ 的层作为推理路径初始跳转以缩短搜索时间。
3.1.2 Experimental Results
3.1.2 实验结果
Table 1 summarizes the performance of the LLMs on Japanese Math and overall Japanese benchmark tasks. The MGSM-JA column reports the results from the MGSM test set, employing the previously described metrics. The Japanese language model (Model 1) demonstrates limited mathematical proficiency, while the Math models (Models 2 and 3), though mathematically adept, show insufficient command of the Japanese language. Consequently, all three models achieve low scores on MGSM-JA, with accuracy at or below 30.0.
表 1 总结了 LLMs 在日本数学及整体日语基准任务上的表现。MGSM-JA 列展示了采用前述指标的 MGSM 测试集结果。日语语言模型(模型 1)展现出有限的数学能力,而数学专用模型(模型 2 和 3)虽具备数学优势,却表现出日语能力的不足。因此,三个模型在 MGSM-JA 测试中均得分较低,准确率均≤30.0。
Table 1: Performance comparison of the LLMs. Models 1–3 are the source models, Models 4–6 are our optimized merged models, and Models 7–11 are provided for reference. MGSM-JA measures Japanese math ability, and JP-LMEH evaluates general Japanese language capability, with a detailed breakdown provided in Table 2.
表 1:LLMs 性能对比。模型 1–3 为源模型,模型 4–6 为我们的优化合并模型,模型 7–11 为参考模型。MGSM-JA 用于衡量日语数学能力,JP-LMEH 评估通用日语能力,详细分类见表 2。
# | Model | Type 类型 | Size 大小 | MGSM-JA (acc ) | JP-LMEH (avg ) |
0 | Mistral-7B-v0.1 | EN general EN 通用 | 7B | 2.0 | 55.7 |
1 | Shisa Gamma 7B v1 | JA general JA 通用 | 7B | 9.6 | 66.1 |
2 | WizardMath 7B v1.1 | EN math EN 数学 | 7B | 18.4 | 60.1 |
3 | Abel 7B 002 | EN math EN 数学 | 7B | 30.0 | 56.5 |
4 | Ours (PS) 我们的方法 (PS) | 1 + 2 + 3 | 7B | 52.0 | 70.5 |
5 | Ours (DFS) 我们的方法 (DFS) | 3 + 1 | 10B | 36.4 | 53.2 |
6 | Ours (PS+DFS) 我们的方法 (PS+DFS) | 4 + 1 | 10B | 55.2 | 66.2 |
7 | Llama 2 70B | EN general EN 通用 | 70B | 18.0 | 64.5 |
8 | Japanese StableLM 70B 日本 StableLM 70B | JA general 日语通用 | 70B | 17.2 | 68.3 |
9 | Swallow 70B | JA general 日语通用 | 70B | 13.6 | 71.5 |
10 | GPT-3.5 | commercial 商业用途 | - | 50.4 | - |
11 | GPT-4 | commercial 商业用途 | - | 78.8 | - |
In contrast, our merged models (Models 4–6) manifest a substantial elevation in performance. Notably, the model merged in PS (Model 4) achieves an impressive score of 52.0, highlighting the remarkable potential in combining models with distinct expertises. The DFS-merged model (Model 5) also shows a performance enhancement, with an over 6 percent increase in accuracy compared to the source models. While the leap in performance is not as pronounced as with PS merging, it still proves to be a valid and orthogonal approach. Finally, our hybrid model (Model 6), which integrates both merging strategies, shows further enhancements on the task. The order of the source models in the indicator array $I$ affects the performance of the DFS merging method. We conduct experiments with all possible combinations and report the best scores in Table 1; we defer detailed analysis to Sec 3.2.5.
相比之下,我们合并的模型(模型 4-6)表现出显著的性能提升。值得注意的是,通过 PS 方法合并的模型(模型 4)取得了 52.0 分的优异成绩,这凸显了整合不同专业领域模型的巨大潜力。采用 DFS 方法合并的模型(模型 5)同样展现出性能增强,与源模型相比准确率提升超过 6%。虽然其性能飞跃不如 PS 合并法显著,但仍是有效且正交的解决方案。最终,我们融合两种策略的混合模型(模型 6)在任务中实现了进一步的性能提升。指示数组 $I$ 中源模型的排列顺序会影响 DFS 合并法的性能表现。我们测试了所有可能的组合方案,表 1 中呈现的是最佳得分,详细分析将在 3.2.5 节展开。
Figure 2 gives an overview of the five models’ “answer sheet” on the math problems. Our merged models retain the foundational knowledge in the source models, as evidenced by the similar score patterns on problems 1–15. Moreover, they exhibit emergent capabilities, successfully tackling problems that stumped the source models (e.g., problems 20–30). Evidently, by effectively integrating a Japanese LLM and mathematical models, we have succeeded in producing models that are proficient in both Japanese language understanding and mathematical problem-solving.
图 2 直观展示了五款模型在数学题"答题卡"上的表现。合并模型完整继承了源模型的基础知识(如第 1-15 题相似的得分模式),同时展现出突破性的新兴能力——成功解决了源模型未能攻克的难题(如第 20-30 题)。实验证明,通过有效整合日语 LLM 与数学专用模型,我们成功培育出兼具日语理解与数学解题能力的复合型模型。

Figure 2: Performance overview. This figure shows the models' performance on the MGSM-JA task: the 250 test problems are arranged by problem ID along the horizontal axis, and colored markers indicate correctly answered problems.
图 2:性能总览。本图展示了各模型在 MGSM-JA 任务中的表现,横轴按题号排列 250 道测试题,正确答题位置用彩色标记标注。
Furthermore, Table 2 presents the results of evaluating the general Japanese language ability using the Japanese Language Model Evaluation Harness (JP-LMEH) benchmark suite. This benchmark suite consists of nine tasks, and the average score across these tasks is widely used as an indicator of overall Japanese language proficiency. Our models achieve remarkably high scores of 70.5 and 66.2, surpassing the source models and even the previous state-of-the-art 70B parameter Japanese LLM (specifically, Japanese StableLM 70B), despite having only 7B to 10B parameters.
此外,表 2 展示了使用日语语言模型评估套件(JP-LMEH)基准测试对通用日语能力进行评估的结果。该基准套件包含九项任务,其平均分被广泛用作衡量整体日语能力的指标。尽管我们的模型仅有 70 亿至 100 亿参数,却取得了 70.5 和 66.2 的显著高分,不仅超越了源模型,甚至超过了先前拥有 700 亿参数的日语 LLM(即 Japanese StableLM 70B)的最先进水平。
Compared to the source Japanese model (shisa-gamma-7b-v1), it is interesting to note that there is not only a significant improvement in the MGSM scores but also improvements in other tasks such as JSQuAD and JAQKET. However, we also acknowledge that there are certain tasks, such as JNLI and MARC (for DFS), where performance has dropped. It should be noted that the MGSM scores here do not match those in Table 1, due to the differences in evaluation protocols (few-shot, prompting, etc.). Additional details are provided in Appendix A, including extensive comparisons with other models in Table 6.
与源日语模型(shisa-gamma-7b-v1)相比,值得注意的是不仅在 MGSM 分数上有显著提升,在 JSQuAD 和 JAQKET 等其他任务中也有所改进。然而我们也发现,在 JNLI 和 MARC(DFS 相关)等特定任务上性能有所下降。需要说明的是,由于评估协议(少样本学习、提示方式等)的差异,此处的 MGSM 分数与表 1 中的数据并不一致。附录 A 提供了更多细节,包括表 6 中与其他模型的全面对比。
Moreover, Appendix C showcases intriguing examples that demonstrate the utility of our models merged using evolution. The merged models correctly answered questions that require both knowledge about Japanese-culture and Math ability. In contrast, even if such Japanese questions were translated into English and answered in English, English math models would likely fail to provide the correct answers as they may not be aware of Japanese culture-specific context in the questions.
此外,附录 C 展示了一些引人入胜的案例,证明了通过进化合并的模型具有实际应用价值。这些合并模型能正确回答需要同时具备日本文化知识和数学能力的问题。相比之下,即使将这类日语问题翻译成英文并用英语回答,英语数学模型也可能因缺乏对问题中日本文化特定背景的理解而无法给出正确答案。
Table 2: Breakdown of JP-LMEH scores for Japanese language ability. JP-LMEH (Japanese Language Model Evaluation Harness) is a benchmark suite consisting of 9 tasks, and the average score (Avg column) is used as an indicator of overall Japanese language proficiency. Full results are in Table 6.
表 2:日语能力 JP-LMEH 分数细分表。JP-LMEH(日语语言模型评估套件)是由 9 项任务组成的基准测试套件,其平均分(Avg 列)被用作整体日语能力的指标。完整结果见表 6。
JP Language Model Evaluation Harness 日语语言模型评估套件
Model | Size 参数量 | JComQA | JNLI | MARC | JSQuAD | JAQKET | XLSum | XWino | MGSM | JCoLA | Avg 平均分 |
Shisa Gamma 7b v1 | 7B | 91.2 | 72.1 | 94.6 | 73.9 | 68.0 | 25.9 | 80.5 | 29.6 | 58.7 | 66.1 |
WizardMath 7B V1.1 | 7B | 74.7 | 42.7 | 90.4 | 84.6 | 68.5 | 22.3 | 69.8 | 38.8 | 48.9 | 60.1 |
Abel 7B 002 | 7B | 70.3 | 51.8 | 62.3 | 83.8 | 69.0 | 22.5 | 68.2 | 28.0 | 52.7 | 56.5 |
Ours (PS) 本模型(PS) | 7B | 89.1 | 65.7 | 95.4 | 89.5 | 77.7 | 25.5 | 81.2 | 50.0 | 60.5 | 70.5 |
Ours (DFS) 我们的方法(DFS) | 10B | 67.7 | 58.2 | 53.5 | 66.8 | 54.3 | 17.3 | 65.6 | 30.0 | 65.6 | 53.2 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 10B | 88.2 | 50.3 | 91.5 | 78.6 | 77.8 | 23.2 | 73.0 | 40.0 | 73.0 | 66.2 |
Llama 2 70B | 70B | 80.2 | 53.4 | 94.4 | 91.6 | 80.1 | 21.8 | 73.6 | 30.4 | 54.6 | 64.5 |
Japanese Stable LM 70B | 70B | 91.2 | 50.4 | 92.9 | 87.1 | 88.4 | 24.3 | 82.0 | 37.2 | 61.7 | 68.3 |
Swallow 70B | 70B | 95.3 | 57.2 | 91.7 | 94.1 | 93.9 | 23.1 | 83.3 | 45.2 | 59.5 | 71.5 |
3.1.3 Analysis 3.1.3 分析
In our exploration of model merging in the parameter space, we experimented with diverse configurations, such as varying the assignment of merging parameters across different layer groups. However, due to a constrained dataset, we did not witness notable improvements in performance correlating with increase in configuration complexity. Consequently, we focused our reporting on a PS merged model (Model 4 in Table 1) that adopts the simplest setting: considering each source model as a singular layer and allocating two DARE-TIES associated parameters to each for evolutionary merging. Figure 3 illustrates the evolved parameter configuration post PS merging.
在参数空间模型融合的探索中,我们尝试了多样化配置(例如对不同层组分配不同融合参数)。但由于数据集规模有限,未能观察到配置复杂度提升带来的显著性能改善。因此我们最终选择报告采用最简单设置的 PS 融合模型(表 1 中的模型 4):将每个源模型视为单一层级,并为每个模型分配两个 DARE-TIES 关联参数进行进化融合。图 3 展示了 PS 融合后的进化参数配置结果。
The CMA-ES optimization results reveal that all three models are important, as suggested by the uniformity of the optimized weighting values.
The fact that the sum of the weights exceeds 1, approaching 2, is noteworthy. This suggests that a combination method which amplifies the contributions of the models, rather than a simple interpolation, proved to be more effective.
CMA-ES 优化结果表明三个模型都具有重要性,这体现在优化权重值的均匀分布上。值得注意的是权重总和超过 1 且接近 2 的现象,这表明采用放大模型贡献的组合方法(而非简单插值)被证明更为有效。
The dominant density from the Japanese LLM suggests its critical contribution to solving the task. We conjecture that this may also be partially attributed to the Japanese LM’s larger amount of fine-tunings from the Mistral base model.
Japanese LMs based on English models, such as Shisa-Gamma-7B-v1, are typically created through a two-step process: continued pretraining and instruction fine-tuning. The continued pretraining phase involves learning from a significantly larger dataset compared to standard fine-tuning. For instance, Shisa-Gamma-7B-v1 is based on a Japanese base model that underwent continued pretraining on 100B tokens of Japanese text. The resulting differences in weights between Shisa-Gamma-7B-v1 and the original Mistral-7B-v1 likely encapsulate more information than standard fine-tuning, making them more challenging to sparsify.
In line with the discussion in [66, Section 4.6], the sparsification of DARE tends to degrade performance when applied to such extensively fine-tuned models. Our evolutionary search has seemingly managed to address this issue by increasing the density for the Japanese LM.
日本 LLM 的主导密度表明其对任务解决的关键贡献。我们推测这可能部分归因于日本语言模型基于 Mistral 基础模型进行了更大量的微调。基于英语模型的日本语言模型(如 Shisa-Gamma-7B-v1)通常通过两阶段流程创建:持续预训练和指令微调。其中持续预训练阶段使用的数据量远超常规微调,例如 Shisa-Gamma-7B-v1 的日语基础模型曾接受过 1000 亿日语 token 的持续预训练。这使得 Shisa-Gamma-7B-v1 与原始 Mistral-7B-v1 的权重差异包含比标准微调更丰富的信息,导致稀疏化更为困难。如[66,第 4.6 节]所述,对此类深度微调模型应用 DARE 稀疏化往往会降低性能。我们的进化搜索通过提高日本语言模型的密度,似乎成功解决了这一问题。
The parameter settings derived from our PS merging experiments align well with outcomes from our DFS merging efforts. By incorporating the PS-merged model into our pool of source models and applying DFS merging across all potential pairings, we observed optimal performance with the combination of the PS-merged model and the Japanese language model (Model 6 in Table 1). This finding echoes the significant influence of the Japanese language model, as indicated by its notable presence in Figure 3, and reaffirms the substantial promise of evolutionary model merging.
PS 融合实验得出的参数设置与 DFS 融合结果高度吻合。将 PS 融合模型加入源模型池并对所有可能配对进行 DFS 融合后,我们发现 PS 融合模型与日语语言模型的组合(表 1 中的模型 6)表现最优。这一发现既呼应了图 3 中日语语言模型的显著影响力,也再次验证了进化模型融合的巨大潜力。

Figure 3: Evolved configurations for PS merging. Although the weights across the three source models are similar, the notably higher density assigned to the Japanese LLM highlights its critical role.
图 3:PS 合并的进化配置。尽管三个源模型的权重相似,但日本 LLM 显著的密度凸显了其关键作用。
Figure 4 displays the evolution of the inference path, where our approach consistently recognized the value of initial steps, incorporating every layer from the first model (our PS-merged model), except for the last decoding layer and the embedding layer. As the process advanced, the method refined the selection to a smaller, more effective set of layers and strategically alternated between layers from both contributing models. Significantly, the scaling parameters $W_{ij}$ emerged as crucial elements; our ablation studies revealed that eliminating them in the evolved model (e.g., by setting $W_{ij} = 1$) led to a performance decline exceeding 20 percent, highlighting their importance in the model's efficacy.
图 4 展示了推理路径的演变过程,我们的方法始终认识到初始步骤的价值,除了解码最后一层和嵌入层外,纳入了第一个模型(即我们的 PS 合并模型)的每一层。随着进程推进,该方法将选择精炼至更小、更有效的层集,并策略性地交替使用两个贡献模型的层。值得注意的是,缩放参数 $W_{ij}$ 成为关键要素——我们的消融研究表明,若在进化模型中移除这些参数(例如设为 $W_{ij} = 1$),会导致性能下降超过 20%,这凸显了它们对模型效能的重要性。

Figure 4: Evolved configurations for DFS merging of models A and B. The three sub-figures depict the evolution of the inference path on the MGSM-JA task. The y-axis denotes the layer index $\ell$ and the x-axis the path index $t$. Blue markers represent path steps that use layers from model A, and red markers indicate layers from B. The shade of each marker reflects the magnitude of the scaling factor $W_{ij}$. The evolutionary search included most layers of A early on, and then alternated between layers from both models. The results are from our 10B model (PS+DFS).
图 4:模型 A 与 B 的 DFS 合并进化配置。三幅子图描绘了 MGSM-JA 任务上推理路径的演变。y 轴表示层索引 $\ell$,x 轴对应路径索引 $t$。蓝色标记代表使用模型 A 层的路径步骤,红色标记表示来自 B 的层。标记颜色深浅反映缩放因子 $W_{ij}$ 的强度。进化搜索结果早期包含 A 的大部分层,随后在两组模型层间交替选择。结果来自我们的 100 亿参数模型(PS+DFS)。
3.2 Method Behavior Analysis
3.2 方法行为分析
In this section, we conduct a more in-depth analysis of our method’s performance. We compare it with multiple types of baselines, experimentally analyze the effect of manual model selection, verify the scalability of our proposed approach to larger models, and show results from ablation studies.
本节我们将更深入分析方法的性能表现。通过与多种基线对比,实验性评估人工模型选择的影响,验证所提方法在大模型上的可扩展性,并展示消融研究结果。
3.2.1 Comparison with Unoptimized Model Merging
3.2.1 与非优化模型合并的对比
Table 3 presents the performance of models created through unoptimized model merging. We utilized three methods: TIES-Merge [64], DARE-TIES [66], and Frankenmerging [22].
We used mergekit to establish the baselines.
Since mergekit does not include default hyperparameters, we adopted those from a widely recognized resource for learning about model merging and mergekit [35].
表 3 呈现了非优化模型合并的性能表现。我们采用三种方法:TIES-Merge[64]、DARE-TIES[66]和 Frankenmerging[22],使用 mergekit 建立基线。由于 mergekit 未包含默认超参数,我们采用了模型合并学习权威资源[35]中的参数设置。
Table 3: Performance comparison of the proposed methods and baselines. The table reports MGSM-JA accuracy and JP-LMEH average scores for our methods (PS, DFS, PS+DFS), unoptimized model merging techniques, fine-tuning approaches, and the distraction experiments with irrelevant models.
表 3:所提方法与基线的性能对比。结果显示 MGSM-JA 准确率和 JP-LMEH 平均分数对比,包括我们的方法(PS、DFS、PS+DFS)、非优化模型合并技术、微调方法以及无关模型干扰实验。
Method 方法 | Size 规模 | MGSM-JA (acc ) MGSM-JA(acc ) | JP-LMEH (avg ) JP-LMEH(avg ) |
---|---|---|---|
Ours (PS) 本方法(PS) | 7B | 52.0 | 70.5 |
Ours (DFS) 本方法(DFS) | 10B | 36.4 | 53.2 |
Ours (PS+DFS) 本方法(PS+DFS) | 10B | 55.2 | 66.2 |
Unoptimized model merging: 非优化模型合并:
TIES-Merge [64] | 7B | 4.4 | 63.7 |
DARE-TIES [66] | 7B | 35.2 | 66.3 |
Frankenmerging [22] | 9B | 0.0 | 16.1 |
Fine-tuning: 微调:
LoRA, Shisa Gamma 7B v1 | 7B | 20.8 | 66.4
LoRA, WizardMath 7B v1.1 LoRA,WizardMath 7B v1.1 | 7B | 43.2 | 55.9 |
LoRA, Abel 7B 002 LoRA,Abel 7B 002 | 7B | 40.0 | 52.8 |
Full, Shisa Gamma 7B v1 | 7B | 19.2 | 54.5
Full, WizardMath 7B v1.1 完整模型,WizardMath 7B v1.1 | 7B | 33.6 | 43.4 |
Full, Abel 7B 002 完整模型,Abel 7B 002 | 7B | 36.4 | 37.5 |
Distraction: 干扰项:
+0 irrelevant models 无关模型 | 7B | 50.0 | 65.9
+1 irrelevant model 无关模型 | 7B | 46.8 | 64.2
+2 irrelevant models 无关模型 | 7B | 46.8 | 64.1
+4 irrelevant models 无关模型 | 7B | 48.4 | 64.0
+8 irrelevant models 无关模型 | 7B | 40.8 | 65.8
For the PS baseline equivalents (TIES-Merge and DARE-TIES), we used shisa-gamma-7b-v1, WizardMath-7B-V1.1, and Abel-7B-002 as the source models, allowing direct comparison with our PS experiments. Similarly, for the DFS baseline equivalent (Frankenmerging), we used the same combination of source models as in our DFS experiments, enabling direct comparisons.
对于 PS 基线等效方法(TIES-Merge 和 DARE-TIES),我们使用 shisa-gamma-7b-v1、WizardMath-7B-V1.1 和 Abel-7B-002 作为源模型,以便与我们的 PS 实验直接比较。同样地,对于 DFS 基线等效方法(Frankenmerging),我们采用了与 DFS 实验中相同的源模型组合,从而实现直接对比。
The results indicate that TIES-Merge and Frankenmerging led to decreased performance in both MGSM-JA and JP-LMEH, suggesting unsuccessful merging. Notably, Frankenmerging resulted in a MGSM-JA score of 0, which is consistent with its general tendency to often produce models that fail to generate meaningful outputs. DARE-TIES performed relatively better among these baselines, with a slight improvement in MGSM-JA compared to the source models. However, our PS model significantly outperformed all these baselines, and the hybrid merging method (PS+DFS) gives the best test performance on the target task.
结果表明,TIES-Merge 和 Frankenmerging 导致 MGSM-JA 和 JP-LMEH 性能下降,表明合并未获成功。值得注意的是,Frankenmerging 的 MGSM-JA 得分为 0,这与其常生成无意义输出模型的普遍倾向一致。在这些基线方法中,DARE-TIES 表现相对较好,其 MGSM-JA 得分较源模型略有提升。然而,我们的 PS 模型显著优于所有这些基线方法,而混合合并方法(PS+DFS)在目标任务上取得了最佳测试性能。
3.2.2 Comparison with Fine-tuning
3.2.2 与微调方法的比较
Table 3 also compares our evolutionary model merging approach with the fine-tuning method. To ensure a fair comparison, we used the identical dataset. Specifically, we utilized the 1069 samples of Japanese-translated GSM8k that were employed in our PS and DFS optimization for fine-tuning the models. We tested learning rates of 1e-5, 5e-5, and 1e-4, choosing the optimal one. We set the number of epochs to 3 and tried both LoRA and full parameter fine-tuning. We experimented with all three source models: shisa-gamma-7b-v1, WizardMath-7B-V1.1, and Abel-7B-002.
表 3 还对比了我们的进化模型合并方法与微调方法。为确保公平比较,我们使用了相同的数据集。具体而言,我们采用 PS 和 DFS 优化中使用的 1069 个日语翻译 GSM8k 样本对模型进行微调。我们测试了 1e-5、5e-5 和 1e-4 三种学习率并选择最优值,设置训练轮数为 3,同时尝试了 LoRA 和全参数微调。我们对三个源模型(shisa-gamma-7b-v1、WizardMath-7B-V1.1 和 Abel-7B-002)均进行了实验。
The MGSM-JA scores show that while fine-tuning generally improved performance over the original models, the highest score achieved was 43.2. This is notably lower than our proposed method, which exceeded a score of 50. Additionally, unlike our approach, fine-tuning often led to significant decreases in JP-LMEH scores. This suggests that fine-tuning may be more data-hungry and prone to catastrophic forgetting issues.
MGSM-JA 分数显示,虽然微调通常能提升原始模型性能,但最高得分仅为 43.2,显著低于我们提出的超过 50 分的方法。此外,与我们的方法不同,微调常导致 JP-LMEH 分数大幅下降,这表明微调可能更依赖数据且容易产生灾难性遗忘问题。
Other potential fine-tuning baselines.
Another potential fine-tuning baseline would be combining all data used by all source models to fine-tune a base model. However, this approach is often impractical in general settings, and in our specific case, it is indeed impossible to implement for comparison. We elaborate on these points below. Our proposed approach’s strength lies in its ability to freely combine openly published models from platforms like HuggingFace. However, model publishers often only release the models themselves, not the training datasets or detailed training methods, as these are considered proprietary knowledge. For instance, the training data for WizardMath-7B-V1.1 and Abel-7B-002 is not publicly available. shisa-gamma-7b-v1 is a further fine-tuned version of japanese-stablelm-base-gamma-7b (which is Mistral-7B-v0.1 fine-tuned for Japanese), and the data used for training japanese-stablelm-base-gamma-7b is also not public. Moreover, the Japanese fine-tuning for japanese-stablelm-base-gamma-7b involved 100B tokens, indicating a substantial computational cost (such as more than 10,000 GPU hours). Therefore, a comparison with this approach is not feasible. It is also worth noting that such fine-tuning would require careful tuning of dataset mixing ratios, which is non-trivial and costly.
其他潜在的微调基线。另一种潜在的微调基线是将所有源模型使用的数据合并后对基础模型进行微调。然而,这种方法在通用场景中通常不切实际,而在我们的具体案例中,确实无法实现对比。我们在下文详述这些要点。我们提出方法的优势在于能自由组合 HuggingFace 等平台公开的模型,但模型发布者通常仅公开模型本身,不提供训练数据集或详细训练方法——这些被视为专有知识。例如 WizardMath-7B-V1.1 和 Abel-7B-002 的训练数据未公开,shisa-gamma-7b-v1 是 japanese-stablelm-base-gamma-7b(基于 Mistral-7B-v0.1 日语微调的版本)的进一步微调版,而 japanese-stablelm-base-gamma-7b 的训练数据同样未公开。此外,japanese-stablelm-base-gamma-7b 的日语微调消耗了 1000 亿 token,意味着巨大的计算成本(如超 1 万 GPU 小时)。因此与该方法的对比不可行。值得注意的是,此类微调还需谨慎调整数据集混合比例,这既复杂又昂贵。
3.2.3 Impact of Manual Model Selection
3.2.3 人工模型选择的影响
In our main experiments, we manually selected three source models. To assess the broader applicability of our approach, it is useful to investigate how sensitive the method is to the selection of source models. To this end, we conducted experiments to measure the distraction effect of including irrelevant models, which simulate the scenario where suboptimal models are inadvertently selected.
在主体实验中我们人工选择了三个源模型。为评估方法的普适性,有必要研究该方法对源模型选择的敏感性。为此我们通过引入无关模型测量干扰效应,模拟意外选择次优模型的情景。
In addition to our original seed models (shisa-gamma-7b-v1, WizardMath-7B-V1.1, Abel-7B-002), we added one, two, four, and eight irrelevant models and performed PS experiments to examine the impact on performance.
The irrelevant models, chosen for their lack of relation to Japanese or mathematics, were: leo-mistral-hessianai-7b (German), lince-mistral-7b-it-es (Spanish), Mistral-7B-v0.3-Chinese-Chat (Chinese), BioMistral-7B (biomedical domain), Claire-Mistral-7B-0.1 (French), komt-mistral-7b-v1 (Korean), Mistral-7B-v0.1-arabic (Arabic), and Loquace-7B-Mistral (Italian). All these models are openly available on HuggingFace.
为探究方法对源模型选择的敏感性,我们通过引入无关模型测量干扰效应。除原始种子模型(shisa-gamma-7b-v1、WizardMath-7B-V1.1、Abel-7B-002)外,我们分别实验引入 1/2/4/8 个无关模型,并通过 PS 实验观察性能影响。所选无关模型均与日语或数学无关,包括:leo-mistral-hessianai-7b(德语)、lince-mistral-7b-it-es(西班牙语)、Mistral-7B-v0.3-Chinese-Chat(中文)、BioMistral-7B(生物医学领域)、Claire-Mistral-7B-0.1(法语)、komt-mistral-7b-v1(韩语)、Mistral-7B-v0.1-arabic(阿拉伯语)和 Loquace-7B-Mistral(意大利语)。这些模型均可在 HuggingFace 公开获取。
We slightly modified the PS optimization by adding weight normalization, which became necessary due to the increased number of models (up to 11 in total). This modification explains why the "+0 irrelevant model" result does not exactly match our original PS result.
我们对 PS 优化稍作修改,增加了权重归一化处理——这是由于模型数量增加(最多达 11 个)的必要调整。该修改解释了"+0 无关模型"结果与原 PS 结果不完全一致的原因。
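The exact form of this normalization is not spelled out above, so the following is only a minimal sketch of one plausible implementation: the evolved, unconstrained coefficients are clipped to be non-negative and rescaled to sum to one, which keeps the merged parameters on a comparable scale no matter how many source models (here up to 11) are included. The function name and the clipping choice are illustrative.

```python
import numpy as np

def normalize_merge_weights(raw_weights: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map unconstrained evolved coefficients to non-negative weights summing to 1."""
    w = np.maximum(raw_weights, 0.0)   # clip negative coefficients (illustrative choice)
    return w / (w.sum() + eps)         # rescale so the weights sum to one

# Example: raw CMA-ES coefficients for 11 candidate source models.
raw = np.random.randn(11)
weights = normalize_merge_weights(raw)
print(weights, weights.sum())
```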
The results, shown in the "Distraction" part of Table 3, are generally favorable. As expected, using only the initial three models yielded the best results. However, the performance degradation when adding irrelevant models was limited, especially up to four additional models, and even with 8 additional models, the performance did not collapse. This demonstrates that our optimization method is relatively robust to manual source model selection. Due to the huge GPU memory requirement at inference time, the distraction experiments were not conducted for the DFS merging method. We instead present behavior analysis and ablation studies on the topic in the following sections.
结果如表 3 的"干扰"部分所示,总体上表现良好。正如预期,仅使用初始三个模型时效果最佳。但加入无关模型时的性能下降有限,特别是在添加不超过四个额外模型时,即便添加 8 个额外模型性能也未崩溃。这表明我们的优化方法对人工选择的源模型具有较强鲁棒性。由于推理时 GPU 显存需求过大,未对 DFS 合并方法进行干扰实验,相关行为分析和消融研究将在后续章节展开。
3.2.4 Scaling to Larger Models
3.2.4 大规模模型扩展
Finally, we demonstrate that our proposed methods also work with larger models. Concretely, we used fine-tuned versions of Llama-2-13b: ELYZA-japanese-Llama-2-13b-instruct as the Japanese general model and MetaMath-13B-V1.0 as the English math model. All other settings remained identical to the 7B experiments. Due to the limited availability of suitable 13B models, we could only select two source models.
最后我们验证了该方法在大模型上的适用性。具体采用 Llama-2-13b 的微调版本:日语通用模型 ELYZA-japanese-Llama-2-13b-instruct 和英语数学模型 MetaMath-13B-V1.0,其他设置与 7B 实验保持一致。因适用 13B 模型有限,仅能选择两个源模型。
The results are shown in Table 4. Similar to the 7B case, the MGSM-JA score significantly improved over both source models. This consistent behavior at the 13B size demonstrates the scalability of our evolutionary model merging approach.
结果如表 4 所示。与 7B 情况类似,MGSM-JA 分数相对两个源模型均有显著提升。这种在 13B 规模上的一致性验证了我们进化式模型合并方法的可扩展性。
Table 4: Scaling to 13B models and DFS merging ablations. Performance comparison and ablation results on the MGSM-JA and JP-LMEH tasks; gray rows are ablation studies.
表 4:13B 模型扩展及 DFS 合并消融研究。MGSM-JA 和 JP-LMEH 任务的性能对比与消融结果,灰色行为消融研究。
# | Model | Type 类型 | Size 规模 | MGSM-JA (acc) MGSM-JA(准确率) | JP-LMEH (acc) JP-LMEH(准确率) |
---|---|---|---|---|---|
0 | Llama-2-13b | EN general 英语通用 | 13B | 2.8 | 54.1 |
1 | ELYZA-japanese-Llama-2-13b-instruct | JA general 日语通用 | 13B | 13.2 | 60.2 |
2 | MetaMath-13B-V1.0 | EN math 英语数学 | 13B | 8.0 | 48.7 |
3 | Ours (PS) 本方法(PS) | 1 + 2 | 13B | 31.2 | 59.7 |
4 | Ours (DFS) 我们的方法(DFS) | 1 + 2 | 19B | 12.8 | 61.1 |
5 | Ours (DFS) 我们的方法(DFS) | 2 + 1 | 21B | 23.2 | 46.6 |
6 | Ours (PS+DFS w/o W) 我们的方法(PS+DFS 不含 W) | 3 + 1 | 19B | 26.4 | 58.1 |
7 | Ours (PS+DFS) 我们的方法(PS+DFS) | 3 + 1 | 21B | 34.0 | 60.4 |
The 7B models achieved higher MGSM-JA scores (52.0, 36.4, 55.2) than the 13B models (31.2, 23.2, 34.0) for (PS, DFS, PS+DFS). This difference is attributed to the source models rather than our proposed method. Mistral-7B-v0.1 is known to outperform Llama-2-13b, especially in basic mathematical abilities, despite its smaller size. This is reflected in the performance of math-specialized models: WizardMath-7B-V1.1 scores 83.2 on GSM8k (an English math benchmark), while MetaMath-13B-V1.0 scores 72.3. This difference explains the performance gap in Japanese math abilities between our merged 7B and 13B models.
在(PS、DFS、PS+DFS)任务中,7B 模型获得的 MGSM-JA 分数(52.0、36.4、55.2)高于 13B 模型(31.2、23.2、34.0)。这一差异源于基础模型本身而非我们的方法。已知 Mistral-7B-v0.1 尽管规模较小,但其性能(尤其是基础数学能力)优于 Llama-2-13b——数学专项模型的表现印证了这点:WizardMath-7B-V1.1 在 GSM8k(英文数学基准)上获得 83.2 分,而 MetaMath-13B-V1.0 得分为 72.3。这一差距解释了我们的 7B 与 13B 合并模型在日语数学能力上的表现差异。
On the other hand, our DFS and PS+DFS models (models #4 and #7) achieved higher JP-LMEH average scores than ELYZA-japanese-Llama-2-13b-instruct, the Japanese general model.
In general, the qualitative behavior on JP-LMEH is consistent between 7B and 13B models. Both sets of results show improvements in JSQuAD, JAQKET, and MGSM scores, with slight decreases in JNLI and MARC compared to the original Japanese model, indicating similar trends in task-specific score changes (see Table 6).
另一方面,我们的 DFS 和 PS+DFS 模型(模型#4 与#7)在 JP-LMEH 平均分上超越了日语通用模型 ELYZA-japanese-Llama-2-13b-instruct。总体而言,7B 与 13B 模型在 JP-LMEH 上的定性表现具有一致性。两组结果均显示 JSQuAD、JAQKET 和 MGSM 分数提升,而 JNLI 与 MARC 分数相较原始日语模型略有下降,这表明任务特定分数变化趋势相似(见表 6)。
3.2.5 Analysis on DFS Merging
3.2.5 DFS 合并分析
We included the results from two ablation studies for DFS merging in Table 4, and we wish to gain insights into how and why DFS merging works.
我们在表 4 中列入了两项关于 DFS 合并的消融实验结果,旨在探究 DFS 机制的工作原理及其有效性。
Comparing models #4 and #5, we show that the order of the source models in DFS merging plays a critical role. Specifically, if we put the layers from the English math model in front of those from the Japanese general model in the indicator array I (model #4), the accuracy drops by more than 10 points compared with the case where the layer order is reversed (model #5). This “sub-optimal” behavior is a trade-off between flexibility and search efficiency. In our experiments, we initialize the indicator array I with zeros, but set the values corresponding to the layers of model A in the first repetition to 3σ, where σ is the initial standard deviation of CMA-ES exploration (recall that in DFS, layers from model A come before those from model B in each repetition). In this way, the merged model behaves just like model A at the beginning of the DFS process, and is allowed to gradually add or remove layers if necessary. Compared with the alternative of initializing I with only zeros, this setting effectively avoids the cases where poor stacking of the layers causes the LLM to output complete nonsense and waste explorations, at the cost of losing the flexibility to automatically swap the order of models in I.
通过对比模型#4 与#5,我们发现源模型在 DFS 合并中的层顺序至关重要。具体而言,若在指示数组 中将英语数学模型的层置于日语通用模型之前(模型#4),其准确率会比层序相反的情况(模型#5)降低 10 分以上。这种"次优"表现是灵活性与搜索效率之间的权衡。实验中,我们将指示数组 初始化为零值,但在首次重复时将模型 A 对应层的值设为 (其中 为 CMA-ES 探索的初始标准差)。如此,合并模型在 DFS 初期行为与模型 A 完全一致,后续可逐步增减层数。相较于全零初始化方案 ,该设置有效避免了因层堆叠不当导致 LLM 输出乱码而浪费搜索资源的情况,但代价是丧失了自动调换 中模型顺序的灵活性。
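To make the initialization described above concrete, here is a small numpy sketch of how the CMA-ES mean vector for the indicator array I could be built: every entry starts at zero except those for model A's layers in the first repetition, which are pushed to 3σ. The layer counts, number of repetitions, and function name are placeholders for illustration.

```python
import numpy as np

def init_indicator_mean(num_layers_a: int, num_layers_b: int,
                        repetitions: int, sigma: float) -> np.ndarray:
    """Initial CMA-ES mean for the layer-inclusion indicator array I.

    In each repetition, model A's layers precede model B's layers. Only the
    entries for model A's layers in the first repetition are set to 3*sigma,
    so the merged model initially behaves exactly like model A and the search
    can then gradually add or remove layers.
    """
    per_repetition = num_layers_a + num_layers_b
    mean = np.zeros(repetitions * per_repetition)
    mean[:num_layers_a] = 3.0 * sigma
    return mean

# Example: two 40-layer 13B models, 3 repetitions, initial CMA-ES step size 0.1.
mean = init_indicator_mean(40, 40, repetitions=3, sigma=0.1)
print(mean[:5], mean.shape)
```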
In our second ablation study, the difference between models #6 and #7 highlights the importance of including the scaling matrix W. Without scaling, the performance of model #6 is even worse than the PS-merged model, one of its source models. While sufficient in this setting, we wish to point out that this simple scaling treatment does not work if any of the source models has been fine-tuned for a significantly longer time. We suspect that prolonged fine-tuning might lead to increased incompatibility for data traversing through layers from different models, possibly due to permutations in the weight matrices. For example, suppose we order the 4096 dimensions of the hidden states by their absolute values: data from layer 1 in model A might look like (1, 2, …, 4096); in model B it might appear as (1.1, 1.9, …, 4097) when only slightly fine-tuned, but as (4001, 133, 3, …, 989) after extended fine-tuning.
第二项消融研究通过模型#6 与#7 的对比揭示了缩放矩阵 的重要性。未引入缩放时,模型#6 性能甚至逊于其源模型之一的 PS 合并模型。尽管当前设置中缩放机制已足够有效,但需指出:若任一源模型经过超长时微调,这种简单缩放将失效。我们推测长期微调可能导致跨模型层间数据兼容性下降,这或许源于权重矩阵的排列变化。例如,假设按隐藏状态 4096 维度的绝对值排序,模型 A 第 1 层数据可能呈现(1,2,…4096),而经深度微调的模型 B 中可能变为(4001,133,3,…989)而非轻微微调时的(1.1,1.9,…4097)。
While it is hard to thoroughly understand why DFS merging improves performance, we analyzed the merged model (model #5 in Table 4) to gain insights. Figure 5 shows the final hops and scales configuration of the merged model. It is important to note that we forced the scaling factor W_ij = 1 whenever the i-th and j-th layers come from the same model. This prevents the possibility of introducing extra modeling power via W; in this manner, we are certain that all of the performance boost comes from layer stacking and permutation.
It is interesting and surprising to notice that the DFS-merged model decides to skip layer #30 of MetaMath-13B-V1.0 at the beginning of the inference path. Based on this finding, we removed layer #30 from MetaMath-13B-V1.0, and without any further modifications its MGSM-JA accuracy increased from 8.0% to 10%. This implies that DFS merging works partly because it is able to identify redundant or harmful operations (with respect to a specific task) embedded in certain layers, and performance increases once these layers are removed.
虽然难以彻底理解 DFS 合并为何能提升性能,但我们通过对合并模型(表 4 中的模型#5)的分析获得了一些洞见。图 5 展示了该合并模型的最终跳跃路径与缩放系数配置。值得注意的是,我们强制规定当 和 层来自同一模型时,缩放因子设为 。这是为了避免通过 引入额外建模能力的可能性,从而确保所有性能提升都源于层堆叠与排列。有趣且令人惊讶的是,DFS 合并模型决定在推理路径起始处跳过 MetaMath-13B-V1.0 的第 30 层。基于这一发现,我们移除了 MetaMath-13B-V1.0 的第 30 层,未作其他修改时性能竟提升了 10%。这表明 DFS 合并的有效性在于它能识别特定任务中某些层内冗余/有害的操作,移除这些层即可提升性能。
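As a rough illustration of this constraint, the sketch below walks a hidden state along an evolved inference path and applies a learned scale only when two consecutive layers come from different source models; within the same model the scale is fixed to 1, so any gain must come from the layer stacking itself. This is a simplification: the scales are looked up per pair of adjacent models rather than per pair of layer indices as in the scaling matrix W, and each layer is stubbed as a plain callable.

```python
import torch

def run_inference_path(hidden, path, cross_model_scales):
    """Apply layers along an evolved DFS inference path.

    hidden             : current hidden-state tensor
    path               : list of (model_id, layer_fn) pairs in evolved order
    cross_model_scales : dict mapping (prev_model_id, next_model_id) -> learned scale
    """
    prev_model = None
    for model_id, layer_fn in path:
        if prev_model is not None and prev_model != model_id:
            # Crossing between source models: rescale to reduce distribution mismatch.
            hidden = hidden * cross_model_scales[(prev_model, model_id)]
        # Staying within the same model: the scale is forced to 1 (no rescaling).
        hidden = layer_fn(hidden)
        prev_model = model_id
    return hidden

# Toy usage with identity "layers" from two models A and B.
path = [("A", torch.nn.Identity()), ("A", torch.nn.Identity()),
        ("B", torch.nn.Identity()), ("A", torch.nn.Identity())]
scales = {("A", "B"): 0.9, ("B", "A"): 1.1}
out = run_inference_path(torch.randn(1, 8), path, scales)
print(out.shape)
```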

Figure 5: Evolved configurations of the DFS-merged 13B model. The style and settings of this figure are identical to those of Figure 4. Surprisingly, DFS decides to skip layer #30 in the original model (model #2 in Table 4). If we simply remove this layer from model #2 without any other modifications, its performance on MGSM-JA increases to 10%. The configuration shown is that of model #5 in Table 4.
图 5:DFS 合并 13B 模型的演化配置。本图样式与设置与图 4 完全相同。令人惊讶的是,DFS 选择跳过原模型(表 4 中模型#2)的第 30 层。若我们仅从模型#2 中移除该层且不作其他修改,其在 MGSM-JA 上的性能将提升至 10%。此配置来自表 4 中的模型#5。
Furthermore, we analyzed the problems that our DFS-merged model answered correctly while the math source model did not, and identified two main scenarios where DFS helped improve performance; see Figure 6 for illustrations. In the first scenario, the English math model was able to reason correctly, but it output the reasoning steps or answers in English, violating the requirement that the response must be written in Japanese. In this scenario, DFS merging managed to shift the distribution of output tokens toward Japanese vocabulary, fixing the problem. In the second scenario, the English math model was actually reasoning in Japanese, but due to errors in understanding the Japanese problem or in the reasoning itself, it was not able to give correct answers. DFS merging, by appending layers from the Japanese general model, was able to largely avoid this problem. Unlike the “subtraction” described previously, these scenarios suggest that DFS also works because it is capable of “addition”: finding what is missing in one of the source models and compensating for it by adding layers from other models to the inference path.
此外,我们分析了 DFS 合并模型答对而数学源模型答错的问题,发现 DFS 主要通过两种场景提升性能(示例见图 6)。第一种场景中,英文数学模型本可正确推理,但其输出步骤或答案使用英语,违反了必须用日语回答的要求。DFS 合并通过将输出词分布向日语词汇偏移解决了该问题。第二种场景中,英文数学模型虽用日语推理,但由于对日语问题的理解错误或推理失误未能给出正确答案。DFS 合并通过追加日语通用模型的层,大幅规避了这个问题。与先前"减法"操作不同,这些场景表明 DFS 还能通过"加法"发挥作用——发现源模型缺失的能力,并通过从其他模型添加推理路径中的层来补偿。

Figure 6: Why DFS helps improve performance. The DFS-merged model improves its performance on MGSM-JA mainly in two ways: (1) shifting the output token distribution toward Japanese (Scenario 1), and (2) helping the LLM better understand the problems (Scenario 2). We show the raw outputs from models #2 and #5 in Table 4, with redundant lines removed for better visualization.
图 6:DFS 提升性能的原因。DFS 合并模型主要通过以下两种方式提升 MGSM-JA 任务表现:(1) 将输出词元分布向日语倾斜(情景 1);(2) 帮助 LLM 更好地理解问题(情景 2)。图中展示了表 4 中模型#2 和#5 的原始输出,但为便于可视化删减了冗余行。
3.3 Evolving Japanese VLM 3.3 日语视觉语言模型的进化
3.3.1 Multi-modality Extension
3.3.1 多模态扩展
We now extend our method to multi-modal models and evolve a Japanese VLM that is aware of culturally-specific content. VLMs have recently shown remarkable progress by applying the powerful instruction-following capabilities of pre-trained LLMs. The architecture of a VLM generally consists of three components: (1) a vision encoder to extract image features; (2) an LLM to generate text (for the purpose of describing an image); and (3) a projection network to map image features into the LLM’s embedding space [36, 16, 40, 38, 12]. Crucially, the LLM component is initialized with a powerful pre-trained LLM for its text generation capabilities. During training, the projection network and optionally the LLM are trained on various vision-language datasets, while the vision encoder is kept fixed.
我们将方法扩展至多模态领域,进化出具有文化特定内容理解能力的日语视觉语言模型(VLM)。近年来,通过结合预训练 LLMs 强大的指令跟随能力,VLMs 取得了显著进展。典型 VLM 架构包含三个组件:(1) 提取图像特征的视觉编码器;(2) 生成文本(用于描述图像)的 LLM;(3) 将图像特征映射到 LLM 嵌入空间的投影网络[36,16,40,38,12]。关键之处在于,LLM 组件会初始化具有强大文本生成能力的预训练模型。训练过程中,投影网络和 LLM(可选)会在多模态数据集上进行微调,而视觉编码器保持固定。
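To make this three-component architecture concrete, here is a minimal, self-contained sketch of a LLaVA-style forward pass: image features from a stand-in vision encoder are projected into the LLM's embedding space and prepended to the text token embeddings as visual soft prompts. All module sizes and names are illustrative placeholders, not the actual LLaVA-1.6 implementation.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative VLM skeleton: vision encoder + projector + language model."""

    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)      # stand-in for a ViT
        self.projector = nn.Sequential(                               # maps into LLM space
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.tok_embed = nn.Embedding(vocab_size, llm_dim)            # part of the LLM
        self.llm = nn.TransformerEncoder(                             # stand-in for the LLM
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, input_ids):
        img_feat = self.vision_encoder(image.flatten(1))              # (B, vision_dim)
        visual_tokens = self.projector(img_feat).unsqueeze(1)         # (B, 1, llm_dim)
        text_tokens = self.tok_embed(input_ids)                       # (B, T, llm_dim)
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)     # soft prompt + text
        return self.lm_head(self.llm(sequence))                       # (B, T + 1, vocab)

vlm = TinyVLM()
logits = vlm(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```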
3.3.2 Experimental Setup 3.3.2 实验设置
Source Models 源模型
The LLM component inside a VLM can be regarded as a standalone LLM, with the extra capability of understanding visual soft prompts.
From this perspective, by fixing the vision encoder and the projection network and only focusing on the LLM component, it is straightforward to apply the methodologies detailed in Section 2 to produce a new LLM with expanded capabilities.
VLM 中的 LLM 组件可视为具备视觉软提示理解能力的独立 LLM。基于此视角,通过固定视觉编码器和投影网络并仅聚焦 LLM 组件,可直接应用第 2 章所述方法生成具备扩展能力的新 LLM。
In this experiment, we merge a Japanese LLM and the LLM component in a VLM in the parameter space. We select shisa-gamma-7b-v1 [10] as the Japanese LLM and LLaVA-1.6-Mistral-7B [39] as the VLM. Both models are fine-tunes of the Mistral-7B-v0.1 [29] base model.
本实验中,我们在参数空间合并日语 LLM 与 VLM 的 LLM 组件。选择 shisa-gamma-7b-v1[10]作为日语 LLM,LLaVA-1.6-Mistral-7B[39]作为 VLM。二者均基于 Mistral-7B-v0.1[29]微调而来。
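A minimal sketch of this recipe is shown below, under the assumption (purely for illustration) that the VLM checkpoint stores its LLM weights under a "language_model." key prefix. Only those tensors are mixed with the corresponding Japanese LLM tensors, while the vision encoder and projection network are copied through unchanged; plain linear interpolation stands in here for the TIES-Merging-with-DARE recipe actually used.

```python
import torch

def merge_llm_component(vlm_state, ja_llm_state, weights, prefix="language_model."):
    """Merge only the LLM weights inside a VLM state dict.

    vlm_state    : state dict of the source VLM (vision encoder + projector + LLM)
    ja_llm_state : state dict of the Japanese LLM (same architecture as the VLM's LLM)
    weights      : evolved per-parameter mixing weights in [0, 1]
    prefix       : assumed key prefix of the LLM component inside the VLM checkpoint
    """
    merged = {}
    for name, tensor in vlm_state.items():
        if name.startswith(prefix) and name[len(prefix):] in ja_llm_state:
            w = weights.get(name, 0.5)
            merged[name] = (1.0 - w) * tensor + w * ja_llm_state[name[len(prefix):]]
        else:
            merged[name] = tensor.clone()   # vision encoder / projector stay fixed
    return merged

# Toy usage with two single-tensor "models".
vlm = {"vision_tower.weight": torch.ones(2), "language_model.weight": torch.zeros(2)}
ja_llm = {"weight": torch.ones(2)}
print(merge_llm_component(vlm, ja_llm, weights={"language_model.weight": 0.25}))
```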
Dataset 数据集
To the best of our knowledge, publicly accessible Japanese VLM datasets are scarce. In response, we created a new open Japanese VLM benchmark and assessed our VLM on a widely recognized Japanese VQA dataset. Our new benchmark dataset consists of:
据我们所知,公开可用的日语 VLM 数据集稀缺。为此,我们创建了新的日语 VLM 基准测试,并在广泛认可的日语 VQA 数据集上评估模型。新基准数据集包含:
- JA-VG-VQA-500: A 500-sample test set extracted from the Japanese Visual Genome VQA dataset [54].
  JA-VG-VQA-500:从日语 Visual Genome VQA 数据集[54]提取的 500 条测试样本
- JA-VLM-Bench-In-the-Wild: A Japanese version of LLaVA-Bench-In-the-Wild [40]. We compiled a rich collection of 42 images, accompanied by a total of 50 questions, featuring a variety of Japanese cultural elements and objects found in Japan. The QAs were crafted with the assistance of GPT-4V [2] and underwent a human-in-the-loop filtering process to eliminate nonsensical outcomes. Compared to the JA-VG-VQA-500 dataset, our set poses more complex challenges, demanding more nuanced and detailed responses.
  JA-VLM-Bench-In-the-Wild:LLaVA-Bench-In-the-Wild[40]的日语版本。我们收集了 42 张包含日本文化元素和本土物体的图像,配套 50 个问题。问答对在 GPT-4V[2]辅助下生成,并经过人工筛选剔除无效结果。相比 JA-VG-VQA-500,该数据集需要更细致复杂的回答。
We used another subset of the Japanese Visual Genome VQA dataset during the evolutionary search. This subset does not overlap with examples in the JA-VG-VQA-500 dataset, to avoid leakage in the optimization process.
进化搜索阶段使用了日语 Visual Genome VQA 数据集的另一子集,该子集与 JA-VG-VQA-500 无重叠样本以避免优化过程泄露。
The images in the JA-VLM-Bench-In-the-Wild dataset, which predominantly represent Japanese content, were carefully selected by native Japanese speakers to minimize the risk of insensitive or biased representations. All images were sourced from Unsplash and are published under the Unsplash license, which ensures that there are no ethical or legal issues with their use. Unsplash contributors agree to obtain necessary permissions from related individuals in their photos before uploading, which addresses potential concerns regarding the use of images containing recognizable humans. Additionally, the captions were generated using GPT-4-V and were meticulously reviewed by human annotators to ensure accuracy and cultural sensitivity. This process aims to create a dataset that is both ethically sourced and culturally appropriate.
JA-VLM-Bench-In-the-Wild 数据集中的图像主要代表日本内容,这些图像由日语母语者精心挑选,以最大程度地减少敏感或有偏见的表现风险。所有图像均来自 Unsplash,并按照 Unsplash 许可发布,确保其使用不存在伦理或法律问题。Unsplash 的贡献者同意在上传前获得照片中相关个人的必要许可,这解决了使用包含可识别人物图像的潜在问题。此外,标题使用 GPT-4-V 生成,并由人工标注者仔细审核,以确保准确性和文化敏感性。这一过程旨在创建一个既符合伦理来源又具有文化适应性的数据集。
Evaluation 评估
We consider two baselines in our experiments: LLaVA-1.6-Mistral-7B [39], one of our source models, and Japanese Stable VLM [55], a Japanese VLM trained from scratch on Japanese datasets.
我们在实验中设置了两个基线模型:作为源模型之一的 LLaVA-1.6-Mistral-7B[39],以及完全基于日本数据集训练的日语视觉语言模型 Japanese Stable VLM[55]。
All models adopt the same generation configurations, with deterministic decoding. We compute ROUGE-L after applying a Japanese language detector that replaces non-Japanese responses with empty text, so non-Japanese responses receive a score of zero. To be consistent with our LLM experiments in Section 3.1, we also employed fasttext [31, 30] for this language detection task. However, we made an exception for cases where the ground-truth answer itself contains non-Japanese but commonly seen words in Japanese texts (e.g., a widely recognized acronym such as “UFO”). In these instances, non-Japanese responses from models are not converted to empty texts.
所有模型均采用相同的确定性解码生成配置。我们通过日语检测器计算 ROUGE-L 分数,将非日文响应替换为空文本(非日文响应得分为零)。为保持与 3.1 节 LLM 实验的一致性,语言检测同样使用 fasttext[31,30]。但当地面真值答案本身包含日语文本中常见的非日文词汇(如"UFO"等广泛认可的缩写词)时,模型生成的非日文响应将保留原样。
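The scoring rule above can be summarized in a few lines of code. The sketch below assumes the fastText language-identification model lid.176.bin is available locally and that a tokenizer for Japanese text is supplied by the caller (Japanese tokenization itself is out of scope here); everything else, including the helper names, is illustrative.

```python
import fasttext

lang_id = fasttext.load_model("lid.176.bin")  # fastText language-identification model

def is_japanese(text: str) -> bool:
    labels, _ = lang_id.predict(text.replace("\n", " "))
    return labels[0] == "__label__ja"

def rouge_l(pred_tokens, ref_tokens) -> float:
    """F1 over the longest common subsequence (standard ROUGE-L)."""
    m, n = len(pred_tokens), len(ref_tokens)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred_tokens[i] == ref_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

def score(response, reference, tokenize, ref_has_common_loanword=False):
    # Non-Japanese responses are replaced with empty text (score 0), unless the
    # reference itself contains a commonly seen non-Japanese word such as "UFO".
    if not is_japanese(response) and not ref_has_common_loanword:
        return 0.0
    return rouge_l(tokenize(response), tokenize(reference))
```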
Optimization 优化方法
We use the identical settings as the earlier LLM merging experiments in Section 3.1. Concretely, we use TIES-Merging with DARE for merging the source models in the parameter space. For merging in the data flow space, we treat LLaVA-1.6-Mistral-7B as our model A and shisa-gamma-7b-v1 as model B. For PS+DFS, our PS-merged model is model A and shisa-gamma-7b-v1 is model B.
采用与 3.1 节 LLM 参数空间(PS)融合实验完全相同的设置:使用 TIES-Merging 结合 DARE 方法进行源模型的参数空间融合。在数据流空间(DFS)融合中,将 LlaVa 1.6 Mistral 7B 作为模型 A,shisa-gamma-7b-v1 作为模型 B;对于 PS+DFS 混合融合,PS 融合后的模型作为模型 A,shisa-gamma-7b-v1 仍为模型 B。
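Since the recipe names two specific techniques, the following rough, self-contained sketch shows what "TIES-Merging with DARE" amounts to for a single parameter tensor: DARE randomly drops entries of each task vector and rescales the survivors, TIES then elects a per-entry sign from the summed deltas and averages only the deltas that agree with it, and the per-model coefficients stand in for the quantities the evolutionary search tunes. This is an illustrative reimplementation, not the MergeKit code we actually used, and the drop rate and coefficients are placeholders.

```python
import torch

def dare(delta: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """DARE: randomly drop task-vector entries, rescale survivors by 1/(1 - p)."""
    mask = (torch.rand_like(delta) > drop_rate).float()
    return delta * mask / (1.0 - drop_rate)

def ties_merge_with_dare(base: torch.Tensor, finetuned: list, coeffs: list,
                         drop_rate: float = 0.5) -> torch.Tensor:
    """TIES-style merge of several fine-tuned tensors onto a shared base tensor."""
    deltas = [c * dare(ft - base, drop_rate) for ft, c in zip(finetuned, coeffs)]
    stacked = torch.stack(deltas)                        # (num_models, ...)
    elected_sign = torch.sign(stacked.sum(dim=0))        # per-entry sign election
    agree = (torch.sign(stacked) == elected_sign).float()
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1.0)
    return base + merged_delta

# Toy usage: one base tensor and two "fine-tuned" variants with evolved coefficients.
base = torch.zeros(4)
finetuned = [base + torch.randn(4) for _ in range(2)]
print(ties_merge_with_dare(base, finetuned, coeffs=[0.6, 0.4]))
```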
3.3.3 Experimental Results
3.3.3 实验结果
Table 5 compares the performance of our VLM with the baselines. Please note that the Japanese Stable VLM cannot be evaluated on JA-VG-VQA-500 because it was trained on this dataset.
表 5 对比了我们的视觉语言模型与基线模型的性能。需注意 Japanese Stable VLM 无法在 JA-VG-VQA-500 数据集上评估,因其训练数据包含该数据集。
Our merged VLMs’ enhanced performance on the JA-VG-VQA-500 benchmark indicates their proficiency in Japanese, highlighting the successful integration of the source Japanese LLM with the LLM component of the original VLM through evolutionary merging. Consistent with the findings in the previous discussions, simple merging without evolutionary search does not reach the performance of our approach; see the last three rows in Table 5. Furthermore, our models’ superior results on JA-VLM-Bench-In-the-Wild compared to both baselines exhibit their adeptness at navigating culturally-specific content.
我们融合后的视觉语言模型在 JA-VG-VQA-500 基准上的优异表现,证实了通过进化式融合成功将日语 LLM 整合至原视觉语言模型的 LLM 组件中。与先前讨论一致,未经进化搜索的简单融合方法(见表 5 最后三行)性能显著弱于我们的方法。此外,我们的模型在 JA-VLM-Bench-In-the-Wild 上超越两个基线模型的表现,展现出处理文化特定内容的卓越能力。
Besides the quantitative results in Table 5, we qualitatively compare our VLM with the baseline models in Appendix C. Our evolved model is able to handle Japanese culture-specific content remarkably well, generally producing more detailed responses with correct information.
除表 5 的量化结果外,附录 C 提供了与基线模型的定性对比。进化后的模型能出色处理日本文化特定内容,通常能生成信息更准确、细节更丰富的响应。
Table 5: Performance comparison of the VLMs. LLaVA 1.6 Mistral 7B is the source VLM and Japanese Stable VLM is an open-sourced Japanese VLM. JA-VG-VQA-500 measures general Japanese VQA ability, while JA-VLM-Bench-In-the-Wild evaluates the ability to handle complex VQA about Japanese culture-specific content.
表 5:视觉语言模型性能对比。LLaVA 1.6 Mistral 7B 是源视觉语言模型,Japanese Stable VLM 是开源日语视觉语言模型。JA-VG-VQA-500 评估日语通用视觉问答能力,JA-VLM-Bench-In-the-Wild 则测试模型在日本文化语境下的复杂视觉问答处理能力。
Model 模型 | Size 模型规模 | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) |
---|---|---|---|
LLaVA 1.6 Mistral 7B | 8B | 14.3 | 41.1 |
Japanese Stable VLM | 8B | - | 40.5 |
Ours (PS) 我们的方法(PS) | 8B | 19.7 | 51.2 |
Ours (DFS) 我们的方法(DFS) | 12B | 16.8 | 46.5 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 11B | 20.4 | 47.6 |
TIES [64] | 8B | 16.0 | 46.5 |
DARE-TIES [66] | 8B | 9.4 | 36.0 |
Passthrough [22] 直通式[22] | 9B | 7.3 | 26.7 |
4 Discussion 4 讨论
In this report, we propose a general method that uses evolutionary techniques to efficiently discover the best ways to combine different models from the vast ocean of different open-source models with diverse capabilities. By working with the vast collective intelligence of existing open models, our method is able to automatically create new foundation models with desired capabilities specified by the user. We find that our approach is able to automatically discover novel ways to merge different models from vastly different domains (e.g., non-English language and Math, or non-English language and Vision), in non-trivial ways that might be difficult for human experts to discover themselves.
本报告中,我们提出了一种利用进化技术从海量不同能力的开源模型中高效发现最佳组合方式的通用方法。通过整合现有开放模型的群体智慧,我们的方法能根据用户指定的能力需求,自动创建具有目标特性的新基础模型。我们发现该方法能以非平凡方式自动发现跨领域模型(如非英语语言与数学,或非英语语言与视觉)的创新融合方案,这些方案即便是领域专家也难以手动设计。
To test our approach, we apply our method to automatically create a Japanese LLM capable of Math reasoning and a Japanese VLM aware of culturally-specific content. Surprisingly, we find that both models achieve state-of-the-art results on several LLM and vision benchmarks without being explicitly optimized for them, attaining top performance on a vast array of other Japanese LLM benchmarks and even exceeding the performance of some previous SOTA 70B-parameter Japanese LLMs.
为验证方法有效性,我们将其应用于自动创建具备数学推理能力的日语 LLM,以及具有文化特定内容识别能力的日语视觉语言模型(VLM)。令人惊讶的是,这两个模型在多项 LLM 和视觉基准测试中均达到最先进水平,且并未针对这些基准进行专门优化,同时在大量其他日语 LLM 基准测试中取得顶尖性能,甚至超越了某些先前 70B 参数的日语 SOTA 模型。
With these promising initial results, we believe we are just scratching the surface of the full capabilities of evolutionary model merging, and that this is the inception of a long-term effort to apply evolutionary principles to foundation model development.
基于这些初步成果,我们相信进化式模型融合的潜力才刚刚被发掘,这标志着将进化原理应用于基础模型开发的长期探索的开端。
Currently, we are already achieving promising results in applying evolutionary model merging to image diffusion models, enabling the creation of high performance cross-domain image generation models by merging existing building blocks in novel ways discovered by evolution.
目前,我们已在图像扩散模型中应用进化式模型融合技术,通过进化发现的新颖方式合并现有模块,成功创建了高性能跨领域图像生成模型。
The method currently requires the user to select a set of source models to use as ingredients for evolutionary search. We believe it is also possible to leverage evolution to search for candidate source models from a vast population of existing models as well. In addition to model selection, we are also exploring using evolution to produce swarms of diverse foundation models each with its own niche and behaviors. This holds the potential of enabling the emergence of a collective intelligence consisting of a swarm of models capable of self-improvement by continuously producing new complementary internal models of the world through interaction.
当前方法需要用户预先选择一组源模型作为进化搜索的素材。我们认为同样可以运用进化算法从海量现有模型中自动筛选候选源模型。除模型选择外,我们还在探索利用进化算法生成具有不同专长和行为模式的多样化基础模型集群。这将有望催生由模型群构成的集体智能,通过持续交互产生互补的世界内部模型,实现自我进化。
Related to our work is an experiment called Automerger [34], released at around the same time as this work. This interesting experiment works by selecting two random models from the top 20 models on the Open LLM Leaderboard [27] and randomly applying SLERP [61] or DARE-TIES [66, 64] to create new models. Over time, some of these models will perform well, or even better, on the benchmark tasks that define this leaderboard, and become part of the leaderboard. We predict this approach will lead to combinations of merged models that overfit to the benchmark tasks defined on the leaderboard. The author acknowledged that the idea behind this project was less about creating better models and more about getting more metrics to help derive a more principled approach to model merging.
与我们的工作相关的是一个名为 Automerger[34]的实验,该实验大约与本研究同期发布。这个有趣的实验从 Open LLM Leaderboard[27]排名前 20 的模型中随机选择两个模型,并随机应用 SLERP[61]或 DARE-TIES[66,64]来创建新模型。随着时间的推移,其中一些模型会在定义该排行榜的基准任务上表现良好,甚至更优,从而成为排行榜的一部分。我们预测这种方法将导致合并模型的组合对排行榜定义的基准任务过拟合。作者承认,该项目背后的理念与其说是为了创建更好的模型,不如说是为了获取更多指标,以帮助推导出更具原则性的模型合并方法。
Our work takes an orthogonal approach of optimizing for tasks outside of the domain specified by the original leaderboard [27], rather than being confined by it. As we have shown, surprisingly, stepping away from optimizing for a particular benchmark occasionally results in even greater generalization to numerous other benchmark tasks that we had not intended to optimize for, and such emergent generalization might be the key to unlocking the next great advancements in AI.
我们的工作采取了一种正交的方法,即针对原始排行榜[27]指定领域之外的任务进行优化,而不是受其限制。正如我们所展示的,令人惊讶的是,脱离对特定基准的优化偶尔会带来对我们未曾意图优化的众多其他基准任务更强的泛化能力,而这种突现的泛化可能是解锁人工智能下一重大进步的关键。
The ability to evolve new models with new emergent capabilities from a large variety of existing, diverse models has important implications. With the rising costs and resource requirements of training foundation models, large institutions or governments may leverage the rich variety of foundation models in the open-source ecosystem and consider the cheaper evolutionary approach for quickly developing proof-of-concept prototype models, before committing substantial capital or tapping into national resources to develop entirely custom models from scratch, if that is even needed at all.
从大量现有、多样化且具备不同能力的模型中演化出具有新突现能力的新模型,这一能力具有重要意义。随着训练基础模型的成本和资源需求不断上升,通过利用丰富的开源生态系统中多样化的基础模型,大型机构或政府可以考虑采用这种更经济的演化方法快速开发概念验证原型模型,然后再投入大量资金或动用国家资源从头开发完全定制的模型——如果这确实有必要的话。
Further Applications and Impact. After the release of the preprint version of this paper, researchers have explored evolutionary model merging in different domains, highlighting the method’s versatility and effectiveness.
A notable example is EvoSDXL [3], which applied evolutionary model merging to diffusion image generation models.
This demonstrates that our method works well not just for LLMs and VLMs, but for other types of models as well.
Moreover, what makes EvoSDXL particularly interesting is its success in merging SDXL-Lightning [37] with other standard SDXL fine-tunes. SDXL-Lightning is a specialized variant of SDXL that employs an adversarial loss during training, enabling rapid image generation in just a few steps, compared to the 50 or 100 steps typically required by standard diffusion models.
The evolutionary model merging technique effectively combined this unique model with conventional SDXL fine-tunes, despite the different protocols used in their development. This success illustrates that our method is capable of integrating models created through varying protocols, combining their strengths to create more robust and powerful models.
Additionally, other unique models such as EvoVLM-JP-v2 [5] and EvoUkiyoe [4] have also been developed using evolutionary model merging, further demonstrating the method’s potential and adaptability.
Moreover, after the publication of the preprint, evolutionary model merging was implemented in two well-known open-source software packages, MergeKit [22] and Optuna Hub [9]. The technique has thus become widely available, is being used in practice, and further possibilities are being explored.
进一步应用与影响。本文预印版发布后,研究人员已在不同领域探索进化模型融合方法,凸显了该技术的通用性和有效性。一个典型范例是 EvoSDXL[3],它将进化模型融合应用于扩散图像生成模型。这证明我们的方法不仅适用于 LLMs 和视觉语言模型,还能适配其他类型模型。更引人注目的是,EvoSDXL 成功将 SDXL-Lightning[37]与其他标准 SDXL 微调模型相融合。SDXL-Lightning 作为 SDXL 的专用变体,在训练中采用对抗性损失函数,仅需几步即可快速生成图像,而标准扩散模型通常需要 50 至 100 步。进化模型融合技术有效整合了这一特殊模型与传统 SDXL 微调模型,尽管两者的开发协议存在差异。这一成功案例表明,我们的方法能够整合不同协议创建的模型,融合其优势以构建更强大稳健的模型。此外,通过进化模型融合还开发了 EvoVLM-JP-v2[5]、EvoUkiyoe[4]等独特模型,进一步验证了该方法的潜力与适应性。预印版发表后,进化模型融合技术已被集成至 MergeKit[22]和 Optuna Hub[9]两大著名开源软件包,实现广泛普及和实际应用,相关探索仍在持续深化。
Limitations. We acknowledge that although our evolutionary model merging effectively integrates diverse expertise from the source models, it also inherits their limitations. For instance, we encountered instances where the merged models produced responses that lacked logical coherence. Additionally, this study does not encompass instruction fine-tuning or alignment, raising the potential for the models to yield outputs that may be factually flawed.
局限性。我们承认,虽然进化模型融合能有效整合源模型的多样化专长,但也会继承其固有缺陷。例如,我们发现某些融合模型生成的响应存在逻辑连贯性问题。此外,本研究未涉及指令微调或对齐环节,可能导致模型输出存在事实性错误。
Author Contributions. Takuya Akiba initiated the “Evolutionary Optimization of Model Merging Recipes” project, wrote the project design document, and initiated the parameter space model merging experiments, laying the groundwork for the methodology. Makoto Shing expanded the parameter space model merging to encompass vision-language models and diffusion models. Yujin Tang directed the efforts in data flow space model merging by incorporating ideas and methods inspired from the neural architecture search and morphology evolution literature, establishing foundational methods in this domain and in hybrid merging strategies. Qi Sun contributed to the implementation of our parameter space model merging framework and assisted in model evaluation. David Ha provided overarching guidance for the research project, offering technical insight, advice, feedback and writing.
作者贡献。Takuya Akiba 发起"进化优化模型融合配方"项目,撰写项目设计文档并启动参数空间模型融合实验,奠定方法论基础。Makoto Shing 将参数空间模型融合扩展至视觉语言模型与扩散模型领域。Yujin Tang 通过借鉴神经架构搜索和形态进化领域的思路与方法,主导数据流空间模型融合研究,在该领域及混合融合策略中建立基础方法。Qi Sun 参与实现参数空间模型融合框架并协助模型评估工作。David Ha 为研究项目提供全局指导,包括技术见解、建议、反馈和文稿撰写。
Ethical and Societal Impact. Evolutionary model merging offers significant positive societal impacts by enabling the creation of small yet highly capable models at lower costs. This approach democratizes access to advanced AI capabilities, potentially reducing the environmental footprint of AI development and deployment. By efficiently combining existing models, it can lead to more accessible and versatile AI solutions, particularly benefiting regions and languages with limited resources. However, as with other model development techniques, this approach may present certain considerations. The combination of diverse models could potentially lead to unexpected behaviors or biases, and the complexity of merged models might affect their interpretability. While these challenges are common in AI development, they underscore the importance of continued research and evaluation. It’s worth noting that the models and techniques presented in this work serve primarily as a proof of concept. For applications in mission-critical scenarios or models intended for wide public use, further verification and refinement of the methodology may be necessary. This ongoing improvement process is crucial for realizing the full potential of evolutionary model merging while ensuring responsible and ethical deployment.
伦理与社会影响。进化式模型融合通过以更低成本创建小型但高能力模型,为社会带来显著积极影响。该方法 democratizes 先进 AI 技术的获取途径,可能减少 AI 开发与部署的环境足迹。通过高效整合现有模型,可催生更易获取且多功能的 AI 解决方案,尤其惠及资源有限的地区和语言群体。然而与其他模型开发技术类似,该方法也存在需考量的因素:多样化模型的组合可能引发意外行为或偏见,合并模型的复杂性可能影响其可解释性。虽然这些挑战在 AI 发展中普遍存在,但凸显了持续研究与评估的重要性。值得注意的是,本研究提出的模型与技术主要作为概念验证。若应用于关键任务场景或面向公众广泛使用的模型,可能需要进一步验证和方法论优化。这种持续改进过程对于充分释放进化式模型融合潜力、同时确保负责任且符合伦理的部署至关重要。
Data Availability Statement
数据可用性声明
All datasets used in this paper are publicly available. Detailed sources are provided below:
本文使用的所有数据集均为公开资源,具体来源如下:
- LLM Experiments:
  LLM 实验:
  - Data for Optimization: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319.
    优化数据:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319
  - Data for Tests: Available on GitHub at https://github.com/openai/grade-school-math.
    测试数据:存放于 GitHub 仓库 https://github.com/openai/grade-school-math
- VLM Experiments:
  VLM 实验:
  - VLM Benchmark Datasets:
    VLM 基准数据集:
    - JA-VG-VQA-500: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500.
      JA-VG-VQA-500:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500
    - JA-VLM-Bench-In-the-Wild: Available on Hugging Face at https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild.
      JA-VLM-Bench-In-the-Wild:存放于 Hugging Face 平台 https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild
  - VLM Training Data for Optimization: This subset of the Japanese Visual Genome VQA dataset is also accessible on the JA-VG-VQA-500 page under the train split.
    优化用 VLM 训练数据:该日语 Visual Genome VQA 数据集子集可通过 JA-VG-VQA-500 页面的 train 分割获取
References
- AI [2023] Open AI. 2023. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf
- AI [2024a] Sakana AI. 2024a. EvoSDXL-JP-v1. https://sakana.ai/evosdxl-jp/.
- AI [2024b] Sakana AI. 2024b. EvoUkiyoe. https://sakana.ai/evo-ukiyoe/.
- AI [2024c] Sakana AI. 2024c. EvoVLM-JP-v2. https://sakana.ai/evovlm-jp/.
- AI [2024d] Stability AI. 2024d. Japanese Stable LM Beta. https://ja.stability.ai/blog/japanese-stable-lm-beta.
- AI [2024e] Stability AI. 2024e. JP Language Model Evaluation Harness. https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable.
- Aiello et al. [2023] Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. 2023. Jointly training large autoregressive multimodal models. arXiv preprint arXiv:2309.15564 (2023).
- Akiba et al. [2019] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2623–2631. https://doi.org/10.1145/3292500.3330701
- augmxnt [2023] augmxnt. 2023. shisa-gamma-7b. HuggingFace. https://hf.co/augmxnt/shisa-gamma-7b-v1
- AUTOMATIC1111 [2022] AUTOMATIC1111. 2022. Stable Diffusion WebUI. https://github.com/AUTOMATIC1111/stable-diffusion-webui.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV]
- Chern et al. [2023] Ethan Chern, Haoyang Zou, Xuefeng Li, Jiewen Hu, Kehua Feng, Junlong Li, and Pengfei Liu. 2023. Generative AI for Math: Abel. https://github.com/GAIR-NLP/abel.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. CoRR abs/2110.14168 (2021). arXiv:2110.14168 https://arxiv.org/abs/2110.14168
- Daheim et al. [2024] Nico Daheim, Thomas Möllenhoff, Edoardo Ponti, Iryna Gurevych, and Mohammad Emtiyaz Khan. 2024. Model Merging by Uncertainty-Based Gradient Matching. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=D7KJmfEDQP
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500 [cs.CV]
- Deb et al. [2002] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation 6, 2 (2002), 182–197.
- Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M Roy. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008 (2017).
- Gaier and Ha [2019] Adam Gaier and David Ha. 2019. Weight agnostic neural networks. Advances in neural information processing systems 32 (2019).
- Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.10256836
- Geva et al. [2022] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680 (2022).
- Goddard [2024] Charles O. Goddard. 2024. mergekit. https://github.com/arcee-ai/mergekit
- Ha et al. [2016] David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106 (2016).
- Hansen [2006] Nikolaus Hansen. 2006. The CMA evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms (2006), 75–102.
- Hochreiter and Schmidhuber [1994] Sepp Hochreiter and Jürgen Schmidhuber. 1994. Simplifying neural nets by discovering flat minima. Advances in neural information processing systems 7 (1994).
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Flat minima. Neural computation 9, 1 (1997), 1–42.
- HuggingFace [2023] HuggingFace. 2023. Open LLM Leaderboard. HuggingFace. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- Ilharco et al. [2022] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089 (2022).
- Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
- Joulin et al. [2016b] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016b. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
- Joulin et al. [2016a] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016a. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).
- Kaddour et al. [2022] Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt J Kusner. 2022. When do flat minima optimizers work? Advances in Neural Information Processing Systems 35 (2022), 16577–16595.
- Keskar et al. [2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2017. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations. https://openreview.net/forum?id=H1oyRlYgg
- Labonne [2024a] Maxime Labonne. 2024a. Automerger Experiment. Tweet Thread (2024). https://twitter.com/maximelabonne/status/1767124527551549860
- Labonne [2024b] Maxime Labonne. 2024b. Merge Large Language Models with mergekit. Hugging Face Blog (2024). https://huggingface.co/blog/mlabonne/merge-models
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597 [cs.CV]
- Lin et al. [2024] Shanchuan Lin, Anran Wang, and Xiao Yang. 2024. SDXL-Lightning: Progressive Adversarial Diffusion Distillation. CoRR abs/2402.13929 (2024). https://doi.org/10.48550/ARXIV.2402.13929 arXiv:2402.13929
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 [cs.CV]
- Liu et al. [2024] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
- Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual Instruction Tuning. arXiv:2304.08485 [cs.CV]
- Luo et al. [2023] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. CoRR abs/2308.09583 (2023). https://doi.org/10.48550/ARXIV.2308.09583 arXiv:2308.09583
- Matena and Raffel [2022] Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems 35 (2022), 17703–17716.
- Meng et al. [2022] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.
- nostalgebraist [2021] nostalgebraist. 2021. Interpreting GPT: The Logit Lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-03-08.
- Petzka et al. [2021] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. 2021. Relative Flatness and Generalization. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.). https://openreview.net/forum?id=sygvo7ctb_
- Raffel [2021] Colin Raffel. 2021. A call to build models like we build open-source software. https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html.
- Raffel [2023] Colin Raffel. 2023. Building machine learning models like open source software. Commun. ACM 66, 2 (2023), 38–40.
- Raymond [1999] Eric Raymond. 1999. The cathedral and the bazaar. Knowledge, Technology & Policy 12, 3 (1999), 23–49.
- Real et al. [2019] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, Vol. 33. 4780–4789.
- rinna [2024] rinna. 2024. LM Benchmark. https://rinnakk.github.io/research/benchmarks/lm/index.html.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
- Schmidhuber [1992] Jürgen Schmidhuber. 1992. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4, 1 (1992), 131–139.
- Shi et al. [2023] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=fR3wGCk-IXp
- Shimizu et al. [2018] Nobuyuki Shimizu, Na Rong, and Takashi Miyazaki. 2018. Visual Question Answering Dataset for Bilingual Image Understanding: A Study of Cross-Lingual Transfer Using Attention Maps. In Proceedings of the 27th International Conference on Computational Linguistics (Santa Fe, New Mexico, USA). Association for Computational Linguistics, 1918–1928. http://aclweb.org/anthology/C18-1163
- Shing and Akiba [2023] Makoto Shing and Takuya Akiba. 2023. Japanese Stable VLM. https://huggingface.co/stabilityai/japanese-stable-vlm
- So et al. [2019] David So, Quoc Le, and Chen Liang. 2019. The evolved transformer. In International conference on machine learning. PMLR, 5877–5886.
- Stanley and Miikkulainen [2002] Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary computation 10, 2 (2002), 99–127.
- Sun et al. [2024] Qi Sun, Marc Pickett, Aakash Kumar Nain, and Llion Jones. 2024. Transformer Layers as Painters. arXiv preprint arXiv:2407.09298 (2024).
- Sung et al. [2023] Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. 2023. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933 (2023).
- Tang et al. [2022] Yujin Tang, Yingtao Tian, and David Ha. 2022. EvoJAX: Hardware-Accelerated Neuroevolution. arXiv preprint arXiv:2202.05008 (2022).
- White [2016] Tom White. 2016. Sampling generative networks. arXiv preprint arXiv:1609.04468 (2016).
- Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning. PMLR, 23965–23998.
- Yadav et al. [2023a] Prateek Yadav, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023a. Compeft: Compression for communicating parameter efficient updates via sparsification and quantization. arXiv preprint arXiv:2311.13171 (2023).
- Yadav et al. [2023b] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. 2023b. TIES-Merging: Resolving Interference When Merging Models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (Eds.). http://papers.nips.cc/paper_files/paper/2023/hash/1644c9af28ab7916874f6fd6228a9bcf-Abstract-Conference.html
- Young et al. [2024] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652 (2024).
- Yu et al. [2024] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. arXiv:2311.03099 [cs.CL]
- Zoph and Le [2016] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
SUPPLEMENTARY INFORMATION
补充信息
Appendix A Evaluation Details
附录 A 评估细节
In Table 1, GPT-3.5 is gpt-3.5-turbo-0613, GPT-4 is gpt-4-0125-preview, Llama 2 70B is Llama-2-70b-chat, Japanese Stable LM 70B is japanese-stablelm-instruct-beta-70b and Swallow 70B is Swallow-70b-instruct-hf.
表 1 中,GPT-3.5 指 gpt-3.5-turbo-0613,GPT-4 指 gpt-4-0125-preview,Llama 2 70B 指 Llama-2-70b-chat,Japanese Stable LM 70B 指 japanese-stablelm-instruct-beta-70b,Swallow 70B 指 Swallow-70b-instruct-hf。
For the evaluation using the Japanese LM Evaluation Harness [7], we utilized Stability AI Japan’s fork of lm-eval-harness [20] and configured it according to their convention. This configuration is widely used and compatible with the results on their report [6] and Rinna leaderboards [50], thus allowing direct comparison of scores with a large number of Japanese LLMs.
在日本语言模型评估工具[7]的评测中,我们采用了 Stability AI Japan 分叉的 lm-eval-harness[20]并按其规范进行配置。该配置被广泛使用,且与其报告[6]及 Rinna 排行榜[50]中的结果兼容,从而能够直接与大量日语 LLMs 的评分进行横向对比。
Table 6 is the full version of Table 2, allowing for comparisons between our models and a wider range of models. Some of the numbers are from the Stability AI Japan report and the Rinna leaderboard. Despite having only 7B to 10B parameters, our models achieve higher scores than all existing <70B-parameter models and exhibit scores on par with the SOTA 70B models.
表 6 是表 2 的完整版本,便于将我们的模型与更广泛的模型进行对比。部分数据来自 Stability AI Japan 报告和 Rinna 排行榜。尽管参数量仅为 70 亿至 100 亿,我们的模型得分超越了所有现存<700 亿参数量的模型,并与当前最先进的 700 亿参数量模型表现相当。
Table 6: Breakdown of JP-LMEH scores for Japanese language ability (the full version of Table 2). JP-LMEH (Japanese Language Model Evaluation Harness) is a benchmark suite consisting of nine tasks, and the average score (Avg column) is used as an indicator of overall Japanese language ability.
表 6:日语能力 JP-LMEH 评分细目(表 2 完整版)。JP-LMEH(日语大模型评估套件)由 9 项任务构成基准测试,平均分(Avg 列)作为整体日语能力的衡量指标。
JP Language Model Evaluation Harness 日语大模型评估套件 |
Model | Size 大小 | JComQA | JNLI | MARC | JSQuAD | JAQKET | XLSum | XWino | MGSM | JCoLA | Avg 平均分 |
7B source models: 70 亿参数源模型: | |||||||||||
Mistral-7B-v0.1 | 7B | 74.0 | 44.2 | 63.3 | 85.7 | 69.8 | 22.8 | 69.9 | 20.4 | 51.1 | 55.7 |
shisa-gamma-7b-v1 | 7B | 91.2 | 72.1 | 94.6 | 73.9 | 68.0 | 25.9 | 80.5 | 29.6 | 58.7 | 66.1 |
WizardMath-7B-V1.1 | 7B | 74.7 | 42.7 | 90.4 | 84.6 | 68.5 | 22.3 | 69.8 | 38.8 | 48.9 | 60.1 |
Abel-7B-002 | 7B | 70.3 | 51.8 | 62.3 | 83.8 | 69.0 | 22.5 | 68.2 | 28.0 | 52.7 | 56.5 |
7B merged models: 70 亿参数融合模型: | |||||||||||
Ours (PS) | 7B | 89.1 | 65.7 | 95.4 | 89.5 | 77.7 | 25.5 | 81.2 | 50.0 | 60.5 | 70.5 |
Ours (DFS) 我们的方法(DFS) | 10B | 67.7 | 58.2 | 53.5 | 66.8 | 54.3 | 17.3 | 65.6 | 30.0 | 65.6 | 53.2 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 10B | 88.2 | 50.3 | 91.5 | 78.6 | 77.8 | 23.2 | 73.0 | 40.0 | 73.0 | 66.2 |
Ours (PS-A; see Appendix B) 我们的方法(PS-A;见附录 B) | 7B | 87.0 | 62.3 | 91.9 | 90.4 | 77.4 | 23.0 | 78.9 | 46.4 | 63.6 | 69.0 |
13B source models: 13B 源模型: | |||||||||||
Llama-2-13b-hf | 13B | 78.3 | 43.2 | 50.3 | 89.0 | 75.8 | 25.7 | 63.4 | 10.4 | 50.6 | 54.1 |
MetaMath-13B-V1.0 | 13B | 65.6 | 35.5 | 51.3 | 78.5 | 56.7 | 22.4 | 61.8 | 17.2 | 49.3 | 48.7 |
ELYZA-japanese-Llama-2-13b-instruct | 13B | 84.6 | 58.3 | 83.5 | 83.8 | 63.3 | 25.0 | 74.3 | 18.0 | 50.5 | 60.2 |
13B merged models: 13B 合并模型: | |||||||||||
Ours (PS) 我们的方法(PS) | 13B | 82.7 | 49.8 | 83.1 | 88.3 | 68.0 | 24.4 | 62.3 | 24.8 | 54.3 | 59.7 |
Ours (DFS) 我们的方法(DFS) | 21B | 62.1 | 36.3 | 55.6 | 76.4 | 56.7 | 15.9 | 60.8 | 6.4 | 49.6 | 46.6 |
Ours (PS+DFS) 我们的方法(PS+DFS) | 21B | 82.2 | 48.9 | 88.7 | 86.7 | 68.9 | 18.1 | 62.0 | 30.8 | 57.1 | 60.4 |
Other models: 其他模型: | |||||||||||
Swallow-70b-instruct-hf | 70B | 95.3 | 57.2 | 91.7 | 94.1 | 93.9 | 23.1 | 83.3 | 45.2 | 59.5 | 71.5 |
Swallow-70b-hf | 70B | 94.8 | 55.8 | 87.8 | 93.5 | 93.8 | 23.2 | 83.7 | 47.2 | 60.7 | 71.2 |
japanese-stablelm-base-beta-70b | 70B | 93.7 | 45.2 | 91.0 | 94.2 | 92.8 | 25.4 | 83.6 | 41.2 | 63.8 | 70.1 |
nekomata-14b-instruction | 14B | 93.7 | 57.5 | 90.0 | 93.6 | 89.0 | 25.2 | 77.4 | 35.2 | 67.3 | 69.9 |
japanese-stablelm-instruct-beta-70b | 70B | 91.2 | 50.4 | 92.9 | 87.1 | 88.4 | 24.3 | 82.0 | 37.2 | 61.7 | 68.3 |
nekomata-14b | 14B | 93.1 | 41.7 | 84.1 | 94.2 | 90.8 | 21.3 | 78.6 | 37.6 | 65.0 | 67.4 |
youri-7b-chat | 7B | 91.8 | 70.3 | 96.7 | 79.6 | 83.7 | 24.2 | 80.9 | 25.2 | 53.8 | 67.4 |
Llama-2-70b-hf | 70B | 89.3 | 53.0 | 70.4 | 93.2 | 89.5 | 23.8 | 79.8 | 40.0 | 65.4 | 67.2 |
nekomata-7b-instruction | 7B | 88.0 | 74.7 | 92.3 | 91.3 | 82.5 | 21.1 | 76.8 | 19.2 | 53.8 | 66.6 |
youri-7b-instruction | 7B | 88.8 | 63.6 | 93.8 | 92.2 | 83.9 | 24.7 | 78.9 | 17.2 | 54.0 | 66.3 |
Qwen-14B | 14B | 89.5 | 65.2 | 76.7 | 93.0 | 79.5 | 18.4 | 71.6 | 37.2 | 61.0 | 65.8 |
Swallow-MX-8x7b-NVE-v0.1 | 47B | 92.0 | 52.4 | 59.8 | 93.1 | 91.9 | 26.1 | 72.9 | 44.8 | 59.2 | 65.8 |
youri-7b-chat-gptq | 7B | 89.5 | 68.0 | 96.0 | 88.6 | 79.7 | 15.6 | 79.1 | 21.6 | 53.8 | 65.8 |
nekomata-7b-instruction-gguf | 7B | 87.9 | 68.4 | 89.6 | 89.9 | 79.8 | 20.8 | 77.3 | 19.2 | 55.3 | 65.3 |
nekomata-14b-instruction-gguf | 14B | 93.3 | 57.8 | 51.7 | 91.3 | 86.2 | 23.6 | 77.4 | 35.2 | 68.7 | 65.0 |
Llama-2-70b-chat-hf | 70B | 80.2 | 53.4 | 94.4 | 91.6 | 80.1 | 21.8 | 73.6 | 30.4 | 54.6 | 64.5 |
youri-7b-instruction-gptq | 7B | 88.1 | 62.0 | 93.5 | 90.6 | 74.4 | 15.3 | 78.2 | 15.2 | 53.3 | 63.4 |
japanese-stablelm-base-gamma-7b | 7B | 89.5 | 37.6 | 79.2 | 90.6 | 89.5 | 22.9 | 82.9 | 20.8 | 55.6 | 63.2 |
Swallow-13b-instruct-hf | 13B | 89.0 | 43.1 | 74.1 | 92.8 | 92.5 | 20.6 | 70.4 | 24.0 | 61.9 | 63.2 |
japanese-stablelm-instruct-gamma-7b | 7B | 91.4 | 40.7 | 72.9 | 89.9 | 88.2 | 22.8 | 82.2 | 17.2 | 56.4 | 62.4 |
Swallow-13b-hf | 13B | 86.0 | 44.8 | 74.5 | 92.2 | 92.2 | 21.6 | 71.4 | 19.2 | 59.5 | 62.4 |
nekomata-14b-gguf | 14B | 93.0 | 41.1 | 50.0 | 93.0 | 88.3 | 22.2 | 80.5 | 32.4 | 60.2 | 62.3 |
Swallow-MS-7b-v0.1 | 7B | 88.3 | 35.8 | 83.1 | 90.3 | 89.4 | 20.7 | 74.6 | 22.4 | 55.5 | 62.2 |
Swallow-7b-instruct-hf | 7B | 86.8 | 36.3 | 94.2 | 90.0 | 89.4 | 17.4 | 76.8 | 10.4 | 50.0 | 61.2 |
llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 13B | 83.2 | 74.5 | 74.9 | 94.0 | 85.1 | 9.6 | 71.3 | 3.2 | 50.4 | 60.7 |
Swallow-7b-hf | 7B | 80.2 | 37.9 | 93.1 | 89.2 | 90.1 | 18.3 | 76.6 | 10.4 | 50.0 | 60.7 |
nekomata-7b | 7B | 82.9 | 38.6 | 67.0 | 89.8 | 82.5 | 16.9 | 74.5 | 16.0 | 60.0 | 58.7 |
japanese-stablelm-instruct-beta-7b | 7B | 81.9 | 39.9 | 88.3 | 84.1 | 78.2 | 20.8 | 75.7 | 4.8 | 51.7 | 58.4 |
japanese-stablelm-base-beta-7b | 7B | 76.8 | 38.0 | 86.0 | 87.1 | 79.5 | 21.9 | 76.6 | 7.6 | 49.2 | 58.1 |
Qwen-7B | 7B | 76.0 | 46.9 | 83.3 | 89.7 | 69.2 | 13.8 | 60.6 | 22.0 | 57.3 | 57.6 |
youri-7b-gptq | 7B | 78.9 | 40.5 | 78.2 | 85.7 | 78.1 | 16.3 | 78.3 | 4.8 | 54.0 | 57.2 |
youri-7b | 7B | 82.8 | 35.9 | 90.1 | 57.2 | 82.5 | 20.5 | 78.0 | 6.8 | 58.5 | 56.9 |
ELYZA-japanese-Llama-2-7b-instruct | 7B | 79.2 | 38.8 | 89.9 | 83.2 | 68.7 | 16.8 | 71.1 | 11.2 | 50.5 | 56.6 |
weblab-10b-instruction-sft | 10B | 63.6 | 42.7 | 86.4 | 87.8 | 73.9 | 23.1 | 72.4 | 2.8 | 50.5 | 55.9 |
ELYZA-japanese-Llama-2-7b | 7B | 75.2 | 36.1 | 87.3 | 86.4 | 68.0 | 16.3 | 72.0 | 5.2 | 53.1 | 55.5 |
nekomata-7b-gguf | 7B | 82.3 | 33.7 | 50.0 | 86.8 | 74.9 | 19.9 | 78.3 | 13.6 | 55.8 | 55.0 |
japanese-stablelm-instruct-ja_vocab-beta-7b | 7B | 79.9 | 39.2 | 79.5 | 82.0 | 78.9 | 9.9 | 68.9 | 6.0 | 49.2 | 54.8 |
japanese-stablelm-base-ja_vocab-beta-7b | 7B | 69.9 | 38.8 | 80.0 | 83.8 | 80.1 | 7.5 | 72.5 | 3.6 | 50.0 | 54.0 |
calm2-7b-chat | 7B | 70.3 | 33.7 | 87.7 | 82.6 | 77.8 | 5.4 | 70.9 | 5.6 | 51.9 | 54.0 |
stockmark-13b | 13B | 70.7 | 36.2 | 56.5 | 87.2 | 87.3 | 7.7 | 73.4 | 9.6 | 50.0 | 53.2 |
Llama-2-7b-hf | 7B | 65.9 | 35.1 | 83.2 | 82.7 | 70.6 | 19.0 | 65.2 | 6.8 | 47.1 | 52.9 |
calm2-7b | 7B | 58.2 | 33.3 | 81.2 | 81.8 | 81.9 | 2.4 | 73.0 | 5.6 | 53.3 | 52.3 |
plamo-13b | 13B | 54.5 | 33.8 | 85.0 | 78.6 | 74.5 | 12.9 | 70.5 | 2.4 | 54.5 | 51.9 |
weblab-10b | 10B | 44.0 | 30.7 | 64.0 | 83.3 | 74.6 | 20.3 | 71.4 | 2.0 | 50.8 | 49.0 |
llm-jp-13b-v1.0 | 13B | 39.6 | 35.0 | 67.4 | 79.3 | 78.0 | 10.9 | 69.5 | 2.0 | 54.4 | 48.5 |
plamo-13b-instruct-nc | 13B | 45.8 | 35.3 | 91.6 | 75.2 | 63.5 | 5.8 | 65.2 | 1.2 | 50.6 | 48.3 |
plamo-13b-instruct | 13B | 42.5 | 34.9 | 90.0 | 77.3 | 67.0 | 5.4 | 62.9 | 2.0 | 50.0 | 48.0 |
Appendix B Evolving for License Specific Open-Source Models
附录 B 针对特定开源许可证模型的演化方案
In the main section, our EvoLLM-JP results were evolved using models found on HuggingFace. However, some of the models used, in particular WizardMath-7B-V1.1 [41], have been released under a Non-Commercial, Research-only Microsoft License, which is not truly open-source. Therefore, our release of EvoLLM-JP is also under a Non-Commercial, Research-only License, to be consistent with the WizardMath-7B-V1.1 model.
在主体部分,我们的 EvoLLM-JP 成果是通过 HuggingFace 上的模型演化而来。但其中部分模型(特别是 WizardMath-7B-V1.1[41])采用的是非商业、仅供研究的微软许可证,并非真正的开源协议。因此,为与 WizardMath-7B-V1.1 模型保持一致,我们发布的 EvoLLM-JP 同样采用非商业、仅供研究的许可证。
As researchers who benefited from the open-source community, we would like the models that we release to also be under an open-source license. In the spirit of open-source, and to showcase the applicability of our method to tackle even challenging issues like model licenses, we ran a similar experiment where we incorporated only models that have been released under a true open-source license, such as MIT or Apache 2.0, and produced a similarly performing model called EvoLLM-JP-A, which we will release under Apache 2.0. This model will also be released on our GitHub: https://github.com/SakanaAI/.
作为开源社区的受益者,我们更希望发布的模型采用开源许可证。本着开源精神,并为了证明我们的方法能解决包括模型许可证在内的复杂问题,我们进行了类似实验——仅合并采用 MIT 或 Apache 2.0 等真正开源协议的模型,最终产生了性能相近的 EvoLLM-JP-A 模型,该模型将通过 Apache 2.0 协议发布,相关代码将发布于 GitHub:https://github.com/SakanaAI/。
Specifically, our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel-7B-002, all of which are under MIT or Apache 2.0 License.
The MGSM-JA score measured using the protocol described in Section 3.1 is 52.4, and the Japanese Language Model Evaluation Harness score is 69.0.
We have included results of this Apache 2.0-licensed model for comparison in Table 6, which provides a more comprehensive comparison than Table 2 in the main text.
具体而言,EvoLLM-JP-A 由 shisa-gamma-7b-v1、Arithmo2-Mistral-7B 和 Abel-7B-002 合并而成(三者均采用 MIT 或 Apache 2.0 许可证)。根据 3.1 节所述协议测量,其 MGSM-JA 得分为 52.4,日语模型评估基准得分为 69.0。我们已将这款 Apache 2.0 许可模型的对比结果纳入表 6,该表提供了比正文表 2 更全面的性能比较。
Appendix C Case Study 附录 C 案例研究
Table 7 provides an example of responses to a mathematical question by existing models and our model. By merging a Japanese language model, the resulting model not only improves its capability to understand and use Japanese in reading and writing but also expands its knowledge about Japan. This example requires both mathematical reasoning and Japanese-specific knowledge, specifically that Setsubun is the day before the beginning of spring (Risshun). Notably, only our merged model provides the correct answer. Even when the question is translated into English, the English math model WizardMath-7B-V1.1 fails to answer correctly, demonstrating that our merged model is superior to a combination of machine translation and an English math model.
表 7 展示了现有模型与我们的模型对数学问题的响应示例。通过合并日语模型,我们不仅提升了日文读写能力,还扩展了关于日本的文化知识。该案例需要同时运用数学推理和日本特定知识(需知"节分"是立春前日)。值得注意的是,只有我们的合并模型给出了正确答案。即使将问题翻译成英文,英文数学模型 WizardMath-7B-V1.1 仍无法正确回答,这证明我们的合并模型优于"机器翻译+英文数学模型"的组合方案。
It is worth noting that in Table 7, we also provide the answer from the general-purpose math model WizardMath-7B-V1.1 with the question translated into English. However, even with the translated question, WizardMath-7B-V1.1 fails to provide the correct answer. This highlights the fact that simply translating the question is not sufficient to solve the problem, as it also requires an understanding of the Japanese cultural context. Even if one were to consider a combination of machine translation and an English math model as an alternative approach, it would still fall short in capturing the nuances and context-specific knowledge required to answer the question correctly. This demonstrates the unique value of our merged model, which effectively combines mathematical reasoning capabilities with Japanese language understanding and cultural knowledge.
值得注意的是,在表 7 中我们还提供了通用数学模型 WizardMath-7B-V1.1 对英文翻译版本问题的回答结果。然而即便将问题翻译为英文,WizardMath-7B-V1.1 仍未能给出正确答案。这凸显了一个事实:仅翻译问题并不足以解决问题,因为理解日本文化语境同样至关重要。即便考虑将机器翻译与英文数学模型结合的替代方案,这种方法仍然无法捕捉到正确回答问题所需的细微差别和特定语境知识。这充分证明了我们融合模型的独特价值——它有效地将数学推理能力与日语理解及文化知识相结合。
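To make the expected reasoning explicit, the ground-truth answer of 9 follows directly from the fact that Setsubun is the day before Risshun; a minimal check of the arithmetic:

```python
# Ground-truth arithmetic for the question in Table 7.
# Setsubun always falls on the day before Risshun.
last_year_risshun = (2, 3)   # February 3
this_year_risshun = (2, 4)   # February 4

last_year_setsubun = (last_year_risshun[0], last_year_risshun[1] - 1)  # February 2
this_year_setsubun = (this_year_risshun[0], this_year_risshun[1] - 1)  # February 3

answer = sum(last_year_setsubun) + sum(this_year_setsubun)
print(answer)  # 2 + 2 + 2 + 3 = 9
```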
On the other hand, Table 8 highlights an instance where the LLMs failed to solve a problem correctly, using a question from the MGSM test set. The problem involved calculating the internal volume of boxes. When presented in Japanese, all models encountered difficulties in correctly accounting for the wall thickness. EvoLLM-JP-v1-7B attempted to adjust for it but made an arithmetic error, leading to an incorrect volume of 351 cubic inches. shisa-gamma-7b-v1 ignored the wall thickness entirely, calculating the external dimensions only, and thus arrived at 360 cubic inches. WizardMath-7B-V1.1 recognized the need for an adjustment but applied it incorrectly, resulting in 288 cubic inches. Notably, when the original English problem was presented to WizardMath-7B-V1.1, it produced the correct answer of 72 cubic inches.
This suggests that there may still be some loss in transferring the model’s high mathematical reasoning abilities from English to Japanese.
另一方面,表 8 展示了 LLMs 在使用 MGSM 测试集问题时未能正确解决的案例。该问题涉及计算盒子的内部容积。当日语呈现问题时,所有模型在正确计算箱壁厚度时都遇到了困难。EvoLLM-JP-v1-7B 尝试进行调整但犯下算术错误,导致得出 351 立方英寸的错误结果;shisa-gamma-7b-v1 完全忽略箱壁厚度,仅计算外部尺寸从而得到 360 立方英寸;WizardMath-7B-V1.1 虽然意识到需要调整却应用错误,最终输出 288 立方英寸。尤为值得注意的是,当原始英文问题呈现给 WizardMath-7B-V1.1 时,该模型正确输出了 72 立方英寸的答案。这表明模型从英语到日语转移其高水平数学推理能力时仍可能存在某些损耗。
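For reference, the correct calculation for the Table 8 problem subtracts twice the wall thickness from each outer dimension before multiplying; a minimal check of the ground-truth answer:

```python
# Ground-truth arithmetic for the question in Table 8.
# Each box is 5 x 6 x 4 inches on the outside with 1-inch-thick walls,
# so every inner dimension loses 2 inches (one wall on each side).
outer_dims = (5, 6, 4)
wall_thickness = 1
num_boxes = 3

inner_dims = [d - 2 * wall_thickness for d in outer_dims]      # [3, 4, 2]
inner_volume = inner_dims[0] * inner_dims[1] * inner_dims[2]   # 24 cubic inches
total_inner_volume = num_boxes * inner_volume
print(total_inner_volume)  # 72 cubic inches
```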
Additionally, Table 9 provides the case studies for our VLM. We observe that our VLM successfully obtained the knowledge, cultural understanding, and fluency in Japanese expression of the Japanese LLM. Compared to the Japanese VLM (JSVLM), our VLM generally describes photos more precisely, as in the 2nd example of a deer photo. We also found that the original VLM (LLaVA-1.6-Mistral-7B) hallucinates heavily in Japanese: for example, it answers that the color of the traffic light is green, while our VLM answers correctly.
此外,表 9 提供了我们 VLM 的案例分析。我们观察到,我们的 VLM 成功获得了日语 LLM 的知识、文化理解以及日语表达的流畅性。相较于日本 VLM(JSVLM),我们的 VLM 通常能更精确地描述图像,例如第二个梅花鹿照片案例。我们还发现原始 VLM(LLaVA-1.6-Mistral-7B)在日语环境下存在大量幻觉现象,例如错误回答交通信号灯颜色为绿色,而我们的 VLM 则能正确应答。
The first example features a photo of Koi-nobori, a cultural tradition specific to Japan. LLaVA-1.6-Mistral-7B, lacking sufficient knowledge of Japanese culture, failed to answer Koi-nobori correctly. In contrast, both Japanese-Stable-VLM and EvoVLM-JP, possessing knowledge of Japanese culture, successfully identified the Koi-nobori. Notably, EvoVLM-JP’s response demonstrates greater fluency in its explanation.
第一个案例展示了日本特有的鲤鱼旗文化传统照片。缺乏足够日本文化知识的 LLaVA-1.6-Mistral-7B 未能正确识别鲤鱼旗。相比之下,具备日本文化知识的 Japanese-Stable-VLM 和 EvoVLM-JP 都成功识别出了鲤鱼旗。特别值得注意的是,EvoVLM-JP 的回答在解释时展现出更流畅的语言表达。
The third example involves a question about the color of traffic lights. In Japan, the green traffic light is called "blue" because historically, the word "blue" referred to a range of colors, including blue and green, and also because the initial traffic lights used in Japan had a bluish-green hue, leading to the continuation of this terminology today. EvoVLM-JP is the only model that correctly identifies the traffic light color as "blue" in Japanese, demonstrating its accurate understanding of this cultural and linguistic nuance. LLaVA-1.6-Mistral-7B consistently but incorrectly labels the light as "green" in Japanese, revealing its lack of specific cultural knowledge. Japanese-Stable-VLM shows inconsistent responses, alternating between "green," "green and white," and "white," indicating confusion. These results highlight EvoVLM-JP’s superior grasp of Japanese language conventions and cultural context in this specific scenario.
第三个例子涉及交通灯颜色的问题。在日本,绿色交通灯被称为"蓝色",因为历史上"蓝色"一词涵盖包括蓝色和绿色在内的多种颜色,且日本最初使用的交通灯带有蓝绿色调,这一术语因此延续至今。EvoVLM-JP 是唯一能正确识别日语中交通灯颜色为"蓝色"的模型,展现了其对这种文化和语言细微差异的精准理解。LLaVA-1.6-Mistral-7B 始终错误地将灯光标记为日语中的"绿色",显示出其缺乏特定文化知识。Japanese-Stable-VLM 则表现出不一致的回答,在"绿色"、"绿白色"和"白色"之间摇摆,表明其存在混淆。这些结果凸显了 EvoVLM-JP 在此特定场景下对日本语言习惯和文化背景的卓越把握。
Table 7: Case study of EvoLLM-JP-v1-7B. This study presents responses by LLMs to a question that requires both Japan-specific knowledge and mathematical reasoning. 表 7:EvoLLM-JP-v1-7B 案例研究。本研究展示了 LLMs 对需要日本特定知识和数学推理能力的问题的回答。
Problem 问题 | |
JA Instruction 日语指令 |
去年の立春は2月3日でした。今年の立春は2月4日です。去年の節分の月、去年の節分の日、今年の節分の月、今年の節分の日を足し合わせると、いくつになりますか? |
EN Instruction 英语指令 | Last year, Risshun was on February 3rd. This year, it is on February 4th. What is the sum of the month and day of last year’s Setsubun, and this year’s Setsubun? |
Ground Truth Answer 标准答案 | 9 |
Note 注释 |
Setsubun, the eve of the first day of spring, always falls on the day before Risshun, which marks the start of spring according to the lunar calendar. The date of Risshun can vary annually. Setsubun(立春前夜)总是落在 Risshun(标志着农历春季开始)的前一天。Risshun 的日期每年可能有所不同。 |
Answers 回答 | |
EvoLLM-JP-v1-7B JA Instruction 日语指令 |
去年の立春は2月3日で、今年の立春は2月4日です。
去年の節分は立春の前日、つまり2月2日でした。
今年の節分は立春の前日、つまり2月3日です。
去年の節分の月は2月、去年の節分の日は2日で、合計は2 + 2 = 4です。
今年の節分の月は2月、今年の節分の日は3日で、合計は2 + 3 = 5です。
足し合わせると、4 + 5 = 9になります。 #### 9 The answer is: 9. 答案是:9。 |
shisa-gamma-7b JA Instruction JA 指令 |
{CJK}UTF8ipxmまず、去年の立春の日付を計算してみましょう。 首先,计算去年立春的日期。 去年の立春の日付 = 2021年の2月3日 去年立春的日期 = 2021 年 2 月 3 日 次に、去年の節分の日付を計算してみましょう。 接着,计算去年节分的日期。 去年の節分の日付 = 2021年の2月2日 去年节分的日期 = 2021 年 2 月 2 日 次に、今年の立春の日付を計算してみましょう。 然后,计算今年立春的日期。 今年の立春の日付 = 2022年の2月4日 今年立春的日期 = 2022 年 2 月 4 日 次に、今年の節分の日付を計算してみましょう。 接着,计算今年节分的日期。 今年の節分の日付 = 2022年の2月1日 今年节分的日期 = 2022 年 2 月 1 日 最後に、去年の節分の月と日、今年の節分の月と日を足し合わせてみましょう。 最后,将去年节分的月日与今年节分的月日相加。 去年の節分の月と日 = 2021年の2月2日 今年の節分の月と日 = 2022年の2月1日 去年节分的月日 = 2021 年 2 月 2 日今年节分的月日 = 2022 年 2 月 1 日 これで、足し合わせることができます。 这样,就可以进行相加了。 足し合わせた月と日 = 2021年の2月2日 + 2022年の2月1日 = 2023年の2月3日 相加后的月日 = 2021 年 2 月 2 日 + 2022 年 2 月 1 日 = 2023 年 2 月 3 日 つまり、去年の節分の月と日、今年の節分の月と日を足し合わせると、2023年の2月3日になります。 也就是说,将去年节分的月日与今年节分的月日相加,得到的是 2023 年 2 月 3 日。 |
WizardMath-7B-V1.1 JA Instruction JA 指令 |
{CJK}UTF8ipxmStep 1: Identify the dates for the given events. 第一步:确定给定事件的日期。 - Last year’s Setsubun (February 3) - 去年的节分(2 月 3 日) - This year’s Setsubun (February 4) - 今年的节分(2 月 4 日) Step 2: Determine the dates for the corresponding events in the previous and next years. 第二步:确定前一年和后一年对应事件的日期。 - Last year’s Setsubun (February 3) falls on a Monday. - 去年的节分(2 月 3 日)是星期一。 - This year’s Setsubun (February 4) falls on a Tuesday. - 今年的节分(2 月 4 日)是星期二。 Step 3: Calculate the number of days between the corresponding events. 第三步:计算对应事件之间的天数。 - The number of days between last year’s and this year’s Setsubun is 1 day (Tuesday - Monday = 1 day). - 去年和今年节分之间的天数为 1 天(星期二 - 星期一 = 1 天)。 Step 4: Determine the dates for the corresponding events in the lunar calendar. 第四步:确定农历中对应事件的日期。 - Last year’s Setsubun was on the 1st day of the 2nd month. - 去年的节分是在农历二月初一。 - This year’s Setsubun will be on the 2nd day of the 2nd month. - 今年的节分将在农历二月初二。 Step 5: Calculate the number of days between the corresponding events in the lunar calendar. 第五步:计算农历中对应事件之间的天数。 - The number of days between last year’s and this year’s Setsubun in the lunar calendar is 1 day (2nd day - 1st day = 1 day). - 农历中去年和今年节分之间的天数为 1 天(初二 - 初一 = 1 天)。 Step 6: Add the number of days between the corresponding events in both calendars. 第六步:将两种日历中对应事件之间的天数相加。 - The total number of days between last year’s and this year’s Setsubun is 1 (from the solar calendar) + 1 (from the lunar calendar) = 2 days. - 去年和今年节分之间的总天数为 1(公历)+ 1(农历)= 2 天。 Step 7: Write the final answer. 第七步:写出最终答案。 - The sum of the dates for the corresponding events is 2. 对应事件的日期之和为 2。 The answer is: 2. 答案是:2。 |
WizardMath-7B-V1.1 EN Instruction 英文指令 |
{CJK}UTF8ipxmWe are given that last year, Risshun (the beginning of spring according to the lunisolar Japanese calendar) was on February 3rd, and this year it is on February 4th. We need to find the sum of the month and day of last year’s Setsubun, and this year’s Setsubun. {CJK}UTF8ipxm 已知去年日本的立春(根据阴阳历)是在 2 月 3 日,今年则在 2 月 4 日。我们需要计算去年和今年节分的月份与日期之和。 Step 2: Identify the month and day of last year’s Setsubun. Last year, Risshun was on February 3rd. So, the month is February (2) and the day is 3. 步骤 2:确定去年节分的月份和日期。去年立春是 2 月 3 日,因此月份为 2 月(2),日期为 3。 Step 3: Identify the month and day of this year’s Setsubun. This year, Risshun is on February 4th. So, the month is February (2) and the day is 4. 步骤 3:确定今年节分的月份和日期。今年立春是 2 月 4 日,因此月份为 2 月(2),日期为 4。 Step 4: Calculate the sum of the month and day of last year’s and this year’s Setsubun. The sum of the month and day of last year’s Setsubun is 2 + 3 = 5. The sum of the month and day of this year’s Setsubun is 2 + 4 = 6. 步骤 4:计算去年和今年节分的月份与日期之和。去年节分的和为 2 + 3 = 5,今年节分的和为 2 + 4 = 6。 Step 5: Add the two sums together. The total sum of the month and day of last year’s and this year’s Setsubun is 5 + 6 = 11. 步骤 5:将两个和相加。去年和今年节分的总和为 5 + 6 = 11。 So, the sum of the month and day of last year’s Setsubun, and this year’s Setsubun is 11. 因此,去年和今年节分的月份与日期之和为 11。 The answer is: 11. 答案是:11。 |
Table 8: Case study of EvoLLM-JP-v1-7B. This study presents an example of a Japanese math problem that the LLMs failed to solve. 表 8:EvoLLM-JP-v1-7B 的案例研究。本研究展示了一个 LLMs 无法解决的日本数学问题案例。
Problem 问题 | |
JA Instruction 日文指令 |
ジョンは3箱持っています。それぞれの箱は5インチ×6インチ×4インチです。壁は1インチの暑さです。3箱全部の内容量はどれくらいですか? 约翰有三个盒子,每个盒子的尺寸为 5 英寸×6 英寸×4 英寸,壁厚为 1 英寸。三个盒子的总容量是多少? |
EN Instruction 英文指令 |
John has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the total inner volume of all 3 boxes? 约翰有 3 个箱子。每个箱子的尺寸为 5 英寸×6 英寸×4 英寸。箱壁厚度为 1 英寸。3 个箱子的总内部容积是多少? |
Ground Truth Answer 真实答案 | 72 |
Answers 答案 | |
EvoLLM-JP-v1-7B JA Instruction 日文指令 |
{CJK}UTF8ipxm1箱の内容量は、長さ×幅×高さ = 5インチ×6インチ×4インチ = 120立方インチです。
ジョンは3箱持っているので、3箱の内容量は、3×120立方インチ = 360立方インチです。
しかし、壁の厚みは1インチなので、実際の内容量は、360立方インチ - 3箱×1インチ×3面 = 360立方インチ - 9立方インチ = 351立方インチです。 {CJK}UTF8ipxm 单箱内容积为长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸。由于约翰有 3 个箱子,3 箱总内容积为 3×120 立方英寸=360 立方英寸。但箱壁厚度为 1 英寸,因此实际内容积需扣除箱体部分,即 360 立方英寸-3 箱×1 英寸×3 面=360 立方英寸-9 立方英寸=351 立方英寸。 #### 351 指示: 351 指示:351 |
shisa-gamma-7b JA Instruction 日文指令 |
{CJK}UTF8ipxmジョンは3箱持っています。それぞれの箱は5インチ×6インチ×4インチです。壁は1インチの暑さです。3箱全部の内容量はどれくらいですか? [/INST] 3つの箱の体積の合計を求めるには、それぞれの箱の体積を計算してからそれらを合計する必要があります。 {CJK}UTF8ipxm 约翰有 3 个箱子。每个箱子的尺寸为 5 英寸×6 英寸×4 英寸。箱壁厚度为 1 英寸。3 个箱子的总内容积是多少?[/INST] 计算三个箱子的总体积需要先分别计算每个箱子的体积再进行加总。 箱1の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 1 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 箱2の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 2 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 箱3の体積 = 長さ * 幅 * 高さ = 5インチ * 6インチ * 4インチ = 120立方インチ 箱 3 体积=长×宽×高=5 英寸×6 英寸×4 英寸=120 立方英寸 これで、3つの箱の体積を合計することができます。 现将三个箱子的体积进行合计 合計体積 = 箱1の体積 + 箱2の体積 + 箱3の体積 = 120立方インチ + 120立方インチ + 120立方インチ = 360立方インチ 总体积=箱 1 体积+箱 2 体积+箱 3 体积=120 立方英寸+120 立方英寸+120 立方英寸=360 立方英寸 つまり、3つの箱の合計体積は360立方インチです。 也就是说,三个箱子的总体积为 360 立方英寸。 |
WizardMath-7B-V1.1 JA Instruction 日语指令 |
{CJK}UTF8ipxmLet’s think step by step. {CJK}UTF8ipxm 让我们一步步思考。 Step 1: Calculate the volume of one box. The volume of a box is calculated by multiplying its length, width, and height. In this case, the dimensions of the box are 5 inches by 6 inches by 4 inches. So, the volume of one box is 5 inches * 6 inches * 4 inches = 120 cubic inches. 第一步:计算一个箱子的体积。箱子的体积通过长、宽、高相乘得出。本例中箱子尺寸为 5 英寸×6 英寸×4 英寸,因此单个箱子体积为 5 英寸×6 英寸×4 英寸=120 立方英寸。 Step 2: Calculate the total volume of all boxes. John has 3 boxes, so the total volume of all boxes is 3 boxes * 120 cubic inches/box = 360 cubic inches. 第二步:计算所有箱子的总体积。约翰有 3 个箱子,因此总体积为 3 箱×120 立方英寸/箱=360 立方英寸。 Step 3: Adjust for the thickness of the walls. The walls of the boxes add an extra inch to each dimension, so we need to subtract the volume of the walls from the total volume of the boxes. The volume of the walls of one box is 1 inch * 6 inches * 4 inches = 24 cubic inches. Since there are 3 boxes, the total volume of the walls is 3 boxes * 24 cubic inches/box = 72 cubic inches. 第三步:调整箱壁厚度的影响。箱壁使每个维度增加 1 英寸,需从总体积中扣除箱壁体积。单个箱子的箱壁体积为 1 英寸×6 英寸×4 英寸=24 立方英寸。3 个箱子的箱壁总体积为 3 箱×24 立方英寸/箱=72 立方英寸。 Step 4: Subtract the volume of the walls from the total volume of the boxes. The adjusted total volume of the boxes is 360 cubic inches - 72 cubic inches = 288 cubic inches. 第四步:从箱子总体积中减去箱壁体积。调整后的箱子总体积为 360 立方英寸-72 立方英寸=288 立方英寸。 So, the total volume of the contents of the 3 boxes is 288 cubic inches. 因此,3 个箱子的内部内容物体积为 288 立方英寸。 The answer is: 288 cubic inches. 答案为:288 立方英寸。 |
WizardMath-7B-V1.1 EN Instruction 英语指令 |
{CJK}UTF8ipxmJohn has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the total inner volume of all 3 boxes? {CJK}UTF8ipxm 约翰有 3 个箱子,每个尺寸为 5 英寸×6 英寸×4 英寸,箱壁厚 1 英寸。问:3 个箱子的内部总容积是多少? ### Response: Let’s think step by step.First, we need to find the inner dimensions of one box. Since the walls are 1 inch thick, we need to subtract 2 inches from each dimension (1 inch from each side of the wall). ### 回答步骤:首先计算单个箱子的内部尺寸。由于箱壁厚 1 英寸,需从每个维度减去 2 英寸(两侧箱壁各 1 英寸)。 The original dimensions of the box are 5 inches by 6 inches by 4 inches. So, the inner dimensions will be: 原始箱子尺寸为 5 英寸×6 英寸×4 英寸,因此内部尺寸为: - Length: 5 inches - 2 inches = 3 inches - Width: 6 inches - 2 inches = 4 inches - Height: 4 inches - 2 inches = 2 inches - 长度:5 英寸-2 英寸=3 英寸 - 宽度:6 英寸-2 英寸=4 英寸 - 高度:4 英寸-2 英寸=2 英寸 Now, we can calculate the inner volume of one box by multiplying the inner dimensions: 现在,我们可以通过将内部尺寸相乘来计算一个盒子的内部容积: Inner Volume = Length × Width × Height Inner Volume = 3 inches × 4 inches × 2 inches Inner Volume = 24 cubic inches 内部容积 = 长度 × 宽度 × 高度 内部容积 = 3 英寸 × 4 英寸 × 2 英寸 内部容积 = 24 立方英寸 Since John has 3 boxes, we need to find the total inner volume of all 3 boxes: 由于 John 有 3 个盒子,我们需要计算这 3 个盒子的总内部容积: Total Inner Volume = Inner Volume of one box × Number of boxes Total Inner Volume = 24 cubic inches × 3 Total Inner Volume = 72 cubic inches 总内部容积 = 单个盒子的内部容积 × 盒子数量 总内部容积 = 24 立方英寸 × 3 总内部容积 = 72 立方英寸 So, the total inner volume of all 3 boxes is 72 cubic inches. 因此,所有 3 个盒子的总内部容积为 72 立方英寸。 The answer is: 72. 答案是:72。 |
Table 9: Case study of EvoVLM-JP. This study presents responses by VLMs to questions that require Japan-specific knowledge and visual question answering. 表 9:EvoVLM-JP 的案例研究。本研究展示了视觉语言模型在需要日本特定知识和视觉问答能力的问题上的回答。
Example 1 示例 1 | |
[Image: photo of Koi-nobori carp streamers] |
|
User 用户 |
この画像に映っている魚の形をした飾りは、日本では何と呼ばれていますか?
(What is the fish-shaped decoration in this image called in Japan?) (这张图片中鱼形装饰在日本被称为什么?) |
LLaVA-1.6-Mistral-7B |
この画像に映っている魚の形をした飾りは、日本では「花魚」と呼ばれています。
(The fish-shaped decoration in this image is called "hana-gyo" in Japan.) (这张图片中的鱼形装饰在日本被称为“花鱼”。) |
Japanese Stable VLM | こいのぼり (Koi-nobori) |
EvoVLM-JP |
この画像に映っている魚の形をした飾りは、日本では「鯉のぼり」と呼ばれています。 这张图片中鱼形装饰物在日本被称为"鲤帜"。 (The fish-shaped decoration in this image is called "Koi-nobori" in Japan.) |
Example 2 示例 2 | |
[Image: photo of a small deer standing on fallen autumn leaves] |
|
User 用户 |
写真を説明してください。 请描述这张照片。 (Please explain this photo.) |
LLaVA-1.6-Mistral-7B |
{CJK}UTF8ipxmこの写真では、自然の中で小さな動物が歩いています。その動物は、犬のような外貌を持っていますが、実際には犬ではありません。この動物は、野生動物として知られていますが、具体的な種類はわかりません。 背景には、秋の季節になっていることがわかります。多くの樹木には、黄色い葉が落ちています。このような環境で、動物が歩いていることを見ることができます。 この写真では、動物が自然の中で活動していることを表現しています。同時に、秋の季節になっていることを伝えることもできます。 {CJK}UTF8ipxm 照片中一只小型动物正在自然环境中行走。该动物外形似犬,但实际并非家犬。作为野生动物其具体种类难以辨别。背景可见时值秋季,众多树木黄叶飘落。在此秋景中可见动物漫步的身影。这张照片既展现了野生动物在自然中的活动状态,也传递出浓浓的秋日气息。 (In this photo, a small animal is walking in nature. The animal has a canine-like appearance but is not actually a dog. This creature is known as a wild animal, but the specific species is unknown. In the background, it is evident that it is autumn. Many trees have yellow leaves falling. In such an environment, the animal can be seen walking. The photo depicts the animal being active in nature while also conveying that it is autumn.) |
Japanese Stable VLM | 紅葉と鹿の写真です。 这是红叶与鹿的照片。 (It’s a photo of autumn leaves and deer.) |
EvoVLM-JP |
{CJK}UTF8ipxmこの写真は、一匹の小さな鹿が落ち葉の上に立っている様子を捉えています。鹿は警戒しているようで、周囲を注意深く見回しています。背景には紅葉した木々が見え、秋の雰囲気が感じられます。 {CJK}UTF8ipxm 照片捕捉到一只小鹿站在落叶上的画面。鹿显得十分警觉,正在谨慎地环视四周。背景中可见红叶满枝的树木,秋意盎然。 (This photo captures a small deer standing on fallen leaves. The deer appears to be alert, carefully scanning its surroundings. In the background, you can see trees with autumn foliage, creating an atmosphere of autumn.) |
Example 3 示例 3 | |
[Image: photo of a traffic light] |
|
User 用户 |
この信号機の色は何色ですか?
(What color is this traffic light?) (这个交通灯是什么颜色的?) |
LLaVA-1.6-Mistral-7B |
(A) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。)
(B) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。)
(C) この信号機の色は緑色です。 (The color of this traffic light is green. 这个交通灯的颜色是绿色。) |
Japanese Stable VLM |
(A) 緑色です。 (It is green. 是绿色。)
(B) 緑と白です。 (It is green and white. 是绿色和白色。)
(C) 白色 (White. 白色) |
EvoVLM-JP |
(A) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。)
(B) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。)
(C) この信号機の色は青です。 (The color of this traffic light is blue. 这个交通灯的颜色是蓝色。) |
Example 4 示例 4 | |
[Image: photo of the Atomic Bomb Dome in Hiroshima] |
|
User 用户 |
この建物について教えてください。 请告诉我关于这座建筑的信息。 (Please explain this building.) |
LLaVA-1.6-Mistral-7B |
{CJK}UTF8ipxmこの建物は、日本で位置しています。具体的には、東京都渋谷区にあります。この建物は、1945年8月15日に、第二次世界大戦中に日本軍によって攻撃され、完全に破壊されました。その後、1964年に再構築されました。 この建物は、「平和の塔」(Peace Tower)と呼ばれています。その名前は、この建物が再構築される際に決められた名前です。この建物は、平和を祈るために作られたものであり、国内外から多くの人々が訪れる場所です。 {CJK}UTF8ipxm 该建筑位于日本,具体位置在东京都涩谷区。1945 年 8 月 15 日第二次世界大战期间,该建筑遭到日军攻击完全损毁,后于 1964 年重建。这座建筑被称为"和平塔"(Peace Tower),其名称是在重建时确定的。该建筑是为祈愿和平而建,现已成为国内外众多访客前来瞻仰的场所。 (This building is located in Japan, specifically in Shibuya Ward, Tokyo. It was completely destroyed by the Japanese military during World War II on August 15, 1945. Subsequently, it was reconstructed in 1964. This building is called the "Peace Tower," a name decided upon during its reconstruction. It was created as a symbol of peace, and it is a place visited by many people from both within and outside the country to pray for peace.) (该建筑位于日本东京涩谷区,1945 年 8 月 15 日二战期间被日军彻底摧毁,1964 年重建后命名为"和平塔",作为和平象征吸引着国内外众多访客前来祈祷。) |
Japanese Stable VLM | 原爆ドーム 原爆圆顶馆 (Atomic Bomb Dome) |
EvoVLM-JP |
{CJK}UTF8ipxmこの建物は、日本の広島市にある「原爆ドーム」です。 1945年8月6日、広島市に原子爆弾が投下されました。 この原子爆弾は、広島市の中心部にある「広島県産業奨励館」に直撃しました。 この奨励館は、現在は「原爆ドーム」と呼ばれています。 この原爆ドームは、原子爆弾の被害を受けた建物の中で、唯一残っている建物です。 この原爆ドームは、1996年に世界遺産に登録されました。 {CJK}UTF8ipxm 这是位于日本广岛市的"原爆圆顶馆"。1945 年 8 月 6 日广岛遭遇原子弹轰炸时,这颗原子弹直接击中了位于市中心的"广岛县产业奖励馆"。该建筑现称"原爆圆顶馆",是原子弹爆炸中唯一存留的建筑遗址,1996 年被列为世界文化遗产。 (This building is the "Atomic Bomb Dome" located in Hiroshima City, Japan. On August 6, 1945, an atomic bomb was dropped on Hiroshima City. The bomb directly hit the "Hiroshima Prefectural Industrial Promotion Hall," located in the central part of Hiroshima City. This promotion hall is now known as the "Atomic Bomb Dome." It is the only building that remains standing among those damaged by the atomic bomb. The Atomic Bomb Dome was registered as a UNESCO World Heritage Site in 1996.) 这座建筑是位于日本广岛市的"原爆圆顶"。1945 年 8 月 6 日,广岛市遭受原子弹轰炸。炸弹直接命中了位于广岛市中心部位的"广岛县产业奖励馆",这座奖励馆如今被称为"原爆圆顶",是原子弹爆炸后受损建筑中唯一屹立不倒的遗存。1996 年,原爆圆顶被联合国教科文组织列入世界遗产名录。 |