License: CC BY 4.0
arXiv:2401.15947v3 [cs.CV] 17 Feb 2024

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, Li Yuan
Abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performance. However, existing scaling methods keep all model parameters active for each token in the calculation, which brings massive training and inference costs. In this work, we propose a simple yet effective training strategy for LVLMs, MoE-Tuning. This strategy addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present MoE-LLaVA, a MoE-based sparse LVLM architecture, which activates only the top-$k$ experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA on a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on the object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

Machine Learning, ICML

1 Introduction

Figure 1: Comparison between MoE-LLaVA-1.8B×4 and open-source LVLMs on object hallucination benchmark. We report the average performance on the POPE (Li et al., 2023d) benchmark, which includes three subsets of Adversarial, Random, and Popular. The red dashed line represents the linear fit to the data points of all models except MoE-LLaVA.
Figure 2: Illustration of MoE-Tuning. The MoE-Tuning consists of three stages. In stage I, only the MLP is trained. In stage II, all parameters are trained except for the Vision Encoder (VE). In stage III, FFNs are used to initialize the experts in MoE, and only the MoE layers are trained. For each MoE layer, only two experts are activated for each token, while the other experts remain silent.

Large Vision-Language Models (LVLMs), such as LLaVA (Liu et al., 2023c) and MiniGPT-4 (Zhu et al., 2023), have shown promising results by leveraging an image encoder and several visual projection layers to enhance the visual perception capabilities of Large Language Models (LLMs). Typically, increasing the model size (Zhang et al., 2023a; Bai et al., 2023b) and dataset scale (Zhang et al., 2023c; Zhao et al., 2023a; Chen et al., 2023d) can improve model performance. For instance, InternVL (Chen et al., 2023e) has extended the image encoder to 6B parameters. A series of works (Li et al., 2022; Dai et al., 2023; Liu et al., 2023b) have expanded the backend of LVLMs to 13B parameters and achieved state-of-the-art performance on downstream tasks. IDEFICS (Laurençon et al., 2023) even trained an LVLM with 80B parameters. Scaling has proven similarly effective for LLMs, which are typically pretrained at 34B parameters (SUSTech-IDEA, 2023; 01-ai, 2023; FlagAI-Open, 2023) or 70B parameters (Touvron et al., 2023a, b; Bai et al., 2023a; DeepSeek-AI, 2024; Zhang & Yang, 2023), and models surpassing 100B parameters are common (Brown et al., 2020; Zeng et al., 2022; Zhang et al., 2022; Scao et al., 2022; Li et al., 2023c; falconry, 2023).

In practical applications, scaling models with high-quality training data is crucial for improving model performance (Lepikhin et al., 2020). However, training and deploying such large models demand significant computational costs and efficient implementation on parallel devices, which can be extremely expensive. This is because these are dense models, in which each token requires computation with all model parameters. In contrast, sparse Mixtures of Experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013) effectively scale model capacity by using a fixed number of activated parameters to process data, and have thrived in the field of NLP (Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022). Recently, Mistral LLM (Jiang et al., 2023) equipped with MoE layers has gained popularity among LLMs. Mixtral-MoE-8×7B (Jiang et al., 2024) achieves performance comparable to LLaMA 2-70B with fewer computational resources.

However, directly applying MoE to train sparse LVLMs is challenging. We observe that simultaneously converting an LLM to an LVLM and sparsifying the model leads to significant performance degradation. After multiple attempts, we find that proper initialization is crucial for sparsifying the LVLM. Therefore, we introduce a simple yet effective three-stage training strategy, MoE-Tuning. Specifically, as shown in Figure 2, we first train an MLP that adapts visual tokens to the LLM in stage I. Then, in stage II, we pre-empower the LVLM with a general multi-modal understanding capability by training all of the LLM's parameters. In stage III, we replicate the FFN as the initialization weights for the experts and train only the MoE layers. In this way, the sparse model gradually transitions from a general LVLM initialization to a sparse mixture of experts.

In this work, we explore a baseline for the LVLM with mixture of experts called MoE-LLaVA, which incorporates mixture of experts and learnable routers. MoE-LLaVA consists of multiple sparse paths where each token is dispatched to different experts through the router. The activated experts collectively process the tokens, while the inactive paths remain silent. By iteratively stacking MoE encoder layers, MoE-LLaVA provides a sparse path toward a larger and more powerful LVLM.

As a result, as shown in Figure 1, our MoE-LLaVA with only 2.2B sparse activated parameters outperforms models with similar activated parameters as well as LLaVA-1.5-13B, surpassing the latter by a large margin on the POPE object hallucination benchmark. Additionally, MoE-LLaVA achieves comparable performance to InternVL-Chat-19B, which has approximately 8 times the activated parameters. We further scale MoE-LLaVA to 3.6B sparse activated parameters, which outperforms LLaVA-1.5-7B by 1.9%, 0.4%, 0.9%, 30.7%, and 3.8% on ScienceQA, POPE, MMBench, LLaVA$^{\text{W}}$, and MM-Vet, respectively. Extensive experiments validate the rationality of our MoE-LLaVA architecture and MoE-Tuning strategy.

We summarize our primary contributions as follows:

  • We explore MoE-Tuning, a novel three-stage training strategy for adapting MoE to LVLMs and preventing the model degradation caused by sparsity.
  • We propose MoE-LLaVA, a MoE-based sparse LVLM framework, which significantly expands the number of parameters while maintaining computational costs.
  • Extensive experiments demonstrate that our MoE-LLaVA has excellent multi-modal understanding and hallucination mitigation abilities. With only approximately 3B sparse activated parameters, our method achieves comparable performance with SOTA 7B models on the visual understanding benchmarks. It is worth noting that MoE-LLaVA outperforms LLaVA-1.5-13B by 1.1% on the POPE hallucination benchmark with 2.2B activated parameters.

2 Related Work

2.1 Large Vision-Language Models

Powerful LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022; Touvron et al., 2023b; Zheng et al., 2023; Team, 2023; Sun et al., 2023; Du et al., 2021; Bai et al., 2023a; Yang et al., 2023; Penedo et al., 2023; Taori et al., 2023) with strong instruction-following and generalization capabilities have been applied to LVLMs. Early works such as BLIP-2 (Li et al., 2023b) and FROMAGe (Koh et al., 2023) encoded visual signals into a sequence of visual tokens, successfully adapting vision to LLMs through several projection layers. Subsequently, recent works have focused on improving performance through methods such as expanding the instruction-tuning dataset (Liu et al., 2023a, c; Zhang et al., 2023c; Zhao et al., 2023a; Chen et al., 2023d), optimizing training strategies (Bai et al., 2023b; Chen et al., 2023b), increasing image resolution (Liu et al., 2023b; Bai et al., 2023b; Wang et al., 2023d), enhancing image encoders (Chen et al., 2023e; Zhang et al., 2023a; Bai et al., 2023b), aligning the input (Lin et al., 2023), and improving projection layers (Cha et al., 2023; Alayrac et al., 2022; Bai et al., 2023b; Dai et al., 2023; Ye et al., 2023; Zhao et al., 2023a). These works empowered LVLMs with powerful visual understanding capabilities by expanding the visual instruction fine-tuning datasets and model scales.

Currently, some works have endowed LVLMs with fine-grained image understanding capabilities, such as region understanding (Chen et al., 2023c; Zhao et al., 2023b; Liu et al., 2023e), multi-region understanding (Wang et al., 2023c; Pi et al., 2023; Peng et al., 2023), and pixel-wise grounding (Rasheed et al., 2023; Lai et al., 2023). However, the cost of scaling up dense visual data and models is challenging to bear (Liu et al., 2022; Yin et al., 2023). In this work, we aim to make state-of-the-art LVLMs research more accessible by leveraging mixture of experts.

2.2 Mixture of Experts in Multi-modal Learning

Mixture of Experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013) is a hybrid model consisting of multiple sub-models, known as experts, which are integrated together. The key concept of MoE is the use of a router to determine the token set that each expert handles, thereby reducing interference between different types of samples.

Figure 3: Training framework and strategy. MoE-LLaVA adopts a three-stage training strategy. (a) We solely train the MLP to adapt the LLM to visual inputs. (b) Training the LLM backend empowers multi-modal understanding capability and MoE layers are not involved. (c) In this stage, we replicate the weights of the FFN to initialize each expert.

Hard Routers. In the hard router mode, each expert is typically pre-defined as a specific pattern. This is because multi-modal data naturally exhibit gaps (Liang et al., 2022), making it difficult for soft routers to learn the optimal patterns for assigning tokens to different experts. A series of works (Bao et al., 2022; Long et al., 2023; Satar et al., 2022; Wang et al., 2022; Shen et al., 2023) naturally decouple experts based on modal categories and pre-define each expert to handle a specific modality. An important feature of these hard-based routers is that they do not require learning the router. This mode is also widely applied in the task-specific MoE (Li et al., 2023e; Zhu et al., 2022; Ma et al., 2023; Kudugunta et al., 2021).

Soft Routers. Some works (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) in natural language processing have explored MoE based on soft routers. Soft routers enable dynamic allocation of data among different experts, allowing each expert to focus on its expertise and achieve model sparsity. Therefore, our main focus is on leveraging soft routers in the MoE. Small-scale (million-level) models based on soft routers have also been explored in the context of multi-modal learning, such as EVE (Chen et al., 2023a) and LIMoE (Mustafa et al., 2022), which attempt a fusion of data by using soft routers. The work most relevant to ours is MoCLE (Gou et al., 2023). However, MoCLE clusters different instruction sets and distributes them to different experts, which compromises the flexibility and autonomy of the experts. In contrast, MoE-LLaVA relies on knowledge-rich routers to distribute tokens to different paths.

3 Method

3.1 Overview

As shown in Figure 3, MoE-LLaVA consists of a vision encoder, a visual projection layer (MLP), a word embedding layer, multiple stacked LLM blocks, and MoE blocks. We first introduce the model architecture of MoE-LLaVA in three stages in Section 3.2. Furthermore, in Section 3.3, we explain how to train MoE-LLaVA. Finally, in Section 3.4, we elaborate on the training objectives of MoE-LLaVA.

Table 1: Architecture details of the MoE-LLaVA model. “FFN Factor” represents the number of linear layers in the FFN. “1.6B×4-Top2” represents a dense foundation model with 1.6B parameters, which is equipped with a total of four experts, two of them being activated.

| Name | Experts | Top-k | MoE Layers | Embedding | Width | Layers | FFN | FFN Factor | Heads | Activated Param | Total Param |
|---|---|---|---|---|---|---|---|---|---|---|---|
| StableLM-1.6B (Team, ) | - | - | - | 100352 | 2560 | 32 | 10240 | 2 | 32 | 1.6B | 1.6B |
| MoE-LLaVA-1.6B×4-Top2 | 4 | 2 | 16 | 100352 | 2560 | 32 | 10240 | 2 | 32 | 2.0B | 2.9B |
| Qwen-1.8B (Bai et al., 2023a) | - | - | - | 151936 | 2048 | 24 | 5504 | 3 | 16 | 1.8B | 1.8B |
| MoE-LLaVA-1.8B×4-Top2 | 4 | 2 | 12 | 151936 | 2048 | 24 | 5504 | 3 | 16 | 2.2B | 3.1B |
| Phi2-2.7B (Microsoft, 2023) | - | - | - | 51200 | 2560 | 32 | 10240 | 2 | 32 | 2.7B | 2.7B |
| MoE-LLaVA-2.7B×4-Top2 | 4 | 2 | 16 | 51200 | 2560 | 32 | 10240 | 2 | 32 | 3.6B | 5.3B |
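
The activated and total parameter counts in Table 1 can be roughly cross-checked from the other columns. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it assumes each MoE layer replaces one FFN with `experts` copies of it, counts only FFN weight matrices (no biases, routers, or other components), and uses the hypothetical helper names `ffn_params` and `moe_param_estimate`.

```python
def ffn_params(width: int, ffn_dim: int, ffn_factor: int) -> int:
    """Weights in one FFN: `ffn_factor` linear layers of size width x ffn_dim (biases ignored)."""
    return ffn_factor * width * ffn_dim


def moe_param_estimate(dense_params, width, ffn_dim, ffn_factor,
                       moe_layers, experts, top_k):
    """Return rough (activated, total) parameter counts for a sparsified model.

    Assumption: the dense model already contains one FFN per layer, so each
    MoE layer adds (experts - 1) FFN copies in total and (top_k - 1) copies
    to the per-token compute path.
    """
    per_ffn = ffn_params(width, ffn_dim, ffn_factor)
    activated = dense_params + (top_k - 1) * moe_layers * per_ffn
    total = dense_params + (experts - 1) * moe_layers * per_ffn
    return activated, total


# Qwen-1.8B row of Table 1: width 2048, FFN 5504, FFN factor 3, 12 MoE layers.
activated, total = moe_param_estimate(1.8e9, 2048, 5504, 3,
                                      moe_layers=12, experts=4, top_k=2)
print(f"activated ~ {activated / 1e9:.1f}B, total ~ {total / 1e9:.1f}B")
```

For this row the estimate gives about 2.2B activated and 3.0B total parameters, close to the 2.2B and 3.1B reported in Table 1. The estimate is not expected to match every row exactly, since it omits everything except the FFN weights.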

3.2 Architecture of MoE-LLaVA

As shown in Table 1, we present the detailed configuration of MoE-LLaVA; more details can be found in Section A.1. Given an RGB image $\mathbf{v}\in\mathbb{R}^{H\times W\times 3}$, where $H$ and $W$ are the original resolution, the vision encoder processes the input image to obtain a visual token sequence $\mathcal{Z}=[z_{1},z_{2},\cdots,z_{P}]\in\mathbb{R}^{P\times C}$, where $P=\frac{H\times W}{14^{2}}$ represents the sequence length of visual tokens. A visual projection layer $f$ is used to map $\mathcal{Z}\in\mathbb{R}^{P\times C}$ to $\mathcal{V}\in\mathbb{R}^{P\times D}$, where $D$ represents the hidden size of the LLM. Similarly, the text undergoes a word embedding layer $g$ and is projected to obtain the sequence tokens $\mathcal{T}=[t_{1},t_{2},\cdots,t_{N}]\in\mathbb{R}^{N\times D}$, where $N$ represents the sequence length of text tokens.
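
To make the shapes concrete, the following minimal sketch computes $P$ for a given resolution and builds a concatenated visual-plus-text sequence. It is illustrative only: the 224×224 input, the toy sizes `C`, `D`, `N`, and the single linear map standing in for the projection $f$ are all assumptions, and `num_visual_tokens` and `project` are hypothetical helper names.

```python
def num_visual_tokens(height: int, width: int, patch: int = 14) -> int:
    """P = (H * W) / patch^2, assuming both sides are multiples of the patch size."""
    if height % patch or width % patch:
        raise ValueError("resolution must be divisible by the patch size")
    return (height // patch) * (width // patch)


def project(tokens, weight):
    """Linear stand-in for the projection f: maps (P x C) tokens to (P x D)."""
    return [[sum(z[c] * weight[c][d] for c in range(len(weight)))
             for d in range(len(weight[0]))] for z in tokens]


P = num_visual_tokens(224, 224)      # hypothetical 224 x 224 input image
C, D, N = 4, 3, 5                    # toy sizes, not the real model dimensions
Z = [[1.0] * C for _ in range(P)]    # visual tokens from the vision encoder, P x C
W_f = [[0.5] * D for _ in range(C)]  # toy projection weights, C x D
V = project(Z, W_f)                  # projected visual tokens, P x D
T = [[0.0] * D for _ in range(N)]    # embedded text tokens, N x D
x0 = V + T                           # visual tokens followed by text tokens
print(P, len(x0), len(x0[0]))
```

With a 14-pixel patch, a 224×224 image yields 16×16 = 256 visual tokens, so the combined sequence here has 256 + 5 positions of width `D`.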

Subsequently, we concatenate the visual tokens and text tokens and feed them into a large language model. In stage I, we train only the visual projection layer. The large language model consists of stacked multi-head self-attention (MSA) and feed-forward neural networks (FFN). Layer normalization (LN) and residual connections are applied within each block (Wang et al., 2019; Baevski & Auli, 2018). We formulate this as:

$$\mathbf{x}_{0}=[v_{1},v_{2},\cdots,v_{P},\cdots,t_{1},t_{2},\cdots,t_{N}], \tag{1}$$
$$\mathbf{x}_{\ell}^{\prime}=\mathrm{MSA}(\mathrm{LN}(\mathbf{x}_{\ell-1}))+\mathbf{x}_{\ell-1},\quad \ell=1\ldots L, \tag{2}$$
$$\mathbf{x}_{\ell}=\mathrm{MoE}(\mathrm{LN}(\mathbf{x}^{\prime}_{\ell}))+\mathbf{x}^{\prime}_{\ell},\quad \ell=1\ldots L, \tag{3}$$
$$\mathcal{Y}=\mathrm{LN}(\mathbf{x}_{L}). \tag{4}$$
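
The pre-LN residual structure of Eqs. (2) to (4) can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the model's implementation: `layer_norm` has no learned scale or bias, and the `mean_mix` and `double` functions are placeholder stand-ins for MSA and MoE.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def block(tokens, msa, moe):
    """One layer: Eq. (2) then Eq. (3), each with pre-LN and a residual connection."""
    normed = [layer_norm(t) for t in tokens]
    attn_out = msa(normed)  # MSA mixes information across tokens
    x_prime = [[a + b for a, b in zip(o, t)] for o, t in zip(attn_out, tokens)]
    # The MoE layer acts on each token independently.
    return [[a + b for a, b in zip(moe(layer_norm(t)), t)] for t in x_prime]


def forward(tokens, layers):
    """Stack the blocks and apply the final LN of Eq. (4)."""
    for msa, moe in layers:
        tokens = block(tokens, msa, moe)
    return [layer_norm(t) for t in tokens]


# Toy stand-ins (assumptions): "attention" averages all tokens, the "MoE" doubles its input.
def mean_mix(seq):
    avg = [sum(col) / len(seq) for col in zip(*seq)]
    return [avg for _ in seq]


def double(x):
    return [2.0 * v for v in x]


y = forward([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], layers=[(mean_mix, double)] * 2)
print(len(y), len(y[0]))
```

The point of the sketch is the data flow: the residual adds the un-normalized input back after each sub-layer, and only the final output passes through a standalone LN.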

MoE Forward. Typically, a MoE layer consists of multiple FFNs. As an initialization step, we replicate the FFNs from stage II to form an ensemble of experts $\mathcal{E}=[e_{1},e_{2},\cdots,e_{E}]$. The router is a linear layer that predicts the probability of each token being assigned to each expert. We formulate this as:

$$\mathcal{P}(\mathbf{x})_{i}=\frac{e^{f(\mathbf{x})_{i}}}{\sum_{j}^{E}e^{f(\mathbf{x})_{j}}}, \tag{5}$$

where the router produces weight logits $f(\mathbf{x})=\mathbf{W}\cdot\mathbf{x}$, which are normalized by the softmax function. $\mathbf{W}\in\mathbb{R}^{D\times E}$ represents the lightweight training parameters and $E$ represents the number of experts. Therefore, each token is processed by the top-$k$ experts with the highest probabilities, and the weighted sum is calculated based on the softmax results of the probabilities:

$$\mathrm{MoE}(\mathbf{x})=\sum_{i=1}^{k}\mathcal{P}(\mathbf{x})_{i}\cdot\mathcal{E}(\mathbf{x})_{i}. \tag{6}$$
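
Eqs. (5) and (6) can be illustrated for a single token as follows. This is a minimal sketch, not the released implementation: it follows Eq. (6) literally by weighting each selected expert with its softmax probability without renormalizing over the top-$k$, and the toy router matrix and scaling "experts" are assumptions chosen for readability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the expert logits, Eq. (5)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weight, experts, k=2):
    """Route one token x (length D) through its top-k experts, Eq. (6).

    router_weight is the D x E matrix W; experts is a list of E callables.
    """
    num_experts = len(experts)
    logits = [sum(x[d] * router_weight[d][e] for d in range(len(x)))
              for e in range(num_experts)]
    probs = softmax(logits)
    top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
    out = [0.0] * len(x)
    for e in top:  # experts outside the top-k are never evaluated
        y = experts[e](x)
        out = [o + probs[e] * v for o, v in zip(out, y)]
    return out, top


# Toy setup (assumption): D = 2, E = 4, and expert s simply scales its input by s.
x = [1.0, 0.0]
W = [[2.0, 1.0, 0.0, -1.0],
     [0.0, 0.0, 0.0, 0.0]]
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
out, top = moe_forward(x, W, experts, k=2)
print(top, out)
```

Because only the selected experts are called, the per-token cost depends on $k$ rather than on $E$, which is the sense in which capacity grows while computation stays fixed.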

3.3 MoE-Tuning

Stage I: In this stage, our objective is to adapt the image tokens to LLM, allowing the LLM to comprehend the instances in the images. To achieve this, we employ an MLP to project the image tokens into the input domain of the LLM, treating the image patches as pseudo-text tokens. During this stage, the LLM is trained to describe the images. MoE layers are not applied to the LLM during this stage.

Stage II: Tuning with multi-modal instruction data is a key technique to enhance the capabilities and controllability of large models (Zhang et al., 2023b). In this stage, LLM is adjusted to become an LVLM with multi-modal understanding. We use more complex instructions, including tasks such as image logical reasoning and text recognition, which require the model to have a stronger multi-modal understanding. Typically, for dense models, the LVLM training is considered complete at this stage. However, we encounter challenges in simultaneously transforming the LLM into an LVLM and sparsifying the LVLM. Therefore, MoE-LLaVA utilizes the weights from the second stage as initialization for the third stage to alleviate the learning difficulty of the sparse model.

Stage III: As an initialization, we replicate the FFN multiple times to initialize the experts. When image tokens and text tokens are fed into the MoE layers, the router calculates the matching weights between each token and the experts. Each token is then processed by the top-$k$ experts, and the outputs are aggregated by weighted summation based on the router’s weights. When the top-$k$ experts are activated, the remaining experts remain silent. This modeling approach forms MoE-LLaVA with infinitely possible sparse pathways, offering a wide range of capabilities.
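
The replication step in stage III amounts to giving every expert its own copy of the stage-II FFN weights. A minimal sketch, assuming a hypothetical weight layout (a dict of nested lists) and a hypothetical helper `init_experts`:

```python
import copy


def init_experts(ffn_weights, num_experts):
    """Return independent copies of the stage-II FFN weights, one per expert."""
    return [copy.deepcopy(ffn_weights) for _ in range(num_experts)]


# Toy FFN weights (hypothetical layout, not the real parameter names).
ffn = {"up": [[0.1, 0.2], [0.3, 0.4]], "down": [[0.5, 0.6], [0.7, 0.8]]}
experts = init_experts(ffn, num_experts=4)

# All experts start identical, but each owns its own parameters ...
assert all(e == ffn for e in experts)
experts[0]["up"][0][0] = 9.9  # ... so updating one leaves the others untouched
assert experts[1]["up"][0][0] == 0.1 and ffn["up"][0][0] == 0.1
```

The deep copy matters: if the experts shared one set of weights, they could never diverge during training. Since they all start from the same values, any specialization has to emerge from the routing and the stage-III updates.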

3.4 Training Objectives

The total loss $\mathcal{L}_{\text{total}}$ consists of the auto-regressive loss $\mathcal{L}_{\text{regressive}}$ and the auxiliary loss $\mathcal{L}_{\text{aux}}$, where the auxiliary loss is scaled by the balancing coefficient $\alpha$:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{regressive}}+\alpha\cdot\mathcal{L}_{\text{aux}}. \tag{7}$$

Auto-Regressive Loss. We optimize the output of the LLM through a generative loss in an auto-regressive manner. Given an image and text, MoE-LLaVA generates the output sequence $\mathcal{Y}=[y_{1},y_{2},\cdots,y_{K}]\in\mathbb{R}^{K\times D}$ by progressively generating each element, where $K=P+D$ represents the length of the output sequence. The formula is:

$$\mathcal{L}_{\text{regressive}}=-\sum_{i=1}^{N}\log p_{\theta}\left(\mathcal{Y}^{[P+i]}\mid\mathcal{V},\mathcal{T}^{[:i-1]}\right), \tag{8}$$

where $\theta$ denotes the trainable parameters, and the loss is calculated only for the generated text.
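
To make the masking in Eq. (8) concrete, the following minimal Python sketch sums the negative log-likelihood only over positions flagged as generated text. The names `log_softmax`, `autoregressive_loss`, and `is_generated`, as well as the toy inputs, are illustrative assumptions and are not taken from the MoE-LLaVA code base.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def autoregressive_loss(logits, targets, is_generated):
    """Negative log-likelihood summed over generated positions only.

    logits: one list of vocabulary scores per position
    targets: the target token id at each position
    is_generated: True where the position belongs to the generated text
    """
    loss = 0.0
    for scores, target, keep in zip(logits, targets, is_generated):
        if keep:  # prompt and image positions contribute nothing
            loss -= log_softmax(scores)[target]
    return loss

# Toy example (hypothetical data): 5 positions, vocabulary of 4, uniform scores.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
targets = [0, 1, 2, 3, 0]
is_generated = [False, False, False, True, True]
loss = autoregressive_loss(logits, targets, is_generated)
```

With uniform scores every target has probability 1/4, so only the two generated positions contribute, each adding log 4 to the loss.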

Auxiliary Loss. Due to the presence of multiple experts, it is necessary to impose load balancing constraints on the MoE layer. We incorporate a differentiable load balancing loss (Fedus et al., 2022) into each MoE layer to encourage experts to handle tokens in a balanced manner, as follows:

$$\mathcal{L}_{\text{aux}}=E\cdot\sum_{i=1}^{E}\mathcal{F}_{i}\cdot\mathcal{G}_{i}, \tag{9}$$

where $\mathcal{F}$ represents the fraction of tokens processed by each expert $\mathcal{E}_{i}$, and $\mathcal{G}$ represents the average routing probability of $\mathcal{E}_{i}$, which can be expressed by the following formulas:

$$\mathcal{F}=\frac{1}{K}\sum_{i=1}^{E}\mathrm{1}\{\operatorname{argmax}\mathcal{P}(\mathbf{x})=i\}, \tag{10}$$

$$\mathcal{G}=\frac{1}{K}\sum_{i=1}^{K}\mathcal{P}(\mathbf{x})_{i}. \tag{11}$$
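
As an illustration of Eqs. (7) and (9)-(11), the sketch below computes the load balancing loss for one MoE layer in plain Python. It reads $\mathcal{F}_{i}$ and $\mathcal{G}_{i}$ as per-expert quantities averaged over the tokens, in the sense of Fedus et al. (2022); the function names and toy router outputs are assumptions made for this example.

```python
def load_balancing_loss(router_probs):
    """Auxiliary loss E * sum_i F_i * G_i for one MoE layer.

    router_probs: one list per token holding its routing probabilities
    over the E experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # F_i: fraction of tokens whose most probable expert is i.
    fraction = [0.0] * num_experts
    for probs in router_probs:
        fraction[probs.index(max(probs))] += 1.0 / num_tokens
    # G_i: routing probability of expert i averaged over the tokens.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens
                 for i in range(num_experts)]
    return num_experts * sum(f * g for f, g in zip(fraction, mean_prob))

def total_loss(regressive, aux, alpha=0.01):
    """Eq. (7): auto-regressive loss plus the scaled auxiliary loss."""
    return regressive + alpha * aux

# Hypothetical router outputs for 4 tokens and 4 experts.
balanced = [[0.7 if i == j else 0.1 for i in range(4)] for j in range(4)]
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

In this toy case the loss is lowest when tokens are spread evenly over the experts and grows when one expert receives every token, which is the behavior the constraint is meant to penalize.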

4 Experiments

4.1 Experimental Setup

Model Settings. Following LLaVA 1.5 (Liu et al., 2023b), we utilize CLIP-Large (Radford et al., 2021) as the vision encoder, and the MLP consists of two linear layers with a GELU activation function (Hendrycks & Gimpel, 2016) between them. Unless otherwise specified, MoE-LLaVA alternates FFN layers with MoE layers, meaning that the number of MoE layers is half of the total number of layers. The value of the balancing coefficient $\alpha$ is 0.01. We provide additional training details in Section A.2.

Data Details. As shown in Table 2, we reorganize the currently available data for the three-stage training. For the first stage of pretraining, we use the pretraining data of LLaVA 1.5-558k (Liu et al., 2023b). For the second stage, we collect datasets from MIMIC-IT (Li et al., 2023a), LRV (Liu et al., 2023a), SViT (Zhao et al., 2023a) and LVIS (Wang et al., 2023b) to provide a robust initialization for MoE-LLaVA. For the third stage, we utilize the same data pipeline as LLaVA-mix-665k (Liu et al., 2023b).

Table 2: Composition of the data groups. For the MIMIC-IT and SViT datasets, we only use the LA split and the core split, respectively.

| Data group | Usage | Source | #Sample |
|---|---|---|---|
| LLaVA-PT | Stage I | LLaVA 1.5-558k | 558k |
| Hybrid-FT | Stage II | SViT-157k, LVIS-220k, LRV-331k, MIMIC-IT-256k | 964k |
| LLaVA-FT | Stage III | LLaVA 1.5-mix-665k | 665k |

Table 3: Comparison among different LVLMs on image understanding benchmarks. “Res.”, “Act.”, “L”, “V”, “S”, “Q”, “P”, “M” and “I” respectively represent the input image resolution, activated parameters, LLaMA (Touvron et al., 2023a), Vicuna (Chiang et al., 2023), StableLM (Team, ), Qwen (Bai et al., 2023a), Phi-2 (Microsoft, 2023), MobileLLaMA (Chu et al., 2023) and IDEFICS (Laurençon et al., 2023). Evaluation benchmarks include VQA-v2 (Goyal et al., 2017); GQA (Hudson & Manning, 2019); VisWiz (Gurari et al., 2018); SQA$^{\text{I}}$: ScienceQA-IMG (Lu et al., 2022); VQA$^{\text{T}}$: TextVQA (Singh et al., 2019); POPE (Li et al., 2023d); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023d); LLaVA$^{\text{W}}$: LLaVA-Bench (in-the-Wild) (Liu et al., 2023c); MM-Vet (Yu et al., 2023). $^{*}$ denotes that there is some overlap in the training data. $^{\dagger}$ denotes that the model is trained with an image resolution of 384. The best results and second best results are indicated by boldface and underline, respectively.

Column groups: Image Question Answering covers VQA$^{\text{v2}}$, GQA, VisWiz, SQA$^{\text{I}}$ and VQA$^{\text{T}}$; Benchmark Toolkit covers POPE, MME, MMB, LLaVA$^{\text{W}}$ and MM-Vet.

| Methods | LLM | Act. | Res. | VQA$^{\text{v2}}$ | GQA | VisWiz | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | MME | MMB | LLaVA$^{\text{W}}$ | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | | | |
| I-80B (Laurençon et al., 2023) | L-65B | 65B | 224 | 60.0 | 45.2 | 36.0 | - | 30.9 | - | - | 54.5 | - | - |
| LLaVA-1.5 (Liu et al., 2023b) | V-13B | 13B | 336 | 80.0* | 63.3* | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 70.7 | 35.4 |
| Qwen-VL (Bai et al., 2023b) | Q-7B | 6.7B | 448 | 78.8* | 59.3* | 35.2 | 67.1 | 63.8 | - | - | 38.2 | - | - |
| LLaVA-1.5 (Liu et al., 2023b) | V-7B | 6.7B | 336 | 78.5* | 62.0* | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 63.4 | 30.5 |
| TinyGPT-V (Yuan et al., 2023) | P-2.7B | 2.7B | 448 | - | 33.6* | 33.4 | - | - | - | - | - | - | - |
| MobileVLM (Chu et al., 2023) | M-2.7B | 2.7B | 336 | - | 59.0* | - | 61.0 | 47.5 | 84.9 | 1288.9 | 59.6 | - | - |
| LLaVA-Phi (Zhu et al., 2024) | P-2.7B | 2.7B | 336 | 71.4* | - | 35.9 | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 | - | 28.9 |
| Sparse Model | | | | | | | | | | | | | |
| MoE-LLaVA-1.6B×4-Top2 | S-1.6B | 2.0B | 336 | 76.7* | 60.3* | 36.2 | 62.6 | 50.1 | 85.7 | 1318.2 | 60.2 | 86.8 | 26.9 |
| MoE-LLaVA-1.8B×4-Top2 | Q-1.8B | 2.2B | 336 | 76.2* | 61.5* | 32.6 | 63.1 | 48.0 | 87.0 | 1291.6 | 59.7 | 88.7 | 25.3 |
| MoE-LLaVA-2.7B×4-Top2 | P-2.7B | 3.6B | 336 | 77.6* | 61.4* | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 94.1 | 34.3 |
| MoE-LLaVA-1.6B×4-Top2† | S-1.6B | 2.0B | 384 | 78.6* | 61.5* | 40.5 | 63.9 | 54.3 | 85.9 | 1335.7 | 63.3 | 90.3 | 32.3 |
| MoE-LLaVA-2.7B×4-Top2† | P-2.7B | 3.6B | 384 | 79.9* | 62.6* | 43.7 | 70.3 | 57.0 | 85.7 | 1431.3 | 68.0 | 97.3 | 35.9 |

Table 4: Zero-shot object hallucination evaluation results. “Yes” indicates the proportion of positive responses to the given question.

| Methods | LLM | Activated | Adversarial Acc | Adversarial F1-Score | Adversarial Yes | Popular Acc | Popular F1-Score | Popular Yes | Random Acc | Random F1-Score | Random Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | |
| mPLUG-Owl (Ye et al., 2023) | L-7B | 6.7B | 82.4 | 81.6 | 45.2 | 85.5 | 84.3 | 42.1 | 86.3 | 85.3 | 42.3 |
| MM-GPT (Gong et al., 2023) | L-7B | 6.7B | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 |
| LLaVA-1.5 (Liu et al., 2023b) | V-13B | 13B | 85.5 | 84.4 | 43.3 | 87.4 | 86.2 | 41.3 | 88.0 | 87.1 | 41.7 |
| Sparse Model | | | | | | | | | | | |
| MoE-LLaVA-1.6B×4-Top2 | S-1.6B | 2.0B | 86.9 | 85.7 | 41.7 | 85.3 | 84.2 | 43.5 | 88.0 | 87.1 | 41.6 |
| MoE-LLaVA-1.8B×4-Top2 | Q-1.8B | 2.2B | 86.1 | 85.4 | 44.9 | 88.6 | 87.7 | 42.5 | 88.7 | 88.0 | 43.0 |
| MoE-LLaVA-2.7B×4-Top2 | P-2.7B | 3.6B | 85.9 | 84.9 | 43.2 | 87.5 | 86.4 | 41.8 | 88.5 | 87.7 | 41.8 |
| MoE-LLaVA-1.6B×4-Top2† | S-1.6B | 2.0B | 86.9 | 85.6 | 41.5 | 85.7 | 84.6 | 43.0 | 88.4 | 87.5 | 41.5 |
| MoE-LLaVA-2.7B×4-Top2† | P-2.7B | 3.6B | 85.5 | 84.2 | 41.9 | 86.7 | 84.4 | 41.7 | 87.9 | 86.9 | 40.6 |

Figure 4: Distribution of expert loadings. The dashed lines represent a perfectly balanced distribution of tokens among different experts or modalities. The first figure on the left illustrates the workload among experts, while the remaining four figures depict the preferences of experts towards different modalities.

4.2 Image Understanding Evaluation

Zero-shot Image Question Answering. As shown in Table 3, since MoE-LLaVA is a sparse model equipped with a soft router based on LVLM, we categorize the previous models as dense models. We evaluate the performance of MoE-LLaVA on five image question-answering benchmarks and report the number of activated parameters. Compared to the state-of-the-art method LLaVA 1.5, MoE-LLaVA demonstrates powerful image understanding capabilities and performs very close to LLaVA-1.5 on five benchmarks. Specifically, MoE-LLaVA-Phi-2.7B×4 surpasses LLaVA-1.5-7B by 2.7% on SQA$^{\text{I}}$ using 3.6B sparsely activated parameters. Notably, MoE-LLaVA-StableLM-1.6B×4 achieves comprehensive superiority over IDEFICS-80B with only 2.0B activated parameters. Furthermore, we compare against the recent small-scale vision-language model LLaVA-Phi: MoE-LLaVA-Phi-2.7B×4 outperforms LLaVA-Phi by more than 6.2% on VQA$^{\text{v2}}$, highlighting the strong comprehension abilities of MoE-LLaVA in natural vision.

Evaluation under Benchmark Toolkits. To comprehensively evaluate the multi-modal understanding capabilities of MoE-LLaVA, we evaluate its performance on four benchmark toolkits. These benchmark toolkits typically involve open-ended answers, serving as tools to verify a model’s ability to engage in natural language questioning. In Table 3, MoE-LLaVA-Qwen-1.8B×4 surpasses Qwen-VL-7B by 21.5% on MMBench, despite the latter utilizing higher image resolutions. These results collectively demonstrate that the sparse model MoE-LLaVA achieves comparable or even superior performance to dense models with fewer activated parameters.

4.3 Object Hallucination Evaluation

We adopt the evaluation pipeline of POPE (Li et al., 2023d), a polling-based query method, to evaluate object hallucination in MoE-LLaVA. The results are presented in Table 4, where MoE-LLaVA exhibits the best performance, indicating that MoE-LLaVA tends to generate objects consistent with the given image. Specifically, MoE-LLaVA-1.8B×4 surpasses LLaVA-1.5-13B by 1.0%, 1.5%, and 0.8% in adversarial sampling, popular sampling, and random sampling, respectively, with 2.2B activated parameters. Additionally, we observe that the yes ratio of MoE-LLaVA remains relatively balanced, indicating that our sparse model is capable of providing accurate feedback based on the given questions.
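
The three quantities reported in Table 4 can be computed from binary yes/no answers as in the following sketch. It assumes each question has a ground-truth yes/no label; the function name `pope_metrics` and the toy answers are hypothetical and not part of the POPE tooling.

```python
def pope_metrics(predictions, labels):
    """Accuracy, F1-score, and yes ratio for yes/no answers.

    predictions, labels: lists of booleans, where True means "yes".
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    total = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,
    }

# A model that answers "yes" to everything, on a set with equal yes/no labels.
labels = [True, True, True, False, False, False]
always_yes = pope_metrics([True] * 6, labels)
```

On a label-balanced set, always answering "yes" yields an accuracy of 0.5, an F1-score of 2/3, and a yes ratio of 1.0. This is why the yes ratio is worth reading alongside accuracy and F1: a ratio far from the label balance signals a biased answering pattern rather than grounded answers.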

4.4 Quantitative Analysis

Routing Distributions. In Figure 4, we present the expert loads (leftmost plot) and the modality preferences of different experts (four subplots on the right) for MoE-LLaVA-2.7B×4-Top2 on ScienceQA. More visualizations can be found in Section B.3. Initially, the expert loads in all MoE layers are fully balanced. However, as the model gradually becomes sparser, the load on expert 3 in layers 17 to 27 suddenly increases, and this expert even handles almost all tokens. In the shallow layers (5-11), experts 2, 3, and 4 mainly collaborate. It is worth noting that expert 1 works predominantly in the first few layers only, and as the model becomes deeper, expert 1 gradually withdraws from the workload. Therefore, the experts in MoE-LLaVA have learned a certain pattern that allows them to divide their tasks in a specific manner.
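
A per-layer expert load of the kind plotted in Figure 4 can be obtained by counting routing decisions. The sketch below assumes the load is the share of routed token slots each expert receives in a layer, which is one plausible reading of the figure; `expert_loads` and the toy assignments are hypothetical, and experts are 0-indexed here.

```python
from collections import Counter

def expert_loads(assignments, num_experts):
    """Share of routed token slots handled by each expert in one layer.

    assignments: one tuple per token listing the experts selected for it.
    """
    counts = Counter(expert for chosen in assignments for expert in chosen)
    total = sum(counts.values())
    return [counts[e] / total for e in range(num_experts)]

# Toy top-2 routing decisions for four tokens in a single layer.
layer_assignments = [(2, 3), (2, 3), (0, 2), (2, 1)]
loads = expert_loads(layer_assignments, num_experts=4)
balanced_share = 1 / 4  # reference level when all experts share the load equally
```

Comparing each entry of `loads` with `balanced_share` shows which experts are over- or under-used in that layer; repeating the computation per layer gives a layer-by-expert load profile.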

Furthermore, we show the distribution of modalities across different experts in Figure 5. Similarly, experts develop their own preferences. Additionally, we find that the routing distributions for text and image are highly similar. For example, when expert 3 is actively working in layers 17-27, the proportions of text and image that MoE-LLaVA processes are similar. Each expert in MoE-LLaVA is capable of handling both text tokens and image tokens simultaneously, which demonstrates that MoE-LLaVA does not exhibit a clear preference for any modality. This serves as evidence of its strong interaction in multimodal learning.
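
The modality split per expert can be derived from the same routing record by also tracking whether each token is text or image. This is a minimal sketch under that assumption; `modality_share` and the toy data are hypothetical, and experts are 0-indexed.

```python
def modality_share(assignments, modalities, num_experts):
    """For each expert, the share of its routed tokens that are text or image.

    assignments: one tuple per token listing the experts selected for it.
    modalities: "text" or "image" for each token, in the same order.
    """
    counts = [{"text": 0, "image": 0} for _ in range(num_experts)]
    for chosen, modality in zip(assignments, modalities):
        for expert in chosen:
            counts[expert][modality] += 1
    shares = []
    for c in counts:
        total = c["text"] + c["image"]
        # An expert that received no tokens gets zero shares.
        shares.append({m: (c[m] / total if total else 0.0) for m in c})
    return shares

# Toy top-2 routing decisions with a modality tag per token.
assignments = [(0, 1), (0, 1), (0, 2), (1, 2)]
modalities = ["text", "image", "image", "text"]
shares = modality_share(assignments, modalities, num_experts=4)
```

An expert with no clear modality preference shows text and image shares close to the overall text-to-image token ratio of the input.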

Figure 5: Distribution of modalities across different experts. The dashed lines represent a perfectly balanced distribution of tokens.

Token Pathways. Furthermore, we examine the behavior of experts at the token level. More visualizations can be found in Section B.4 and Section B.5. We track the trajectories of all tokens on downstream tasks. For all activated pathways, we employ PCA (Pearson, 1901) to obtain the top-10 pathways, as shown in Figure 6. We find that for given unseen text or image tokens, MoE-LLaVA consistently tends to assign experts 2 and 3 to handle them in the deeper layers of the model. Experts 1 and 4 tend to handle the tokens during the initialization phase. These findings contribute to a better understanding of the behavior of sparse models in multi-modal learning.

Figure 6: Visualization of activated pathways. We highlight the top-10 activated pathways on the text and image. Among them, the colorful paths represent the top-2 paths for text and image, respectively, while the gray paths represent the remaining 8 paths.

4.5 Ablation Study

In this section, we first validate the necessity of the three-stage training strategy. We then explore the impact of different base models and conduct ablation studies on the number of experts and active experts, and the MoE structure. We provide additional results in Section B.2.

Table 5: Ablation study about different training strategies. “LV” and “Hb” represent LLaVA-FT and Hybrid-FT in Table 2.

| Variant | MoE | Stage II | Stage III | GQA | SQA$^{\text{I}}$ | POPE | LLaVA$^{\text{W}}$ |
|---|---|---|---|---|---|---|---|
| (a) | | - | LV+Hb | 58.4 | 58.1 | 81.9 | 88.0 |
| (b) | | Hb | LV | 61.5 | 63.1 | 87.0 | 88.7 |
| (c) | | LV+Hb | - | 60.9 | 60.2 | 86.4 | 86.3 |
| (d) | | Hb | LV | 60.9 | 62.5 | 86.9 | 90.1 |

Effect of Training Strategy. In Table 5, we conduct three variant experiments to demonstrate the rationale behind using the second-stage instruction tuning as the initialization for the third-stage MoE tuning. When adapting MoE to LVLMs, a straightforward approach is to replace the classic LLaVA’s FFN with a MoE layer and train it according to the original second-stage script, denoted as variant (a). However, variant (a) performs the worst, suggesting that the current multi-modal instruction dataset is insufficient to support both the conversion from LLM to LVLM and the conversion from LVLM to a sparse model simultaneously. Therefore, we collect more data, referred to as Hybrid-FT, and first convert the LLM to an LVLM in the second stage. Subsequently, in the third stage, the LVLM is sparsified by using the LLaVA-FT dataset, resulting in variant (b). Additionally, we expand the data of the original LLaVA’s second stage for fair comparison, denoted as variant (c). The results indicate that variant (b) outperforms variants (a) and (c). These findings demonstrate that providing a reasonable LVLM initialization allows the model to transition rapidly from a dense model to a sparse model, validating the principle behind our three-stage training strategy.

Table 6: Ablation study about training setting and architecture design decisions. Settings for results in Table 3 and Table 4 are highlighted in blue. We report the training time on 8 V100-32G.

| Subset | GQA | VisWiz | VQA$^{\text{T}}$ | POPE | LLaVA$^{\text{W}}$ | Time |
|---|---|---|---|---|---|---|
| FFN | 61.5 | 32.6 | 48.0 | 87.0 | 88.7 | 20h |
| All | 61.3 | 31.9 | 47.6 | 87.0 | 88.1 | 27h |

(a) Tuning the parameters of different subsets.

| Experts | GQA | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | LLaVA$^{\text{W}}$ | Time |
|---|---|---|---|---|---|---|
| 1 | 60.9 | 60.2 | 48.3 | 86.4 | 86.3 | 13h |
| 2 | 61.2 | 60.8 | 47.0 | 87.5 | 86.5 | 14h |

(b) The number of experts.

| Top-k | VQA$^{\text{v2}}$ | GQA | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | Time |
|---|---|---|---|---|---|---|
| 1 | 74.5 | 58.4 | 58.0 | 44.0 | 85.7 | 19h |
| 2 | 76.2 | 61.5 | 63.1 | 48.0 | 88.7 | 20h |

(c) The value of top-k.

| Architecture | VQA$^{\text{v2}}$ | GQA | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | Time |
|---|---|---|---|---|---|---|
| First-Half | 75.9 | 61.3 | 62.4 | 47.0 | 86.9 | 20h |
| Second-Half | 76.3 | 61.2 | 62.6 | 47.2 | 86.9 | 20h |
| Interval | 76.2 | 61.5 | 63.1 | 48.0 | 88.7 | 20h |
| All | 74.5 | 61.5 | 62.1 | 47.1 | 87.0 | 32h |

(d) The architectures of MoE-LLaVA.

Effect of Tuning the Parameters of Different Subsets. In Table 6a, we examine the performance of fine-tuning different parts of the parameters. “FFN” represents fine-tuning all FFN layers and MoE layers in the model. “All” indicates fine-tuning all parameters. The results indicate that tuning the FFN is sufficient to achieve results comparable to full-parameter tuning while requiring only approximately 75% of the time (20h versus 27h). Therefore, to enhance generalization and reduce training costs, we only fine-tune FFN layers.

Effect of the Number of Experts. Typically, increasing the number of experts directly leads to higher performance (Lepikhin et al., 2020; Fedus et al., 2022). In Table 6b, we change the number of experts while keeping the number of activated experts the same, so the number of activated parameters for both models remains the same. The model with more sparse experts outperforms the single-expert dense model by 1.1% on POPE and 0.6% on SQA$^{\text{I}}$. The results demonstrate that sparse experts can deliver superior performance.

Effect of the Number of Activated Experts. To evaluate the effect of the number of activated experts, we compare the performance of different top-$k$ strategies in Table 6c. Increasing the number of activated experts from 1 to 2 brings a significant improvement at the cost of only 1 additional hour of training time. These results show that activating more experts can improve the ability of MoE-LLaVA. To leverage the advantages of the MoE scheme, we set the number of activated experts to 2.
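
The difference between top-1 and top-2 activation can be illustrated with a small routing sketch. It assumes the router weights are softmax probabilities over all experts and that the selected experts' outputs are combined with those weights without renormalization; the scalar "experts", the function names, and the router scores are toy assumptions, not the model's actual modules.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert index, softmax weight) for the k highest-scoring experts."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy scalar experts and hypothetical router scores for one token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
router_logits = [0.0, 0.0, math.log(3), math.log(3)]
top1_output = moe_forward(1.0, experts, router_logits, k=1)
top2_output = moe_forward(1.0, experts, router_logits, k=2)
```

With `k=1` only one expert contributes to the token's output, while `k=2` blends two experts, so each token uses the compute of two FFNs regardless of how many experts exist in total.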

Effect of the Architectures. In Table 6d, we explore four variations of MoE architecture. Specifically, “First-Half” indicates that MoE layers are applied only to the first half of the model while the second half retains the original dense architecture. “Second-Half” means that MoE layers are placed in the second half of the model while the first half remains dense. “Interval” represents alternating occurrences of MoE layers and dense layers. “All” indicates that all layers are sparse MoE layers. Intuitively, it is expected that incorporating all MoE will enhance performance. However, using “All” does not yield better results and results in longer training times compared to other architectures. Therefore, MoE-LLaVA alternates the insertion of MoE layers.
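
The four placements compared in Table 6d can be expressed as sets of layer indices. In the sketch below, the function name, the 0-based indexing, and the choice to start the "interval" pattern at index 1 are assumptions; the text only states that MoE and dense layers alternate.

```python
def moe_layer_indices(num_layers, architecture):
    """Indices of the layers whose FFN is replaced by a MoE layer."""
    half = num_layers // 2
    if architecture == "first_half":
        return list(range(half))
    if architecture == "second_half":
        return list(range(half, num_layers))
    if architecture == "interval":
        # Assumption: every second layer, starting from index 1.
        return list(range(1, num_layers, 2))
    if architecture == "all":
        return list(range(num_layers))
    raise ValueError(f"unknown architecture: {architecture}")

# Example with a hypothetical 24-layer model.
interval_layers = moe_layer_indices(24, "interval")
all_layers = moe_layer_indices(24, "all")
```

For an even number of layers, the first three placements convert half of the layers to MoE, while "all" converts twice as many, which is in line with its longer training time in Table 6d.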

Table 7: Ablation study about the model size of MoE-LLaVA.

| Model | MoE | VQA$^{\text{v2}}$ | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | MMB | LLaVA$^{\text{W}}$ |
|---|---|---|---|---|---|---|
| StableLM | ✗ | 74.5 | 62.0 | 48.8 | 58.2 | 83.2 |
| StableLM | ✓ | 76.7 | 62.6 | 50.1 | 60.2 | 86.8 |
| Qwen | ✗ | 74.9 | 60.2 | 48.3 | 60.6 | 86.3 |
| Qwen | ✓ | 76.2 | 63.1 | 48.0 | 59.7 | 88.7 |
| Phi-2 | ✗ | 75.6 | 67.8 | 50.0 | 65.0 | 91.3 |
| Phi-2 | ✓ | 77.6 | 68.5 | 51.4 | 65.2 | 94.1 |

Effect of the Model Size. As shown in Table 7, we compare the performance of models with different parameter sizes as the foundation models for MoE-LLaVA. For smaller models such as Phi2-MoE and Qwen-MoE, the performance with MoE surpasses that of dense models. We provide additional results in Section B.1.

5 Conclusion and Future Directions

In this work, we propose MoE-Tuning for adapting the MoE architecture to LVLMs, and construct the MoE-based sparse model MoE-LLaVA, which can find a sparse pathway by simultaneously handling image and text features. Our framework demonstrates a strong ability for multi-modal understanding and rich potential for hallucination inhibition, achieving performance comparable to LLaVA-1.5-7B with only 3B activated parameters.

While MoE-LLaVA demonstrates competitive capabilities, we observe some difficulties in training stability, particularly with 16-bit float precision. Furthermore, due to the presence of multiple experts specializing in different abilities, MoE-LLaVA can easily be expanded to handle additional tasks such as detection, segmentation, generation, or handling more modalities such as video, depth, and thermal.

Impact Statements

Broader Impacts

While MoE-LLaVA holds great potential and application value in multi-modal understanding, it may also have some negative social impacts:

  • Information credibility: MoE-LLaVA can generate realistic texts, including false information and misleading content.
  • Bias and discrimination: The training data for MoE-LLaVA often comes from the internet, where various biases and discriminatory content may exist. If these unequal patterns are learned and amplified by the model, they may be reflected in the generated responses.
  • Social influence: People may become overly reliant on MoE-LLaVA for information and problem-solving, instead of actively thinking and seeking multiple sources of information. This can lead to increased dependency, reduced autonomy in thinking, and weakened judgment skills.

Reproducibility

In Section A.2, we have provided a detailed list of all the training hyperparameters. We have open-sourced all models and code. Reproducibility can be achieved by using the code provided in the materials.

Compute

For the main results, we conduct experiments on 8 A800-80G GPUs. For the ablation study, we measure the time on 8 V100-32G GPUs.

Licenses

The majority of this project is released under the Apache 2.0 license.

References

  • 01-ai (2023) 01-ai. Building the next generation of open-source and bilingual llms. https://github.com/01-ai/Yi, 2023.
  • Alayrac et al. (2022) Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Baevski & Auli (2018) Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
  • Bai et al. (2023a) Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
  • Bai et al. (2023b) Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
  • Bao et al. (2022) Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cha et al. (2023) Cha, J., Kang, W., Mun, J., and Roh, B. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023.
  • Chen et al. (2023a) Chen, J., Guo, L., Sun, J., Shao, S., Yuan, Z., Lin, L., and Zhang, D. Eve: Efficient vision-language pre-training with masked prediction and modality-aware moe. arXiv preprint arXiv:2308.11971, 2023a.
  • Chen et al. (2023b) Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., and Elhoseiny, M. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023b.
  • Chen et al. (2023c) Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023c.
  • Chen et al. (2023d) Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023d.
  • Chen et al. (2023e) Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Muyan, Z., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023e.
  • Chiang et al. (2023) Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
    蒋伟林、李志、林志、盛勇、吴志、张华、郑磊、庄松、庄勇、冈萨雷斯·J·E 等。Vicuna:一个开源聊天机器人,以 90%*的 ChatGPT 质量给 GPT-4 留下深刻印象。见 https://vicuna.lmsys.org(访问于 2023 年 4 月 14 日),2023。
  • Chu et al. (2023) 储等人(2023) Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
    储翔、乔磊、林翔、徐松、杨勇、胡勇、魏峰、张翔、张彬、魏翔 等。Mobilevlm:一个快速、可复现且强大的移动设备视觉语言助手。arXiv 预印本 arXiv:2312.16886, 2023。
  • Dai et al. (2023) 戴等人(2023) Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
    戴伟、李杰、李东、张安明、赵杰、王伟、李彬、冯平。Instructblip:通过指令调优实现通用视觉语言模型,2023。
  • DeepSeek-AI (2024) DeepSeek-AI(2024) DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
    DeepSeek-AI。Deepseek LLM:通过长期主义扩展开源语言模型。arXiv 预印本 arXiv:2401.02954, 2024。
  • Du et al. (2021) 杜等人(2021) Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and Tang, J. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
    杜志、钱勇、刘翔、丁明、邱杰、杨志、唐杰。GLM:通过自回归空白填充进行通用语言模型预训练。arXiv 预印本 arXiv:2103.10360, 2021。
  • Eigen et al. (2013) 艾根等人(2013) Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
    艾根·D、兰扎托·M、苏茨克弗·I。在深度专家混合中学习分解表示。arXiv 预印本 arXiv:1312.4314, 2013。
  • falconry (2023) falconry(2023) falconry. Falcon-180b. https://falconllm.tii.ae/, 2023.
    falconry。Falcon-180b。https://falconllm.tii.ae/, 2023。
  • Fedus et al. (2022) Fedus 等人 (2022) Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
    Fedus, W., Zoph, B., 和 Shazeer, N. Switch transformers: 使用简单高效的稀疏性扩展到万亿参数模型. 《机器学习研究杂志》, 23(1):5232–5270, 2022.
  • FlagAI-Open (2023) FlagAI-Open. Aquila2-34b. https://github.com/FlagAI-Open/Aquila2, 2023.
  • Fu et al. (2023) Fu 等人 (2023) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., 和 Ji, R. Mme: 多模态大语言模型的综合评估基准. arXiv 预印本 arXiv:2306.13394, 2023.
  • Gong et al. (2023) Gong 等人 (2023) Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., and Chen, K. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
    Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., 和 Chen, K. Multimodal-gpt: 用于与人类对话的视觉和语言模型. arXiv 预印本 arXiv:2305.04790, 2023.
  • Gou et al. (2023) Gou 等人 (2023) Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.-Y., Kwok, J. T., and Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.
    Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.-Y., Kwok, J. T., 和 Zhang, Y. 用于视觉语言指令调优的簇条件 Lora 专家混合. arXiv 预印本 arXiv:2312.12379, 2023.
  • Goyal et al. (2017) Goyal 等人 (2017) Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904–6913, 2017.
    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., 和 Parikh, D. 让 VQA 中的 V 变得重要:提升图像理解在视觉问答中的作用。发表于 IEEE 计算机视觉与模式识别会议论文集,第 6904–6913 页,2017 年。
  • Gurari et al. (2018) Gurari 等人 (2018) Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3608–3617, 2018.
    Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., 和 Bigham, J. P. Vizwiz 大挑战:回答盲人提出的视觉问题。发表于 IEEE 计算机视觉与模式识别会议论文集,第 3608–3617 页,2018 年。
  • Hendrycks & Gimpel (2016)
    Hendrycks 和 Gimpel (2016)
    Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
    Hendrycks, D. 和 Gimpel, K. 高斯误差线性单元 (GELUs)。arXiv 预印本 arXiv:1606.08415, 2016 年。
  • Hudson & Manning (2019) Hudson 和 Manning (2019) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6700–6709, 2019.
    Hudson, D. A. 和 Manning, C. D. GQA:一个用于真实世界视觉推理和组合问答的新数据集。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,第 6700–6709 页,2019 年。
  • Jacobs et al. (1991) Jacobs 等人 (1991) Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
    Jacobs, R. A., Jordan, M. I., Nowlan, S. J., 和 Hinton, G. E. 局部专家的自适应混合。神经计算,3(1):79–87, 1991 年。
  • Jiang et al. (2023) Jiang 等人 (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023.
    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., 和 Sayed, W. E. Mistral 7b, 2023 年。
  • Jiang et al. (2024) Jiang 等人 (2024) Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.
    Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., 和 Sayed, W. E. Mixtral of experts, 2024 年。
  • Koh et al. (2023) Koh 等人 (2023) Koh, J. Y., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.
    Koh, J. Y., Salakhutdinov, R., 和 Fried, D. 将语言模型与图像结合用于多模态生成。arXiv 预印本 arXiv:2301.13823, 2023 年。
  • Komatsuzaki et al. (2022)
    Komatsuzaki 等人 (2022)
    Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., and Houlsby, N. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
    Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., 和 Houlsby, N. 稀疏升级:从密集检查点训练专家混合模型。arXiv 预印本 arXiv:2212.05055, 2022 年。
  • Kudugunta et al. (2021) Kudugunta 等人 (2021) Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.
    Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., 和 Firat, O. 超越蒸馏:用于高效推理的任务级专家混合模型。arXiv 预印本 arXiv:2110.03742, 2021 年。
  • Lai et al. (2023) Lai 等人 (2023) Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., and Jia, J. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., 和 Jia, J. Lisa: 通过大语言模型进行推理分割。arXiv 预印本 arXiv:2308.00692, 2023 年。
  • Laurençon et al. (2023) Laurençon 等人 (2023) Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., Cord, M., and Sanh, V. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
    Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., Cord, M., 和 Sanh, V. Obelics: 一个开放的网络规模过滤数据集,包含交错的图文文档,2023 年。
  • Lepikhin et al. (2020) Lepikhin 等人 (2020) Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., 和 Chen, Z. Gshard: 通过条件计算和自动分片扩展巨型模型。arXiv 预印本 arXiv:2006.16668, 2020 年。
  • Li et al. (2023a) Li 等人 (2023a) Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., and Liu, Z. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
    Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., 和 Liu, Z. Mimic-it: 多模态上下文指令调优。arXiv 预印本 arXiv:2306.05425, 2023a 年。
  • Li et al. (2022) Li 等人 (2022) Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.  12888–12900. PMLR, 2022.
    Li, J., Li, D., Xiong, C., 和 Hoi, S. Blip: 通过引导语言-图像预训练实现统一的视觉-语言理解和生成。发表于国际机器学习会议论文集,第 12888–12900 页。PMLR, 2022 年。
  • Li et al. (2023b) Li 等人 (2023b) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.
    Li, J., Li, D., Savarese, S., 和 Hoi, S. Blip-2: 通过冻结图像编码器和大型语言模型引导语言-图像预训练。arXiv 预印本 arXiv:2301.12597, 2023b 年。
  • Li et al. (2023c) 李等人(2023c) Li, X., Yao, Y., Jiang, X., Fang, X., Meng, X., Fan, S., Han, P., Li, J., Du, L., Qin, B., et al. Flm-101b: An open llm and how to train it with 100 k budget. arXiv preprint arXiv:2309.03852, 2023c.
    李晓,姚远,姜晓,方晓,孟晓,范晓,韩鹏,李杰,杜磊,秦斌,等。Flm-101b:一个开放的 LLM 及其如何用 10 万美元预算进行训练。arXiv 预印本 arXiv:2309.03852,2023c。
  • Li et al. (2023d) 李等人(2023d) Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d.
    李阳,杜阳,周凯,王杰,赵文轩,温建荣。评估大型视觉语言模型中的对象幻觉。arXiv 预印本 arXiv:2305.10355,2023d。
  • Li et al. (2023e) 李等人(2023e) Li, Y., Hui, B., Yin, Z., Yang, M., Huang, F., and Li, Y. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839, 2023e.
    李阳,惠彬,尹志,杨明,黄飞,李阳。Pace:通过渐进和组合专家进行统一的多模态对话预训练。arXiv 预印本 arXiv:2305.14839,2023e。
  • Liang et al. (2022) 梁等人(2022) Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
    梁伟文,张宇,权勇,杨尚,邹建业。注意差距:理解多模态对比表示学习中的模态差距。神经信息处理系统进展,35:17612–17625,2022。
  • Lin et al. (2023) 林等人(2023) Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
    林斌,朱彬,叶宇,宁明,金鹏,袁磊。Video-llava:通过对齐再投影学习统一的视觉表示。arXiv 预印本 arXiv:2311.10122,2023。
  • Liu et al. (2023a) 刘等人(2023a) Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
    刘飞,林凯,李磊,王杰,雅各布,王磊。通过稳健的指令调优对齐大型多模态模型。arXiv 预印本 arXiv:2306.14565,2023a。
  • Liu et al. (2023b) 刘等人(2023b) Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
    刘 H., 李 C., 李 Y., 和李 Y. J. 通过视觉指令调优改进基线。arXiv 预印本 arXiv:2310.03744, 2023b.
  • Liu et al. (2023c) 刘等人(2023c) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
    刘 H., 李 C., 吴 Q., 和李 Y. J. 视觉指令调优。arXiv 预印本 arXiv:2304.08485, 2023c.
  • Liu et al. (2023d) 刘等人(2023d) Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023d.
    刘 Y., 段 H., 张 Y., 李 B., 张 S., 赵 W., 袁 Y., 王 J., 何 C., 刘 Z., 等人。Mmbench:你的多模态模型是全能选手吗?arXiv 预印本 arXiv:2307.06281, 2023d.
  • Liu et al. (2022) 刘等人(2022) Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12009–12019, 2022.
    刘 Z., 胡 H., 林 Y., 姚 Z., 谢 Z., 魏 Y., 宁 J., 曹 Y., 张 Z., 董 L., 等人。Swin Transformer v2:扩展容量和分辨率。在 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 12009–12019, 2022.
  • Liu et al. (2023e) 刘等人(2023e) Liu, Z., He, Y., Wang, W., Wang, W., Wang, Y., Chen, S., Zhang, Q., Lai, Z., Yang, Y., Li, Q., et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 3, 2023e.
    刘 Z., 何 Y., 王 W., 王 W., 王 Y., 陈 S., 张 Q., 赖 Z., 杨 Y., 李 Q., 等人。Interngpt:通过与 ChatGPT 互动解决以视觉为中心的任务。arXiv 预印本 arXiv:2305.05662, 3, 2023e.
  • Long et al. (2023) 龙等人(2023) Long, Z., Killick, G., McCreadie, R., and Camarasa, G. A. Multiway-adapater: Adapting large-scale multi-modal models for scalable image-text retrieval. arXiv preprint arXiv:2309.01516, 2023.
    龙 Z., Killick G., McCreadie R., 和 Camarasa G. A. Multiway-adapater:适应大规模多模态模型以实现可扩展的图像-文本检索。arXiv 预印本 arXiv:2309.01516, 2023.
  • Lu et al. (2022) 陆等人(2022) Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
    陆 P., Mishra S., 夏 T., 邱 L., 张 K.-W., 朱 S.-C., Tafjord O., Clark P., 和 Kalyan A. 学习解释:通过思维链进行科学问答的多模态推理。神经信息处理系统进展,35:2507–2521, 2022.
  • Ma et al. (2023) 马等人(2023) Ma, G., Wu, X., Wang, P., and Hu, S. Cot-mote: Exploring contextual masked auto-encoder pre-training with mixture-of-textual-experts for passage retrieval. arXiv preprint arXiv:2304.10195, 2023.
    马 G., 吴 X., 王 P., 和胡 S. Cot-mote:探索使用混合文本专家进行段落检索的上下文掩码自动编码器预训练。arXiv 预印本 arXiv:2304.10195, 2023.
  • Microsoft (2023) 微软(2023) Microsoft. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models, 2023.
    微软。Phi-2:小型语言模型的惊人力量。https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models, 2023.
  • Mustafa et al. (2022) Mustafa 等人(2022) Mustafa, B., Riquelme, C., Puigcerver, J., Jenatton, R., and Houlsby, N. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564–9576, 2022.
    Mustafa B., Riquelme C., Puigcerver J., Jenatton R., 和 Houlsby N. 使用 LIMoE 进行多模态对比学习:语言-图像专家混合。神经信息处理系统进展,35:9564–9576, 2022.
  • OpenAI (2023) OpenAI(2023) OpenAI. Gpt-4 technical report, 2023.
    OpenAI。GPT-4 技术报告,2023.
  • Pearson (1901) 皮尔逊(1901) Pearson, K. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559–572, 1901.
    皮尔逊 K. Liii. 关于空间点系统的最接近拟合线和平面。伦敦、爱丁堡和都柏林哲学杂志和科学杂志,2(11):559–572, 1901.
  • Penedo et al. (2023) Penedo 等人(2023) Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
    Penedo G., Malartic Q., Hesslow D., Cojocaru R., Cappelli A., Alobeidli H., Pannier B., Almazrouei E., 和 Launay J. Falcon LLM 的 RefinedWeb 数据集:仅使用网络数据超越精心策划的语料库。arXiv 预印本 arXiv:2306.01116, 2023.
  • Peng et al. (2023) 彭等人(2023) Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
    彭 Z., 王 W., 董 L., 郝 Y., 黄 S., 马 S., 和魏 F. Kosmos-2:将多模态大语言模型与世界联系起来。arXiv 预印本 arXiv:2306.14824, 2023.
  • Pi et al. (2023) 皮等人(2023) Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., and Zhang, L. K. T. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023.
    皮 R., 高 J., 刁 S., 潘 R., 董 H., 张 J., 姚 L., 韩 J., 徐 H., 和张 L. K. T. DetGPT:通过推理检测你需要的。arXiv 预印本 arXiv:2305.14167, 2023.
  • Radford et al. (2021) Radford 等人(2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
    Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., 等人。从自然语言监督中学习可迁移的视觉模型。在国际机器学习会议论文集,页码 8748–8763. PMLR, 2021.
  • Rasheed et al. (2023) Rasheed 等人 (2023) Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R. M., Xing, E., Yang, M.-H., and Khan, F. S. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
    Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R. M., Xing, E., Yang, M.-H., 和 Khan, F. S. Glamm: 像素定位大型多模态模型。arXiv 预印本 arXiv:2311.03356, 2023.
  • Riquelme et al. (2021) Riquelme 等人 (2021) Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
    Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., 和 Houlsby, N. 使用稀疏专家混合扩展视觉。神经信息处理系统进展, 34:8583–8595, 2021.
  • Satar et al. (2022) Satar 等人 (2022) Satar, B., Zhu, H., Zhang, H., and Lim, J. H. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv preprint arXiv:2206.12845, 2022.
    Satar, B., Zhu, H., Zhang, H., 和 Lim, J. H. Rome: 面向文本到视频检索的角色感知专家混合 Transformer。arXiv 预印本 arXiv:2206.12845, 2022.
  • Scao et al. (2022) Scao 等人 (2022) Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., 等人。Bloom: 一个 1760 亿参数的开放访问多语言模型。arXiv 预印本 arXiv:2211.05100, 2022.
  • Shazeer et al. (2017) Shazeer 等人 (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., 和 Dean, J. 超大规模神经网络:稀疏门控专家混合层。arXiv 预印本 arXiv:1701.06538, 2017.
  • Shen et al. (2023) Shen 等人 (2023) Shen, S., Yao, Z., Li, C., Darrell, T., Keutzer, K., and He, Y. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
    Shen, S., Yao, Z., Li, C., Darrell, T., Keutzer, K., 和 He, Y. 使用稀疏专家混合扩展视觉语言模型。arXiv 预印本 arXiv:2303.07226, 2023.
  • Singh et al. (2019) Singh 等人 (2019) Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8317–8326, 2019.
    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., 和 Rohrbach, M. 迈向能够阅读的 VQA 模型。在 IEEE/CVF 计算机视觉与模式识别会议论文集, 第 8317–8326 页, 2019.
  • Sun et al. (2023) Sun 等人 (2023) Sun, T., Zhang, X., He, Z., Li, P., Cheng, Q., Yan, H., Liu, X., Shao, Y., Tang, Q., Zhao, X., et al. Moss: Training conversational language models from synthetic data. arXiv preprint arXiv:2307.15020, 7, 2023.
    Sun, T., Zhang, X., He, Z., Li, P., Cheng, Q., Yan, H., Liu, X., Shao, Y., Tang, Q., Zhao, X., 等人。Moss: 从合成数据训练对话语言模型。arXiv 预印本 arXiv:2307.15020, 7, 2023.
  • SUSTech-IDEA (2023) SUSTech-IDEA. Sus-chat: Instruction tuning done right. https://github.com/SUSTech-IDEA/SUS-Chat, 2023.
    SUSTech-IDEA。Sus-chat: 正确进行指令微调。https://github.com/SUSTech-IDEA/SUS-Chat, 2023.
  • Taori et al. (2023) Taori 等人 (2023) Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023.
    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., 和 Hashimoto, T. B. Alpaca: 一个强大且可复制的指令跟随模型。斯坦福基础模型研究中心。https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  • Team (2023) Team, I. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
    Team, I. Internlm: 一个多语言模型,具有逐步增强的能力, 2023.
  • (75) Team, S. A. L. Stable lm 2 1.6b. URL [https://huggingface.co/stabilityai/stablelm-2-1.6b](https://huggingface.co/stabilityai/stablelm-2-1.6b).
    Team, S. A. L. Stable lm 2 1.6b. URL [https://huggingface.co/stabilityai/stablelm-2-1.6b](https://huggingface.co/stabilityai/stablelm-2-1.6b).
  • Touvron et al. (2023a) Touvron 等人 (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., 等人。Llama: 开放且高效的基础语言模型。arXiv 预印本 arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Touvron 等人 (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., 等人。Llama 2: 开放基础和微调的聊天模型。arXiv 预印本 arXiv:2307.09288, 2023b.
  • Wang et al. (2023a) Wang 等人 (2023a) Wang, G., Cheng, S., Zhan, X., Li, X., Song, S., and Liu, Y. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235, 2023a.
    Wang, G., Cheng, S., Zhan, X., Li, X., Song, S., 和 Liu, Y. Openchat: 使用混合质量数据推进开源语言模型。arXiv 预印本 arXiv:2309.11235, 2023a.
  • Wang et al. (2023b) Wang 等人 (2023b) Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., and Jiang, Y.-G. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023b.
    Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., 和 Jiang, Y.-G. 眼见为实:提示 GPT-4V 以更好地进行视觉指令微调。arXiv 预印本 arXiv:2311.07574, 2023b.
  • Wang et al. (2019) Wang 等人 (2019) Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.
    王强、李斌、肖涛、朱杰、李成、黄大福、赵丽莎。学习深度 Transformer 模型用于机器翻译。arXiv 预印本 arXiv:1906.01787, 2019。
  • Wang et al. (2022) 王等人(2022) Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
    王伟、包华、董磊、比约克、彭志、刘强、阿格瓦尔、穆罕默德、辛格尔、索姆等。图像如外语:用于所有视觉和视觉语言任务的 Beit 预训练。arXiv 预印本 arXiv:2208.10442, 2022。
  • Wang et al. (2023c) 王等人(2023c) Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023c.
    王伟,陈志,陈晓,吴杰,朱晓,曾刚,罗鹏,陆涛,周杰,乔宇,等。Visionllm:大型语言模型也是面向视觉任务的开放式解码器。arXiv 预印本 arXiv:2305.11175,2023c。
  • Wang et al. (2023d) 王等人(2023d) Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023d.
    王伟,吕强,余伟,洪伟,齐杰,王宇,季杰,杨志,赵磊,宋晓,等。Cogvlm:预训练语言模型的视觉专家。arXiv 预印本 arXiv:2311.03079,2023d。
  • Wei et al. (2022) 魏等人(2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
    魏杰,王晓,舒尔曼斯,博斯马,夏飞,池恩,乐庆伟,周东,等。链式思维提示在大型语言模型中引发推理。神经信息处理系统进展,35:24824–24837,2022。
  • Yang et al. (2023) 杨等人(2023) Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., Pan, D., Wang, D., Yan, D., et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
    杨安,肖斌,王斌,张斌,卞成,尹成,吕成,潘东,王东,严东,等。Baichuan 2:开放的大规模语言模型。arXiv 预印本 arXiv:2309.10305,2023。
  • Ye et al. (2023) 叶等人(2023) Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
    叶强,徐华,徐刚,叶杰,严明,周勇,王杰,胡安,石平,石勇,等。mplug-owl:模块化赋能大型语言模型多模态能力。arXiv 预印本 arXiv:2304.14178,2023。
  • Yin et al. (2023) 尹等人(2023) Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
    尹山,傅成,赵山,李凯,孙晓,徐涛,陈恩。多模态大型语言模型综述。arXiv 预印本 arXiv:2306.13549,2023。
  • Yu et al. (2023) 余等人(2023) Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
    余伟,杨志,李磊,王杰,林凯,刘志,王晓,王磊。Mm-vet:评估大型多模态模型的综合能力。arXiv 预印本 arXiv:2308.02490,2023。
  • Yuan et al. (2023) 袁等人(2023) Yuan, Z., Li, Z., and Sun, L. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.
    袁志,李志,孙磊。Tinygpt-v:通过小型骨干网络实现高效多模态大型语言模型。arXiv 预印本 arXiv:2312.16862,2023。
  • Zeng et al. (2022) 曾等人(2022) Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
    曾安,刘翔,杜志,王志,赖华,丁明,杨志,徐阳,郑伟,夏晓,等。GLM-130B:一个开放的双语预训练模型。arXiv 预印本 arXiv:2210.02414,2022。
  • Zhang et al. (2023a) 张等人(2023a) Zhang, P., Wang, X. D. B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.
    张鹏,王晓东,曹阳,徐成,欧阳磊,赵志,丁山,张山,段红,严红,等。Internlm-xcomposer:一个用于高级文本-图像理解和创作的视觉语言大型模型。arXiv 预印本 arXiv:2309.15112,2023a。
  • Zhang et al. (2022) 张等人(2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
    张山,罗勒,戈雅尔,阿特克谢,陈明,陈山,德万,迪亚布,李翔,林晓伟,等。OPT:开放预训练 Transformer 语言模型。arXiv 预印本 arXiv:2205.01068,2022。
  • Zhang et al. (2023b) Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023b.
  • Zhang & Yang (2023) Zhang, X. and Yang, Q. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4435–4439, 2023.
  • Zhang et al. (2023c) Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., and Sun, T. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023c.
  • Zhao et al. (2023a) Zhao, B., Wu, B., and Huang, T. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023a.
  • Zhao et al. (2023b) Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., and Kang, B. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023b.
  • Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  • Zhu et al. (2023) Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • Zhu et al. (2022) Zhu, J., Zhu, X., Wang, W., Wang, X., Li, H., Wang, X., and Dai, J. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.
  • Zhu et al. (2024) Zhu, Y., Zhu, M., Liu, N., Ou, Z., Mou, X., and Tang, J. Llava-phi: Efficient multi-modal assistant with small language model, 2024.
  • Zoph et al. (2022) Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Appendix for MoE-LLaVA

Appendix A Implementation Details

A.1 More Model Architecture

In Table 8, we present additional variants of MoE-LLaVA. Equation 12 shows how the total number of parameters is calculated. When the number of activated experts is 2, setting $Experts=2$ in the same formula yields the number of activated parameters.

\begin{equation}
\begin{aligned}
Total\_Parameters = {} & Embedding \cdot Width \\
& + Layers \cdot (4 \cdot Width \cdot Width + Width \cdot FFN \cdot FFN\_Factor + 2 \cdot Width) \\
& + Width + Width \cdot Embedding \\
& + MoE\_Layers \cdot (Experts - 1) \cdot (Width \cdot FFN \cdot FFN\_Factor + 2 \cdot Width) \\
& + MoE\_Layers \cdot (Width \cdot Experts)
\end{aligned}
\tag{12}
\end{equation}
Table 8: More architecture details of the MoE-LLaVA model. "FFN Factor" represents the number of linear layers in the FFN. "*" denotes that the dimension of the hidden states for the keys (k) and values (v) is 1024. "1.6B×4-Top2" represents a dense foundation model with 1.6B parameters that is equipped with a total of four experts, two of which are activated. "†" denotes that all layers are equipped with an MoE layer.

| Name | Experts | Top-k | MoE Layers | Embedding | Width | Layers | FFN | FFN Factor | Heads | Activated Param | Total Param |
|---|---|---|---|---|---|---|---|---|---|---|---|
| StableLM-1.6B (Team, ) | - | - | - | 100352 | 2560 | 32 | 10240 | 2 | 32 | 1.6B | 1.6B |
| MoE-LLaVA-1.6B×4-Top2 | 4 | 2 | 16 | 100352 | 2560 | 32 | 10240 | 2 | 32 | 2.0B | 2.9B |
| MoE-LLaVA-1.6B×4-Top2† | 4 | 2 | 32 | 100352 | 2560 | 32 | 10240 | 2 | 32 | 2.5B | 4.1B |
| Qwen-1.8B (Bai et al., 2023a) | - | - | - | 151936 | 2048 | 24 | 5504 | 3 | 16 | 1.8B | 1.8B |
| MoE-LLaVA-1.8B×4-Top2 | 4 | 2 | 12 | 151936 | 2048 | 24 | 5504 | 3 | 16 | 2.2B | 3.1B |
| MoE-LLaVA-1.8B×4-Top2† | 4 | 2 | 24 | 151936 | 2048 | 24 | 5504 | 3 | 16 | 2.6B | 4.3B |
| Phi2-2.7B (Microsoft, 2023) | - | - | - | 51200 | 2560 | 32 | 10240 | 2 | 32 | 2.7B | 2.7B |
| MoE-LLaVA-2.7B×4-Top2 | 4 | 2 | 16 | 51200 | 2560 | 32 | 10240 | 2 | 32 | 3.6B | 5.3B |
| MoE-LLaVA-2.7B×4-Top2† | 4 | 2 | 32 | 51200 | 2560 | 32 | 10240 | 2 | 32 | 4.5B | 7.8B |
| OpenChat-7B (Wang et al., 2023a) | - | - | - | 32000 | 4096* | 32 | 14336 | 3 | 32 | 6.7B | 6.7B |
| MoE-LLaVA-7B×4-Top2 | 4 | 2 | 16 | 32000 | 4096* | 32 | 14336 | 3 | 32 | 9.6B | 15.2B |
| MoE-LLaVA-7B×4-Top2† | 4 | 2 | 32 | 32000 | 4096* | 32 | 14336 | 3 | 32 | 12.4B | 23.7B |
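
The parameter formula in Equation 12 can be checked against Table 8 with a few lines of Python. The sketch below is illustrative only: the function name `count_parameters` and its argument names are assumptions introduced here, not part of the released code. It evaluates the formula for the MoE-LLaVA-2.7B×4-Top2 row, once with `experts=4` for the total count and once with `experts=2` for the activated count.

```python
def count_parameters(embedding, width, layers, ffn, ffn_factor,
                     moe_layers=0, experts=1):
    """Evaluate Equation 12 (hypothetical helper, names are assumptions)."""
    # Parameters of one FFN block: Width * FFN * FFN_Factor + 2 * Width
    ffn_block = width * ffn * ffn_factor + 2 * width
    # Dense backbone: input embedding, all layers, Width + output projection
    dense = (embedding * width
             + layers * (4 * width * width + ffn_block)
             + width + width * embedding)
    # Extra experts beyond the first one, in every MoE layer
    extra_experts = moe_layers * (experts - 1) * ffn_block
    # Router weights: one Width x Experts matrix per MoE layer
    routers = moe_layers * (width * experts)
    return dense + extra_experts + routers


# Values taken from the MoE-LLaVA-2.7B×4-Top2 row of Table 8
phi2 = dict(embedding=51200, width=2560, layers=32, ffn=10240, ffn_factor=2)

total = count_parameters(**phi2, moe_layers=16, experts=4)
activated = count_parameters(**phi2, moe_layers=16, experts=2)

print(f"total:     {total / 1e9:.1f}B")      # Table 8 reports 5.3B
print(f"activated: {activated / 1e9:.1f}B")  # Table 8 reports 3.6B
```

Setting `moe_layers=0` and `experts=1` reduces the formula to the dense backbone alone, since both MoE terms vanish.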
Table 9: Training hyperparameters.

| Config | Stage I | Stage II | Stage III |
|---|---|---|---|
| Experts | - | - | 4 |
| Top-k | - | - | 2 |
| Deepspeed | Zero2 | Zero2 | Zero2_offload |
| Data | LLaVA-PT | Hybrid-PT | LLaVA-FT |
| Image resolution | 336×336 | 336×336 | 336×336 |
| Image encoder | CLIP-Large/336 | CLIP-Large/336 | CLIP-Large/336 |
| Feature select layer | -2 | -2 | -2 |
| Image projector | 2 Linear layers with GeLU | 2 Linear layers with GeLU | 2 Linear layers with GeLU |
| Epoch | 1 | 1 | 1 |
| Learning rate | 1e-3 | 2e-5 | 2e-5 |
| Learning rate schedule | Cosine | Cosine | Cosine |
| Weight decay | 0.0 | 0.0 | 0.0 |
| Text max length | 2048 | 2048 | 2048 |
| Batch size per GPU | 32 | 16 | 16 |
| GPU | 8 × A800-80G | 8 × A800-80G | 8 × A800-80G |
| Precision | Bf16 | Bf16 | Bf16 |

A.2 Training Details

As shown in Table 9, we present the training hyperparameters for all models; they apply to Qwen, StableLM, Phi, and OpenChat. In all stages we train for 1 epoch, because we find that the models overfit when trained for 2 epochs. The batch size is 256 for the first stage and 128 for the second and third stages. We use an image resolution of 336×336 in all three stages. Smaller models such as Qwen-1.8B can also be trained on 8 V100-32G GPUs; however, using fp16 during training sometimes causes the loss to become NaN. Since our models are smaller than 7B, we can train them in zero2 mode. For stage 3, however, DeepSpeed does not yet support training MoE architectures in zero3 mode, so we choose zero2_offload to further reduce memory requirements and enable training on 8 A800-80G GPUs. We enable gradient checkpointing in all training stages.
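
To make the stage 3 setting concrete, the following is a minimal sketch of what a ZeRO-2 configuration with optimizer offload might look like, written as a Python dictionary. It is an assumption-laden illustration, not the configuration shipped with MoE-LLaVA: the helper names `stage3_deepspeed_config` and `global_batch_size` are hypothetical, and `gradient_accumulation_steps=1` is assumed because 16 samples per GPU on 8 GPUs already gives the stated batch size of 128.

```python
def stage3_deepspeed_config(per_gpu_batch=16):
    """Hypothetical ZeRO-2 + CPU offload config (not the authors' file)."""
    return {
        "train_micro_batch_size_per_gpu": per_gpu_batch,
        # Assumption: no gradient accumulation, so 16 x 8 GPUs = 128
        "gradient_accumulation_steps": 1,
        # bf16 precision, as listed in Table 9
        "bf16": {"enabled": True},
        "zero_optimization": {
            # ZeRO stage 2 partitions optimizer states and gradients
            "stage": 2,
            # "offload" variant: keep optimizer states in CPU memory
            "offload_optimizer": {"device": "cpu"},
        },
    }


def global_batch_size(config, num_gpus=8):
    """Effective batch size across all GPUs."""
    return (config["train_micro_batch_size_per_gpu"]
            * config["gradient_accumulation_steps"]
            * num_gpus)


config = stage3_deepspeed_config()
print(global_batch_size(config))  # 128 for stages II and III
```

The same helper with `per_gpu_batch=32` reproduces the stage 1 batch size of 256.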

Appendix B Additional Results and Visualization

B.1 Model Scaling

Table 10: Ablation study about the model size of MoE-LLaVA.

| Model | MoE | VQA^v2 | SQA^I | VQA^T | MMB | LLaVA^W |
|---|---|---|---|---|---|---|
| StableLM | ✗ | 74.5 | 62.0 | 48.8 | 58.2 | 83.2 |
| StableLM | ✓ | 76.0 | 62.6 | 47.8 | 59.4 | 85.9 |
| Qwen | ✗ | 74.9 | 60.2 | 48.3 | 60.6 | 86.3 |
| Qwen | ✓ | 76.2 | 63.1 | 48.0 | 59.7 | 88.7 |
| Phi-2 | ✗ | 75.6 | 67.8 | 50.0 | 65.0 | 91.3 |
| Phi-2 | ✓ | 77.6 | 68.5 | 51.4 | 65.2 | 94.1 |
| OpenChat | ✗ | 77.9 | 69.0 | 54.7 | 66.9 | 89.7 |
| OpenChat | ✓ | 78.9 | 62.8 | 52.5 | 65.9 | 86.3 |

As shown in Table 10, models smaller than 7B exhibit a clear scaling trend: the performance of MoE-LLaVA improves as the model size increases, as exemplified by StableLM-1.6B, Qwen-1.8B, and Phi-2.7B. Surprisingly, however, the overall performance of OpenChat-MoE is significantly inferior to that of its dense counterpart. We speculate that the data currently available for multi-modal instruction tuning is insufficient to support sparse pattern learning in 10B-level models, which should be addressed in future work when scaling up to larger MoE-LLaVA models.

B.2 Training Capacity

For MoE layers, we employ the Batch Priority Routing (BPR) strategy (Riquelme et al., 2021). This strategy uses the routing results to determine which tokens should be dropped, ensuring a more balanced workload among the experts. During training, BPR dynamically adjusts the routing results for each expert based on its capacity: when the tokens assigned to an expert exceed its predefined capacity, the excess tokens are dropped. We conduct an ablation study on the capacity hyperparameter, as shown in Table 11. Increasing the capacity consistently improves the average performance across different sizes of MoE-LLaVA.
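
The capacity-based dropping described above can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions rather than the training implementation: it uses top-1 routing instead of top-2, the function name `bpr_top1_dispatch` is hypothetical, and the per-expert capacity is assumed to be `ceil(capacity_factor * num_tokens / num_experts)`, which the text does not specify. Tokens are visited in order of decreasing router confidence, so when an expert is full it is the lowest-priority tokens that get dropped.

```python
import math


def bpr_top1_dispatch(gate_probs, num_experts, capacity_factor):
    """Assign each token to its top-1 expert, dropping overflow tokens.

    gate_probs: one list of router probabilities per token.
    Returns (assignment, capacity); assignment[t] is an expert index,
    or None if token t was dropped.
    """
    num_tokens = len(gate_probs)
    # Assumed capacity rule (not given in the text)
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)

    # Top-1 expert and its probability for every token
    choices = [max(range(num_experts), key=lambda e: p[e]) for p in gate_probs]
    scores = [p[c] for p, c in zip(gate_probs, choices)]

    # Batch priority: process the most confident tokens first
    order = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)

    load = [0] * num_experts
    assignment = [None] * num_tokens
    for t in order:
        expert = choices[t]
        if load[expert] < capacity:
            assignment[t] = expert
            load[expert] += 1
        # otherwise the expert is full and the token stays dropped
    return assignment, capacity


# Four tokens, two experts; three tokens prefer expert 0
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
tight, cap_tight = bpr_top1_dispatch(probs, num_experts=2, capacity_factor=1.0)
loose, cap_loose = bpr_top1_dispatch(probs, num_experts=2, capacity_factor=1.5)
print(tight)  # token 2 is dropped: expert 0 is full
print(loose)  # larger capacity keeps every token
```

In this toy batch a capacity factor of 1.0 drops the least confident token routed to the overloaded expert, while 1.5 keeps all tokens, which is the mechanism the capacity ablation varies.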

Table 11: Ablation study about the capacity of MoE-LLaVA. "Res." represents the input image resolution. "*" denotes that there is some overlap in the training data.

Image Question Answering: VQA^v2, GQA, VisWiz, SQA^I, VQA^T. Benchmark Toolkit: POPE, MMB, LLaVA^W, MM-Vet.

| Methods | Res. | Capacity | VQA^v2 | GQA | VisWiz | SQA^I | VQA^T | POPE | MMB | LLaVA^W | MM-Vet | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoE-LLaVA-1.6B×4-Top2 | 336 | 1.5 | 76.7* | 60.3* | 36.2 | 62.6 | 50.1 | 85.7 | 60.2 | 86.8 | 26.9 | 60.6 |
| MoE-LLaVA-1.6B×4-Top2 | 336 | 1.0 | 76.0* | 60.4* | 37.2 | 62.6 | 47.8 | 84.3 | 59.4 | 85.9 | 26.1 | 59.9 |
| MoE-LLaVA-2.7B×4-Top2 | 336 | 1.5 | 77.6* | 61.4* | 43.9 | 68.5 | 51.4 | 86.3 | 65.2 | 94.1 | 34.3 | 64.7 |
| MoE-LLaVA-2.7B×4-Top2 | 336 | 1.0 | 77.1* | 61.1* | 43.4 | 68.7 | 50.2 | 85.0 | 65.5 | 93.2 | 31.1 | 63.9 |
MoE-LLaVA-2.7B×4-Top2 384 1.5 79.9* 62.6*