
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan
DeepSeek-AI
Project Page: https://github.com/deepseek-ai/Janus

Abstract

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to a larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

1. Introduction

Figure 1 | Multimodal understanding and visual generation results from our Janus-Pro. For multimodal understanding, we average the accuracy of POPE, MME-Perception, GQA, and MMMU. The scores of MME-Perception are divided by 20 to scale to [0, 100]. For visual generation, we evaluate the performance on two instruction-following benchmarks, GenEval and DPG-Bench. Overall, Janus-Pro outperforms the previous state-of-the-art unified multimodal models as well as some task-specific models. Best viewed on screen.
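As a concrete illustration of the averaging described in this caption, here is a worked example using the Janus-Pro-7B numbers reported in Table 3 (Figure 1 itself may round differently):

```latex
% Average multimodal understanding accuracy, as defined in the Figure 1 caption,
% computed for Janus-Pro-7B from the Table 3 entries (MME-Perception scaled by 1/20).
\[
\mathrm{Avg} = \tfrac{1}{4}\left(\mathrm{POPE} + \tfrac{\mathrm{MME\text{-}P}}{20} + \mathrm{GQA} + \mathrm{MMMU}\right)
             = \tfrac{1}{4}\left(87.4 + \tfrac{1567.1}{20} + 62.0 + 41.0\right) \approx 67.2
\]
```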

Prompts shown in Figure 2: "A minimalist photo of an orange tangerine with a green stem and leaves, symbolizing prosperity, sitting on a red silk cloth during Chinese New Year." / "A clear image of a blackboard with a clean, dark green surface and the word 'Hello' written precisely and legibly in the center with bold, white chalk letters." / "Capture a close-up shot of a vibrant sunflower in full bloom, with a honeybee perched on its petals, its delicate wings catching the sunlight."
Figure 2 | Comparison of text-to-image generation between Janus-Pro and its predecessor, Janus. Janus-Pro delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text. The image resolution is 384 × 384. Best viewed on screen.
Recent advancements in unified multimodal understanding and generation models have demonstrated significant progress [30, 40, 45, 46, 48, 50, 54, 55]. These approaches have been proven to enhance the instruction-following capabilities in visual generation tasks while reducing model redundancy. Most of these methods utilize the same visual encoder to process inputs for both multimodal understanding and generation tasks. Since the representations required for these two tasks differ, this often results in suboptimal performance in multimodal understanding. To address this issue, Janus [46] proposes decoupling visual encoding, which alleviates the conflict between multimodal understanding and generation tasks, achieving excellent performance in both tasks.
As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal image generation from short prompts and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes: 1B and 7B, demonstrating the scalability of the visual encoding decoupling method.
We evaluate Janus-Pro on multiple benchmarks, and the results reveal its superior multimodal understanding capabilities and significantly improved text-to-image instruction-following performance. Specifically, Janus-Pro-7B achieves a score of 79.2 on the multimodal understanding benchmark MMBench [29], surpassing state-of-the-art unified multimodal models such as Janus [46] (69.4), TokenFlow [34] (68.9) and MetaMorph [42] (75.2). Additionally, on the text-to-image instruction-following leaderboard GenEval [14], Janus-Pro-7B scores 0.80, outperforming Janus [46] (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium [11] (0.74).

Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Best viewed on screen.

2. Method

2.1. Architecture

The architecture of Janus-Pro is shown in Figure 3, which is the same as Janus [46]. The core design principle of the overall architecture is to decouple visual encoding for multimodal understanding and generation. We apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. For multimodal understanding, we use the SigLIP [53] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor is used to map these image features into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [38] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. Apart from the built-in prediction head in the LLM, we also utilize a randomly initialized prediction head for image predictions in the visual generation task. The entire model adheres to an autoregressive framework.
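To make the data flow above concrete, here is a minimal PyTorch-style sketch of the decoupled forward pass. It is an illustrative toy, not the released Janus-Pro code: the module names, dimensions, and the non-causal stand-in backbone are assumptions; only the overall wiring (separate understanding/generation encodings, two-layer MLP adaptors, a shared LLM, and separate text and image heads) follows the description in this section.

```python
# Toy sketch of Janus-Pro's decoupled visual encoding (illustrative only).
import torch
import torch.nn as nn

D_LLM = 2048        # LLM embedding size (Table 1, 1B variant)
D_SIGLIP = 1024     # assumed SigLIP feature dimension
CODEBOOK = 16384    # VQ codebook size (Section 3.1)
VOCAB = 100_000     # text vocabulary size (Table 1)

class DecoupledSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Understanding path: SigLIP semantic features -> LLM input space.
        self.und_adaptor = nn.Sequential(
            nn.Linear(D_SIGLIP, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM))
        # Generation path: VQ codebook embeddings -> LLM input space.
        self.codebook = nn.Embedding(CODEBOOK, D_LLM)
        self.gen_adaptor = nn.Sequential(
            nn.Linear(D_LLM, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM))
        # Stand-in for the autoregressive LLM backbone (not causal here).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_LLM, nhead=16, batch_first=True),
            num_layers=2)
        self.text_head = nn.Linear(D_LLM, VOCAB)      # built-in LM head
        self.image_head = nn.Linear(D_LLM, CODEBOOK)  # randomly initialized image head

    def forward(self, text_embeds, siglip_feats=None, vq_ids=None):
        parts = [text_embeds]
        if siglip_feats is not None:   # understanding: 2-D grid already flattened to 1-D
            parts.append(self.und_adaptor(siglip_feats))
        if vq_ids is not None:         # generation: flattened sequence of discrete IDs
            parts.append(self.gen_adaptor(self.codebook(vq_ids)))
        hidden = self.llm(torch.cat(parts, dim=1))     # one multimodal sequence
        return self.text_head(hidden), self.image_head(hidden)

# Toy usage: 8 text tokens plus 4 image tokens for each path.
model = DecoupledSketch()
text = torch.randn(1, 8, D_LLM)
text_logits, image_logits = model(
    text,
    siglip_feats=torch.randn(1, 4, D_SIGLIP),
    vq_ids=torch.randint(0, CODEBOOK, (1, 4)))
```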

2.2. Optimized Training Strategy

The previous version of Janus employs a three-stage training process. Stage I focuses on training the adaptors and the image head. Stage II handles unified pretraining, during which all components except the understanding encoder and the generation encoder have their parameters updated. Stage III is supervised fine-tuning, building upon Stage II by further unlocking the parameters of the understanding encoder during training. This training strategy has certain issues. In Stage II, Janus divides the training for text-to-image capabilities into two parts following PixArt [4]. The first part trains on ImageNet [9] data, using image category names as prompts for text-to-image generation, with the goal of modeling pixel dependence. The second part trains on normal text-to-image data. During implementation, 66.67% of the text-to-image training steps in Stage II are allocated to the first part. However, through further experimentation, we find that this strategy is suboptimal and leads to significant computational inefficiency.
To address this issue, we make two modifications.
  • Longer Training in Stage I: We increase the training steps in Stage I, allowing sufficient training on the ImageNet dataset. Our findings reveal that even with the LLM parameters fixed, the model can effectively model pixel dependence and generate reasonable images based on category names.
  • Focused Training in Stage II: In Stage II, we drop ImageNet data and directly utilize normal text-to-image data to train the model to generate images based on dense descriptions. This redesigned approach enables Stage II to utilize the text-to-image data more efficiently, resulting in improved training efficiency and overall performance.
We also adjust the data ratio in the Stage III supervised fine-tuning process across different types of datasets, changing the proportion of multimodal data, pure text data, and text-to-image data from 7:3:10 to 5:1:4. By slightly reducing the proportion of text-to-image data, we observe that this adjustment allows us to maintain strong visual generation capabilities while achieving improved multimodal understanding performance.
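For reference, the resulting three-stage recipe (what is trainable and the multimodal : pure-text : text-to-image data ratio per stage) can be summarized as a plain configuration sketch. This is a descriptive reading of the text above and of Table 2, not the actual Janus-Pro training configuration; the field names are invented for illustration.

```python
# Descriptive summary of the optimized training strategy (illustrative only).
JANUS_PRO_TRAINING_STAGES = {
    "stage_1": {
        "trainable": ["understanding adaptor", "generation adaptor", "image head"],
        "note": "LLM frozen; longer training on ImageNet with category names as prompts",
        "data_ratio": (1, 0, 3),   # multimodal : pure text : text-to-image (Table 2)
    },
    "stage_2": {
        "trainable": "everything except the understanding and generation encoders",
        "note": "unified pretraining; ImageNet dropped, only normal text-to-image data",
        "data_ratio": (2, 3, 5),
    },
    "stage_3": {
        "trainable": "as in Stage II, plus the understanding encoder",
        "note": "supervised fine-tuning; ratio changed from Janus's 7:3:10",
        "data_ratio": (5, 1, 4),
    },
}
```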

2.3. Data Scaling

We scale up the training data used for Janus in terms of both multimodal understanding and visual generation.
  • Multimodal Understanding. For the Stage II pretraining data, we refer to DeepSeek-VL2 [49] and add approximately 90 million samples. These include image caption datasets (e.g., YFCC [31]), as well as data for table, chart, and document understanding (e.g., Docmatix [20]). For the Stage III supervised fine-tuning data, we also incorporate additional datasets from DeepSeek-VL2, such as MEME understanding, Chinese conversational data, and datasets aimed at enhancing dialogue experiences. These additions significantly expand the model's capabilities, enriching its ability to handle diverse tasks while improving the overall conversational experience.
  • Visual Generation. We observe that the real-world data used in the previous version of Janus lacks quality and contains significant noise, which often leads to instability in text-to-image generation, resulting in aesthetically poor outputs. In Janus-Pro, we incorporate approximately 72 million samples of synthetic aesthetic data, bringing the ratio of real to synthetic data to 1:1 during the unified pretraining stage. The prompts for these synthetic data samples are publicly available, such as those in [43]. Experiments demonstrate that the model converges faster when trained on synthetic data, and the resulting text-to-image outputs are not only more stable but also exhibit significantly improved aesthetic quality.

2.4. Model Scaling

The previous version of Janus validates the effectiveness of visual encoding decoupling using a 1.5B LLM. In Janus-Pro, we scale the model up to 7B, with the hyperparameters of both the 1.5B and 7B LLMs detailed in Table 1. We observe that when utilizing a larger-scale LLM, the convergence speed of the losses for both multimodal understanding and visual generation improves significantly compared to the smaller model. This finding further validates the strong scalability of this approach.
Table 1 | Architectural configuration for Janus-Pro. We list the hyperparameters of the architecture.
| | Janus-Pro-1B | Janus-Pro-7B |
| :--- | :---: | :---: |
| Vocabulary size | 100K | 100K |
| Embedding size | 2048 | 4096 |
| Context Window | 4096 | 4096 |
| #Attention heads | 16 | 32 |
| #Layers | 24 | 30 |
Table 2 | Detailed hyperparameters for training Janus-Pro. Data ratio refers to the ratio of multimodal understanding data, pure text data, and visual generation data.
| | Janus-Pro-1B | | | Janus-Pro-7B | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Hyperparameters | Stage 1 | Stage 2 | Stage 3 | Stage 1 | Stage 2 | Stage 3 |
| Learning rate | $1.0 \times 10^{-3}$ | $1.0 \times 10^{-4}$ | $4.0 \times 10^{-5}$ | $1.0 \times 10^{-3}$ | $1.0 \times 10^{-4}$ | $4.0 \times 10^{-5}$ |
| LR scheduler | Constant | Constant | Constant | Constant | Constant | Constant |
| Weight decay | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gradient clip | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$) | | | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$) | | |
| Warm-up steps | 600 | 5000 | 0 | 600 | 5000 | 0 |
| Training steps | 20K | 360K | 80K | 20K | 360K | 40K |
| Batch size | 256 | 512 | 128 | 256 | 512 | 128 |
| Data Ratio | 1:0:3 | 2:3:5 | 5:1:4 | 1:0:3 | 2:3:5 | 5:1:4 |

3. Experiments

3.1. Implementation Details

In our experiments, we utilize DeepSeek-LLM (1.5B and 7B) [3] with a maximum supported sequence length of 4096 as the base language model. For the vision encoder used in understanding tasks, we select SigLIP-Large-Patch16-384 [53]. The generation encoder has a codebook of size 16,384 and downsamples images by a factor of 16. Both the understanding adaptor and the generation adaptor are two-layer MLPs. The detailed hyperparameters for each stage are provided in Table 2. All images are resized to 384 × 384 pixels. For multimodal understanding data, we resize the long side of the image and pad the short side with the background color (RGB: 127, 127, 127) to reach 384. For visual generation data, the short side is resized to 384, and the long side is cropped to 384. We use sequence packing during training to improve training efficiency. We mix all data types according to the specified ratios in a single training step. Janus-Pro is trained and evaluated using HAI-LLM [15], which is a lightweight and efficient distributed training framework built on top of PyTorch. The whole training process took about 7/14 days on a cluster of 16/32 nodes for the 1.5B/7B model, each node equipped with 8 Nvidia A100 (40GB) GPUs.
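The two resizing schemes described above can be sketched as follows using Pillow; the 384-pixel target and the gray padding color come from the text, while the interpolation settings and the centered crop are assumptions of this illustration.

```python
# Sketch of the image preprocessing described in Section 3.1 (assumptions noted above).
from PIL import Image, ImageOps

TARGET = (384, 384)
PAD_COLOR = (127, 127, 127)  # background color used to pad the short side

def preprocess_understanding(img: Image.Image) -> Image.Image:
    # Resize so the long side reaches 384, then pad the short side with gray.
    return ImageOps.pad(img, TARGET, color=PAD_COLOR)

def preprocess_generation(img: Image.Image) -> Image.Image:
    # Resize so the short side reaches 384, then crop the long side to 384.
    return ImageOps.fit(img, TARGET)

# Example: preprocess_understanding(Image.open("photo.jpg")).size == (384, 384)
```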

3.2. Evaluation Setup

Multimodal Understanding. To assess multimodal understanding capabilities, we evaluate our model on widely recognized image-based vision-language benchmarks, which include GQA [17], POPE [23], MME [12], SEED [21], MMB [29], MM-Vet [51], and MMMU [52].

Table 3 | Comparison with state-of-the-arts on multimodal understanding benchmarks. "Und." and "Gen." denote "understanding" and "generation", respectively. Models using external pretrained diffusion model are marked with †.

| Type | Model | # LLM Params | POPE ↑ | MME-P ↑ | MMB ↑ | SEED ↑ | GQA ↑ | MMMU | MM-Vet ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Und. Only | LLaVA-v1.5-Phi-1.5 [50] | 1.3B | 84.1 | 1128.0 | - | - | 56.5 | 30.7 | - |
| | MobileVLM [6] | 1.4B | 84.5 | 1196.2 | 53.2 | - | 56.1 | - | - |
| | MobileVLM-V2 [7] | 1.4B | 84.3 | 1302.8 | 57.7 | - | 59.3 | - | - |
| | MobileVLM [6] | 2.7B | 84.9 | 1288.9 | 59.6 | - | 59.0 | - | - |
| | MobileVLM-V2 [7] | 2.7B | 84.7 | 1440.5 | 63.2 | - | 61.1 | - | - |
| | LLaVA-Phi [56] | 2.7B | 85.0 | 1335.1 | 59.8 | - | - | - | 28.9 |
| | LLaVA [27] | 7B | 76.3 | 809.6 | 38.7 | 33.5 | - | - | 25.5 |
| | LLaVA-v1.5 [26] | 7B | 85.9 | 1510.7 | 64.3 | 58.6 | 62.0 | 35.4 | 31.1 |
| | InstructBLIP [8] | 7B | - | - | 36.0 | 53.4 | 49.2 | - | 26.2 |
| | Qwen-VL-Chat [1] | 7B | - | 1487.5 | 60.6 | 58.2 | 57.5 | - | - |
| | IDEFICS-9B [19] | 8B | - | - | 48.2 | - | 38.4 | - | - |
| | Emu3-Chat [45] | 8B | 85.2 | 1244 | 58.5 | 68.2 | 60.3 | 31.6 | 37.2 |
| | InstructBLIP [8] | 13B | 78.9 | 1212.8 | - | - | 49.5 | - | 25.6 |
| Und. and Gen. | DreamLLM† [10] | 7B | - | - | - | - | - | - | 36.6 |
| | LaVIT† [18] | 7B | - | - | - | - | 46.8 | - | - |
| | MetaMorph† [42] | 8B | - | - | 75.2 | 71.8 | - | - | - |
| | Emu† [39] | 13B | - | - | - | - | - | - | - |
| | NExT-GPT† [47] | 13B | - | - | - | - | - | - | - |
| | Show-o [50] | 1.3B | 73.8 | 948.4 | - | - | 48.7 | 25.1 | - |
| | D-DiT [24] | 2.0B | 84.0 | 1124.7 | - | - | 59.2 | - | - |
| | Gemini-Nano-1 [41] | 1.8B | - | - | - | - | - | 26.3 | - |
| | ILLUME [44] | 7B | 88.5 | 1445.3 | 65.1 | 72.9 | - | 38.2 | 37.0 |
| | TokenFlow-XL [34] | 13B | 86.8 | 1545.9 | 68.9 | 68.7 | 62.7 | 38.7 | 40.7 |
| | LWM [28] | 7B | 75.2 | - | - | - | 44.8 | - | 9.6 |
| | VILA-U [48] | 7B | 85.8 | 1401.8 | - | 59.0 | 60.8 | - | 33.5 |
| | Chameleon [40] | 7B | - | - | - | - | - | 22.4 | 8.3 |
| | Janus | 1.5B | 87.0 | 1338.0 | 69.4 | 63.7 | 59.1 | 30.5 | 34.3 |
| | Janus-Pro-1B | 1.5B | 86.2 | 1444.0 | 75.5 | 68.3 | 59.3 | 36.3 | 39.8 |
| | Janus-Pro-7B | 7B | 87.4 | 1567.1 | 79.2 | 72.1 | 62.0 | 41.0 | 50.0 |

Visual Generation. For evaluating visual generation capabilities, we use GenEval [14] and DPG-Bench [16]. GenEval is a challenging benchmark for text-to-image generation, designed to reflect the comprehensive generative abilities of visual generation models by offering a detailed instance-level analysis of their compositional capabilities. DPG-Bench (Dense Prompt Graph Benchmark) is a comprehensive dataset consisting of 1065 lengthy, dense prompts, designed to assess the intricate semantic alignment capabilities of text-to-image models.

3.3. Comparison with State-of-the-arts

Multimodal Understanding Performance. We compare the proposed method with state-of-the-art unified models and understanding-only models in Table 3. Janus-Pro achieves the overall best results. This can be attributed to decoupling the visual encoding for multimodal understanding and generation, mitigating the conflict between these two tasks. When compared to models with significantly larger sizes, Janus-Pro remains highly competitive. For instance, Janus-Pro-7B outperforms TokenFlow-XL (13B) on all benchmarks except GQA.
Table 4 | Evaluation of text-to-image generation ability on GenEval benchmark. "Und." and "Gen." denote "understanding" and "generation", respectively. Models using external pretrained diffusion model are marked with †.
| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Gen. Only | LlamaGen [38] | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| | LDM [37] | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 | 0.37 |
| | SDv1.5 [37] | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.43 |
| | PixArt-α [4] | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 |
| | SDv2.1 [37] | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| | DALL-E 2 [35] | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 |
| | Emu3-Gen [45] | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| | SDXL [32] | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| | DALL-E 3 [2] | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| | SD3-Medium [11] | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Und. and Gen. | SEED-X† [13] | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| | Show-o [50] | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| | D-DiT [24] | 0.97 | 0.80 | 0.54 | 0.76 | 0.32 | 0.50 | 0.65 |
| | LWM [28] | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| | Transfusion [55] | - | - | - | - | - | - | 0.63 |
| | ILLUME [44] | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| | TokenFlow-XL [28] | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | Chameleon [40] | - | - | - | - | - | - | 0.39 |
| | Janus [46] | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| | Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
| | Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
Table 5 | Performances on DPG-Bench. The methods in this table are all generation-specific models except Janus and Janus-Pro.
| Method | Global | Entity | Attribute | Relation | Other | Overall ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| SDv1.5 [36] | 74.63 | 74.23 | 75.39 | 73.49 | 67.81 | 63.18 |
| PixArt-α [4] | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 | 71.11 |
| Lumina-Next [57] | 82.82 | 88.65 | 86.44 | 80.53 | 81.82 | 74.63 |
| SDXL [33] | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| Playground v2.5 [22] | 83.06 | 82.59 | 81.20 | 84.08 | 83.50 | 75.47 |
| Hunyuan-DiT [25] | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.87 |
| PixArt-Σ [5] | 86.89 | 82.89 | 88.94 | 86.59 | 87.68 | 80.54 |
| Emu3-Gen [45] | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| DALL-E 3 [2] | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-Medium [11] | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| Janus | 82.33 | 87.38 | 87.70 | 85.46 | 86.41 | 79.68 |
| Janus-Pro-1B | 87.58 | 88.63 | 88.17 | 88.98 | 88.30 | 82.63 |
| Janus-Pro-7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
Visual Generation Performance. We report visual generation performance on GenEval and DPG-Bench. As shown in Table 4, our Janus-Pro-7B obtains 80% overall accuracy on GenEval, which outperforms all the other unified or generation-only methods, e.g., Transfusion [55] (63%), SD3-Medium (74%) and DALL-E 3 (67%). This demonstrates that our approach has better instruction-following capabilities. As shown in Table 5, Janus-Pro achieves a score of 84.19 on DPG-Bench, surpassing all other methods. This demonstrates that Janus-Pro excels in following dense instructions for text-to-image generation.

3.4. Qualitative Results

We present results on multimodal understanding in Figure 4. Janus-Pro exhibits impressive comprehension abilities when handling inputs from various contexts, showcasing its powerful capabilities. We also present some text-to-image generation results in the lower part of Figure 4. The images generated by Janus-Pro-7B are highly realistic, and despite having a resolution of only 384 × 384, they still contain a lot of details. For imaginative and creative scenes, Janus-Pro-7B accurately captures the semantic information from the prompts, producing well-reasoned and coherent images.

Figure 4 | Qualitative results of multimodal understanding and visual generation capability. The model is Janus-Pro-7B and the image output resolution of visual generation is 384 × 384. Best viewed on screen.

4. Conclusion

This paper introduces improvements to Janus from three aspects: training strategy, data, and model size. These enhancements have led to significant advancements in both multimodal understanding and text-to-image instruction-following capabilities. However, Janus-Pro still has certain limitations. In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.

References

[1] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.

[2] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

[3] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

[4] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

[5] J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.

[6] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.

[7] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

[8] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248-255. Ieee, 2009.

[10] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.

[11] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.

[12] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[13] Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.

[14] D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.

[15] High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm.

[16] X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.

[17] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700-6709, 2019.

[18] Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023.

[19] H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics.

[20] H. Laurençon, A. Marafioti, V. Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions, 2024.

[21] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.

[22] D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.

[23] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

[24] Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P. Wang. Dual diffusion for unified image generation and understanding. arXiv preprint arXiv:2501.00289, 2024.

[25] Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024.

[26] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296-26306, 2024.

[27] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.

[28] H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024.

[29] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.

[30] Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024.

[31] mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15m, 2024.
[32] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

[33] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. 2024.

[34] L. Qu, H. Zhang, Y. Liu, X. Wang, Y. Jiang, Y. Gao, H. Ye, D. K. Du, Z. Yuan, and X. Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024.

[35] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

[36] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022.

[37] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684-10695, 2022.

[38] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

[39] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.

[40] C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

[41] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[42] S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.

[43] Vivym. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/midjourney-prompts, 2023. Accessed: [Insert Date of Access, e.g., 2023-10-15].
[44] C. Wang, G. Lu, J. Yang, R. Huang, J. Han, L. Hou, W. Zhang, and H. Xu. Illume: Illuminating your llms to see, draw, and self-enhance. arXiv preprint arXiv:2412.06673, 2024.

[45] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

[46] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024.

[47] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.

[48] Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.

[49] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.

[50] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.

[51] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.

[52] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556-9567, 2024.

[53] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975-11986, 2023.

[54] C. Zhao, Y. Song, W. Wang, H. Feng, E. Ding, Y. Sun, X. Xiao, and J. Wang. Monoformer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024.

[55] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.

[56] Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.

[57] L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y. Wang, Z. Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024.