Example text-to-image prompts from the figure: (1) "A minimalist photo of an orange tangerine with a green stem and leaves, symbolizing prosperity, sitting on a red silk cloth during Chinese New Year." (2) "A clear image of a blackboard with a clean, dark green surface and the word 'Hello' written precisely and legibly in the center with bold, white chalk letters." (3) "Capture a close-up shot of a vibrant sunflower in full bloom, with a honeybee perched on its petals, its delicate wings catching the sunlight."
As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal image generation from short prompts and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes, 1B and 7B, demonstrating the scalability of the decoupled visual encoding method.
Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Best viewed on screen.
The architecture of Janus-Pro is shown in Figure 3 and is the same as that of Janus [46]. The core design principle of the overall architecture is to decouple visual encoding for multimodal understanding and generation. We apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. For multimodal understanding, we use the SigLIP [53] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor maps them into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [38] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, a generation adaptor maps the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. In addition to the LLM's built-in prediction head, we also utilize a randomly initialized prediction head for image predictions in the visual generation task. The entire model adheres to an autoregressive framework.
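To make the decoupled design concrete, the following is a minimal PyTorch-style sketch, not the released implementation; the module names, dimensions, and placeholder values (llm_dim, codebook_size, and so on) are illustrative assumptions, and the encoders and the transformer itself are omitted.

```python
import torch
import torch.nn as nn

class DecoupledEncodingSketch(nn.Module):
    """Minimal sketch of decoupled visual encoding for understanding vs. generation."""

    def __init__(self, llm_dim=2048, semantic_dim=1024, codebook_size=16384,
                 code_dim=8, text_vocab=32000):
        super().__init__()
        # Understanding path: semantic features (e.g., from a SigLIP-like encoder)
        # are mapped into the LLM input space by an understanding adaptor.
        self.und_adaptor = nn.Linear(semantic_dim, llm_dim)
        # Generation path: discrete VQ IDs index a codebook; a generation adaptor
        # maps the codebook embeddings into the same LLM input space.
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.gen_adaptor = nn.Linear(code_dim, llm_dim)
        # Prediction heads: the LLM's built-in text head, plus a separate,
        # randomly initialized head for predicting image token IDs.
        self.text_head = nn.Linear(llm_dim, text_vocab)
        self.image_head = nn.Linear(llm_dim, codebook_size)

    def embed_understanding(self, semantic_feats):
        # semantic_feats: (B, H, W, semantic_dim) -> flatten the 2-D grid to a 1-D sequence
        b, h, w, d = semantic_feats.shape
        return self.und_adaptor(semantic_feats.reshape(b, h * w, d))

    def embed_generation(self, image_ids):
        # image_ids: (B, H, W) discrete VQ indices -> embed and project to the LLM space
        b, h, w = image_ids.shape
        return self.gen_adaptor(self.codebook(image_ids.reshape(b, h * w)))

# Usage: concatenate text embeddings with either visual sequence and feed the
# result to the autoregressive transformer (omitted here).
model = DecoupledEncodingSketch()
und_tokens = model.embed_understanding(torch.randn(1, 24, 24, 1024))
gen_tokens = model.embed_generation(torch.randint(0, 16384, (1, 24, 24)))
print(und_tokens.shape, gen_tokens.shape)  # both torch.Size([1, 576, 2048])
```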
To address this issue, we make two modifications.
We also adjust the data ratio across different types of datasets during the Stage III supervised fine-tuning process, changing the proportion of multimodal data, pure text data, and text-to-image data from 7:3:10 to 5:1:4. We observe that slightly reducing the proportion of text-to-image data allows us to maintain strong visual generation capability while improving multimodal understanding performance.
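As a rough illustration of what such a ratio change means in practice, the snippet below samples training sources according to the 5:1:4 mix described above; the sampling scheme and the source names are assumptions for illustration, not the actual data pipeline.

```python
import random

# Assumed source names; weights follow the 5:1:4 ratio described in the text.
SFT_MIX = {"multimodal_understanding": 5, "pure_text": 1, "text_to_image": 4}

def next_source(mix=SFT_MIX, rng=random):
    """Choose which dataset the next fine-tuning example is drawn from."""
    names, weights = zip(*mix.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: over many draws the realized proportions approach 50% / 10% / 40%.
counts = {name: 0 for name in SFT_MIX}
for _ in range(10_000):
    counts[next_source()] += 1
print({name: round(c / 10_000, 2) for name, c in counts.items()})
```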
We scale up the training data used for Janus for both multimodal understanding and visual generation.
The previous version of Janus validates the effectiveness of visual encoding decoupling using a 1.5B LLM. In Janus-Pro, we scale the model up to 7B; the hyperparameters of both the 1.5B and 7B LLMs are detailed in Table 1. We observe that when utilizing a larger-scale LLM, the losses for both multimodal understanding and visual generation converge significantly faster than with the smaller model. This finding further validates the strong scalability of this approach.
Table 1 | Architectural configuration for Janus-Pro. We list the hyperparameters of the architecture.
Table 2 | Detailed hyperparameters for training Janus-Pro. Data ratio refers to the ratio of multimodal understanding data, pure text data, and visual generation data.
Multimodal Understanding. To assess multimodal understanding capabilities, we evaluate our model on widely recognized image-based vision-language benchmarks, including GQA [17], POPE [23], MME [12], SEED [21], MMB [29], MM-Vet [51], and MMMU [52].
Visual Generation. For evaluating visual generation capabilities, we use GenEval [14] and DPG-Bench [16]. GenEval is a challenging benchmark for text-to-image generation, designed to reflect the comprehensive generative abilities of visual generation models by offering a detailed instance-level analysis of their compositional capabilities. DPG-Bench (Dense Prompt Graph Benchmark) is a comprehensive dataset consisting of 1065 lengthy, dense prompts, designed to assess the intricate semantic alignment capabilities of text-to-image models.
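As a hedged sketch of how such benchmarks are typically run, the loop below generates and saves one or more images per prompt so that the outputs can be scored by the official GenEval and DPG-Bench evaluation scripts; `model.generate_image` and the file layout are hypothetical stand-ins, not the actual APIs.

```python
from pathlib import Path

def collect_samples(model, prompts, out_dir="eval_samples", images_per_prompt=4):
    """Generate and save images for each benchmark prompt (hypothetical model API)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, prompt in enumerate(prompts):
        for j in range(images_per_prompt):
            image = model.generate_image(prompt)   # hypothetical text-to-image call
            image.save(out / f"{i:05d}_{j}.png")   # scored later by the benchmark tools
```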
Multimodal Understanding Performance. We compare the proposed method with state-of-the-art unified models and understanding-only models in Table 3. Janus-Pro achieves the overall best results. This can be attributed to decoupling the visual encoding for multimodal understanding and generation, mitigating the conflict between these two tasks. When compared to models with significantly larger sizes, Janus-Pro remains highly competitive. For instance, Janus-Pro-7B outperforms TokenFlow-XL (13B) on all benchmarks except GQA.
Table 5 | Performances on DPG-Bench. The methods in this table are all generation-specific models except Janus and Janus-Pro.
[1] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[2] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
[3] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
[4] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
[5] J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.
[6] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
[7] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
[8] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[10] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[11] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
[12] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[13] Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.
[14] D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
[15] High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm.
[16] X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[17] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700-6709, 2019.
[18] Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023.
[19] H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics.
[20] H. Laurençon, A. Marafioti, V. Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions, 2024.
[21] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[22] D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.
[23] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[24] Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P. Wang. Dual diffusion for unified image generation and understanding. arXiv preprint arXiv:2501.00289, 2024.
[25] Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
[26] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296-26306, 2024.
[27] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[28] H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024.
[29] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[30] Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. Yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024.
[31] mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15m, 2024.
[32] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[33] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. 2024.
[34] L. Qu, H. Zhang, Y. Liu, X. Wang, Y. Jiang, Y. Gao, H. Ye, D. K. Du, Z. Yuan, and X. Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024.
[35] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[36] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022.
[37] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[38] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[39] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[40] C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
[41] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[42] S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.
[43] Vivym. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/midjourney-prompts, 2023.
[44] C. Wang, G. Lu, J. Yang, R. Huang, J. Han, L. Hou, W. Zhang, and H. Xu. Illume: Illuminating your llms to see, draw, and self-enhance. arXiv preprint arXiv:2412.06673, 2024.
[45] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
[46] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024.
[47] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[48] Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.
[49] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
[50] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
[51] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[52] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556-9567, 2024.
[53] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975-11986, 2023.
[54] C. Zhao, Y. Song, W. Wang, H. Feng, E. Ding, Y. Sun, X. Xiao, and J. Wang. Monoformer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024.
[55] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.
[56] Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.
[57] L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y. Wang, Z. Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024.