
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Vikash Sehwag1 , Xianghao Kong2, Jingtao Li1, Michael Spranger1, Lingjuan Lyu1
Corresponding author: vikash.sehwag@sony.com
(March 2024)
Abstract

As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking and making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer at an economical cost of only $1,890 and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118× lower cost than stable diffusion models and 14× lower cost than the current state-of-the-art approach, which costs $28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.

1 Sony AI, 2 University of California, Riverside

An elegant squirrel pirate on a ship.; A hippo in a ballet performance in a white swan dress; A photo realistic koala bear playing a guitar; A rabbit magician pulling a hat.
A real life samurai standing in a green bamboo grove; A fairy tale castle in the clouds; A real looking vibrant butterfly on a photorealistic flower; A mystical cave with glowing crystals.
A cute rabbit in a stunning, detailed chinese coat; A frog on a lotus leaf.; Toy elephants made of Mahogany wood on a shelf; A logo of a fox in neon light.
Ancient castle of a cliff edge; Serene beach with colourful pebble stones; Photo realistic market place in a medieval town; Alien landscape with two moons.
A realistic train made of origami; A soccer playground made of snow and timber; Pre-war apartment complexes reflected in the water of a beautiful lake; A wolf coming out of the woods.
A dolphin astronaut in space in astronaut suit; Kangaroo boxing champions on a mountain top; A photo realistic turtle reading a book using a table lamp; A realistic penguin detective with magnifying glass in a crime scene.
A four layer wedding cake; A cactus smiling in a desert; Miniature cat made of lego playing with other cats in a room; An adorable baby dinosaur dressed as cowboy on a beach
Mountain covered in snow; A cat wearing a wizard hat; A train made of realistic pumpkins; A robot on a blue planet. Close up shot.
(a) Stable-Diffusion-1.5
(b) Stable-Diffusion-2.1
(c) Ours
(d) FID vs. training time
Figure 1: Qualitative evaluation of the image generation capabilities of our model (512×512 image resolution). Our model is trained in 2.6 days on a single 8×H100 machine (amounting to only $1,890 in GPU cost) without any proprietary or billion-image dataset. In (a)-(c) we examine diverse style generation capabilities using the prompt 'Image of an astronaut riding a horse in __ style', with the following styles: Origami, Pixel art, Line art, Cyberpunk, and Van Gogh Starry Night. In (d) we compare the training cost and fidelity, measured using FID (Heusel et al., 2017) for zero-shot image generation on the COCO (Lin et al., 2014) dataset, of all models.

1 Introduction

While modern visual generative models excel at creating photorealistic visual content and have propelled the generation of more than a billion images each year (Valyaeva, 2023), the cost and effort of training these models from scratch remain very high (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022). In text-to-image diffusion models (T2I) (Song et al., 2020a; Ho et al., 2020; Song et al., 2021), some previous works have successfully scaled down the computational cost compared to the 200,000 A100 GPU hours used by Stable Diffusion 2.1 (Rombach et al., 2022; Chen et al., 2023; Pernias et al., 2024; Gokaslan et al., 2024). However, the computational cost of even the state-of-the-art approaches (18,000 A100 GPU hours) remains very high (Chen et al., 2023), requiring more than a month of training time on the leading 8×H100 GPU machine. Furthermore, previous works either leverage larger-scale datasets spanning on the order of a billion images or use proprietary datasets to enhance performance (Rombach et al., 2022; Chen et al., 2023; Yu et al., 2022; Saharia et al., 2022). The high training cost and the dataset requirements create an inaccessible barrier to participating in the development of large-scale diffusion models. In this work, we aim to address this issue by developing a low-cost end-to-end pipeline for competitive text-to-image diffusion models that achieves more than an order of magnitude reduction in training cost compared to the state-of-the-art, while not requiring access to billions of training images or proprietary datasets (our code will be available at https://github.com/SonyResearch/micro_diffusion).

We consider vision transformer based latent diffusion models for text-to-image generation (Dosovitskiy et al., 2020; Peebles & Xie, 2023), particularly because of their simplified design and wide adoption across the latest large-scale diffusion models (Betker et al., 2023; Esser et al., 2024; Chen et al., 2023). To reduce the computational cost, we exploit the strong dependence of the transformer’s computational cost on the input sequence size, i.e., the number of patches per image. We aim to reduce the effective number of patches processed per image by the transformer during training. This objective can be easily achieved by randomly masking a fraction of the tokens at the input layer of the transformer. However, existing masking approaches fail to scale beyond a 50% masking ratio without drastically degrading performance (Zheng et al., 2024), particularly because, at high masking ratios, a large fraction of input patches are completely unobserved by the diffusion transformer.

To mitigate the massive performance degradation with masking, we propose a deferred masking strategy where all patches are preprocessed by a lightweight patch-mixer before being transferred to the diffusion transformer. The patch mixer comprises a fraction of the number of parameters in the diffusion transformer. In contrast to naive masking, masking after patch mixing allows non-masked patches to retain semantic information about the whole image and enables reliable training of diffusion transformers at very high masking ratios, while incurring no additional computational cost compared to existing state-of-the-art masking (Zheng et al., 2024). We also demonstrate that under an identical computational budget, our deferred masking strategy achieves better performance than model downscaling, i.e., reducing the size of the model. Finally, we incorporate recent advancements in transformer architecture, such as layer-wise scaling (Mehta et al., 2024) and sparse transformers using mixture-of-experts (Shazeer et al., 2017; Zoph et al., 2022; Zhou et al., 2022), to improve performance in large-scale training.

Our low-cost training pipeline naturally reduces the overhead of ablating experimental design choices at scale. One such ablation we run is examining the impact of the choice of training datasets on the performance of micro-budget training. In addition to using only real images, we consider combining additional synthetic images in the training dataset. We find that training on a combined dataset significantly improves image quality and alignment with human visual preferences. Furthermore, our combined dataset comprises only 37M images, significantly fewer than most existing large-scale models, thus providing strong evidence of high-quality generations from moderate-sized datasets in micro-budget training.

In summary, we make the following key contributions.

  • We propose a novel deferred patch masking strategy that preprocesses all patches before masking by appending a lightweight patch-mixer model to the diffusion transformer.

  • We conduct extensive ablations of deferred masking and demonstrate that it enables reliable training at high masking ratios and achieves significantly better performance compared to alternatives such as other masking strategies and model downscaling.

  • We demonstrate that rather than limiting to only real image datasets, combining additional synthetic images in micro-budget training significantly improves the quality of generated images.

  • Using a micro-budget of only $1,890, we train a 1.16 billion parameter sparse diffusion transformer on 37M images with a 75% masking ratio that achieves a 12.7 FID in zero-shot generation on the COCO dataset. The wall-clock time of our training is only 2.6 days on a single 8×H100 GPU machine, 14× lower than the current state-of-the-art approach that would take 37.6 training days ($28,400 GPU cost).

2 Methodology

In this section, we first provide a background on diffusion models, followed by a review of common approaches, particularly those based on patch masking, to reduce computational costs in training diffusion transformers. Then, we provide a detailed description of our proposed deferred patch masking strategy.

2.1 Background: Diffusion-based generative models and diffusion transformers

We denote the data distribution by $\mathcal{D}$, such that $({\boldsymbol{x}},{\boldsymbol{c}})\sim\mathcal{D}$, where ${\boldsymbol{x}}$ and ${\boldsymbol{c}}$ represent the image and the corresponding natural language caption, respectively. We assume that $p({\boldsymbol{x}};\sigma)$ is the image distribution obtained after adding $\sigma^{2}$-variance Gaussian noise.

Diffusion-based probabilistic models use a forward-reverse process-based approach, where the forward process corrupts the input data and the reverse process tries to reconstruct the original signal by learning to undo the data degradation at each step. Though multiple variations of diffusion models exist (Ho et al., 2020; Song et al., 2020b; Bansal et al., 2024), it is common to use time-dependent Gaussian noise ($\sigma(t)$) to model the data corruption. Under deterministic sampling, both the forward and reverse processes in diffusion models can be modeled using an ordinary differential equation (ODE),

$$\text{d}{\boldsymbol{x}}=-\dot{\sigma}(t)\,\sigma(t)\,\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma(t))\,\text{d}t, \qquad (1)$$

where $\dot{\sigma}(t)$ represents the time derivative and $\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma(t))$ is the score of the underlying density function. Thus the move towards (or away from) higher density regions in the diffusion process is proportional to both the scores of the density function and changes in the noise distribution ($\sigma(t)$) with time. Going beyond the deterministic sampling based on the ODE formulation in Equation 1, the sampling process in diffusion models can also be formulated as a stochastic differential equation,

$$\text{d}{\boldsymbol{x}}=\underbrace{-\dot{\sigma}(t)\,\sigma(t)\,\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma(t))\,\text{d}t}_{\text{Probability flow ODE (Eq.~1)}}-\underbrace{\beta(t)\,\sigma(t)^{2}\,\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma(t))\,\text{d}t}_{\text{Deterministic noise decay}}+\underbrace{\sqrt{2\beta(t)}\,\sigma(t)}_{\text{Noise injection}}\text{d}w_{t} \qquad (2)$$

where $\text{d}w_{t}$ is the standard Wiener process and the parameter $\beta(t)$ controls the rate of injection of additional noise. The generation process in diffusion models starts by sampling an instance from $p({\boldsymbol{x}},\sigma_{\text{max}})$ and iteratively denoising it using the ODE- or SDE-based formulation.
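To make the sampling procedure concrete, below is a minimal Euler discretization of the probability-flow ODE (Eq. 1); it uses the EDM-style score parameterization through the denoiser, $\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma)\approx(F_{\theta}({\boldsymbol{x}},\sigma)-{\boldsymbol{x}})/\sigma^{2}$, introduced formally in the Training paragraph that follows. The function names, schedule format, and solver choice are illustrative assumptions, not the exact sampler used in our experiments.

```python
import torch

@torch.no_grad()
def euler_ode_sampler(denoise_fn, latents, sigmas):
    """Minimal Euler solver for the probability-flow ODE (Eq. 1).

    denoise_fn(x, sigma) -> denoised estimate F_theta(x; sigma).
    latents: initial sample drawn from N(0, sigma_max^2 I).
    sigmas: decreasing noise levels, e.g. [sigma_max, ..., sigma_min].
    """
    x = latents
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Score estimate: grad_x log p(x; sigma) ~= (F_theta(x, sigma) - x) / sigma^2
        score = (denoise_fn(x, sigma) - x) / sigma**2
        # With t = sigma (so sigma_dot = 1), dx/dt = -sigma * score = (x - denoised) / sigma.
        d = -sigma * score
        x = x + d * (sigma_next - sigma)
    return x
```

Starting from a sample of $p({\boldsymbol{x}},\sigma_{\text{max}})$ and applying this step over a decreasing noise schedule recovers the deterministic sampling described above.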

Training. Diffusion model training is based on parameterizing the score of the density function, which does not depend on the generally intractable normalization factor, using a denoiser, i.e., $\nabla_{\boldsymbol{x}}\log p({\boldsymbol{x}};\sigma(t))\approx\left(F_{\theta}({\boldsymbol{x}},\sigma(t))-{\boldsymbol{x}}\right)/\sigma(t)^{2}$. The denoising function $F_{\theta}({\boldsymbol{x}},\sigma(t))$ is generally modeled using a deep neural network. For text-to-image generation, the denoiser is conditioned on both the noise distribution and the natural language caption, and is trained using the following loss function,

$$\mathcal{L}=\mathbb{E}_{({\boldsymbol{x}},{\boldsymbol{c}})\sim\mathcal{D}}\,\mathbb{E}_{{\boldsymbol{\epsilon}}\sim\mathcal{N}(\boldsymbol{0},\sigma(t)^{2}\mathbf{I})}\left\|F_{\theta}({\boldsymbol{x}}+{\boldsymbol{\epsilon}};\sigma(t),{\boldsymbol{c}})-{\boldsymbol{x}}\right\|_{2}^{2}. \qquad (3)$$

The noise distribution during training is lognormal ($\ln(\sigma)\sim\mathcal{N}(P_{\text{mean}},P_{\text{std}})$), where the choice of the mean and standard deviation of the noise distribution strongly influences the quality of generated samples. For example, the noise distribution is shifted to the right to sample larger noise values on higher resolution images, as the signal-to-noise ratio increases with resolution (Teng et al., 2023).
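A minimal sketch of the training objective in Eq. 3 with lognormal noise-level sampling might look as follows; the denoiser signature and the unweighted squared-error loss are illustrative assumptions.

```python
import torch

def diffusion_loss(denoiser, x, caption_emb, p_mean, p_std):
    """Denoising loss of Eq. 3 with lognormal noise levels: ln(sigma) ~ N(P_mean, P_std)."""
    b = x.shape[0]
    log_sigma = p_mean + p_std * torch.randn(b, device=x.device)
    sigma = log_sigma.exp().view(b, 1, 1, 1)            # broadcast over latent dims
    eps = torch.randn_like(x) * sigma                   # eps ~ N(0, sigma^2 I)
    denoised = denoiser(x + eps, sigma, caption_emb)    # F_theta(x + eps; sigma(t), c)
    return ((denoised - x) ** 2).mean()                 # || F_theta(...) - x ||_2^2
```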

Classifier-free guidance. To increase alignment between input captions and generated images, classifier-free guidance (Ho & Salimans, 2022) modifies the image denoiser to output a linear combination of denoised samples in the presence and absence of input captions.

$$\hat{F}_{\theta}({\boldsymbol{x}};\sigma(t),{\boldsymbol{c}})=F_{\theta}({\boldsymbol{x}};\sigma(t))+w\cdot\left(F_{\theta}({\boldsymbol{x}};\sigma(t),{\boldsymbol{c}})-F_{\theta}({\boldsymbol{x}};\sigma(t))\right), \qquad (4)$$

where $w\geq 1$ controls the strength of guidance. During training, a fraction of image captions (commonly set to 10%) are randomly dropped to learn unconditional generation.
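At sampling time, Eq. 4 amounts to two denoiser evaluations per step, combined as sketched below; representing the unconditional branch with a learned null caption embedding is an implementation assumption rather than a detail specified above.

```python
def cfg_denoise(denoiser, x, sigma, caption_emb, null_emb, w):
    """Classifier-free guidance (Eq. 4): extrapolate from the unconditional
    prediction toward the conditional one with guidance strength w >= 1."""
    uncond = denoiser(x, sigma, null_emb)     # F_theta(x; sigma(t))
    cond = denoiser(x, sigma, caption_emb)    # F_theta(x; sigma(t), c)
    return uncond + w * (cond - uncond)
```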

Latent diffusion models. In contrast to modeling the higher dimensional pixel space ($\mathbb{R}^{h\times w\times 3}$), latent diffusion models (Rombach et al., 2022) are trained in a much lower dimensional compressed latent space ($\mathbb{R}^{\frac{h}{n}\times\frac{w}{n}\times c}$), where $n$ is the compression factor and $c$ is the number of channels in the latent space. The image-to-latent encoding and its reverse mapping are performed by a variational autoencoder. Latent diffusion models achieve faster convergence than training in pixel space and have been heavily adopted in large-scale diffusion models (Podell et al., 2024; Betker et al., 2023; Chen et al., 2023; Balaji et al., 2022).

Diffusion transformers. We consider that the transformer model ($F_{\theta}$) consists of $k$ hierarchically stacked transformer blocks. Each block consists of a multi-head self-attention layer, a multi-head cross-attention layer, and a feed-forward layer. For simplicity, we assume that all images are converted to a sequence of patches to be compatible with the transformer architecture. In the transformer architecture, we refer to the width of all linear layers as $d$.

2.2 Common approaches towards minimizing the computational cost of training diffusion transformers

Assuming transformer architectures where linear layers account for the majority of the computational cost, the total training cost is proportional to $M\times N\times S$, where $M$ is the number of samples processed, $N$ is the number of model parameters, and $S$ is the number of patches derived from one image, i.e., the sequence size for the transformer input (Figure 2a). As our goal is to train an open-world model, we believe that a large number of parameters is required to support diverse concepts during generation; thus, we opt not to decrease the parameter count significantly. Since diffusion models converge slowly and are often trained for a large number of steps, even for small-scale datasets, we keep this choice intact as well (Rombach et al., 2022; Balaji et al., 2022). Instead, we aim to exploit any redundancy in the input sequence to reduce the computational cost while not drastically degrading performance.
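To make the role of $S$ concrete with the latent configuration used later in our ablations (a $32\times 32$ latent, e.g., a $256\times 256$ image under $8\times$ VAE compression, with patch size $p=2$):

$$S=\left(\tfrac{32}{2}\right)^{2}=256 \text{ patches}, \qquad S_{\text{eff}}=(1-0.75)\times 256=64 \text{ patches at a 75\% masking ratio},$$

so the per-sample cost of the backbone transformer drops roughly in proportion to the retained sequence length (and faster for attention terms that scale quadratically in $S$).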

A. Using a larger patch size (Figure 2b). In a vision transformer, the input image is first transformed into a sequence of non-overlapping patches with resolution $p\times p$, where $p$ is the patch size (Dosovitskiy et al., 2020). Though using a larger patch size quadratically reduces the number of patches per image, it can significantly degrade performance due to the aggressive compression of larger regions of the image into a single patch.

B. Using patch masking (Figure 2c). A contemporary approach is to keep the patch size intact but drop a large fraction of patches at the input layer of the transformer (He et al., 2022). Naive token masking is similar to training on random crops in convolutional UNets, where upsampling diffusion models are often trained on smaller random crops rather than the full image (Nichol et al., 2021; Saharia et al., 2022). However, in contrast to random crops of the images, patch masking allows training on non-continuous regions of the image (we observe consistent degradation in the quality of image generation with block masking, i.e., retaining continuous regions of the image (Appendix C); thus, we argue that random patch masking in transformers offers a more powerful masking paradigm than random image crops in a convolutional network under identical computational cost). Due to its effectiveness, masking-based training of transformers has been adopted across both vision and language domains (Devlin et al., 2018; He et al., 2022).
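A minimal sketch of such naive random patch masking, applied to the token sequence before the backbone transformer, is shown below; the keep-index convention and tensor layout are illustrative assumptions.

```python
import torch

def random_patch_mask(tokens, mask_ratio=0.75):
    """Randomly keep a (1 - mask_ratio) fraction of patch tokens.

    tokens: (batch, seq_len, dim) patch embeddings.
    Returns the retained tokens and the indices that were kept.
    """
    b, s, d = tokens.shape
    num_keep = int(s * (1.0 - mask_ratio))
    # Independent random permutation of patch indices per sample.
    keep_idx = torch.rand(b, s, device=tokens.device).argsort(dim=1)[:, :num_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx
```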

C. Masking patches with autoencoding (MaskDiT - Figure 2d). In order to also encourage representation learning from masked patches, Zheng et al. (2024) adds an auxiliary autoencoding loss that encourages the reconstruction of masked patches. This formulation was motivated by the success of the autoencoding loss in learning self-supervised representations (He et al., 2022). MaskDiT first appends a lightweight decoder, a small-scale transformer network, to the larger backbone transformer. Before the decoder, it expands the backbone transformer output by inserting spurious learnable patches in place of masked patches.

Assuming that the binary mask $m$ represents the masked patches, the final training loss is the following,

$$\mathcal{L}_{\text{diff}}=\mathbb{E}_{({\boldsymbol{x}},{\boldsymbol{c}})\sim\mathcal{D}}\,\mathbb{E}_{{\boldsymbol{\epsilon}}\sim\mathcal{N}(\boldsymbol{0},\sigma(t)^{2}\mathbf{I})}\left\|\left(\bar{F}_{\theta}(({\boldsymbol{x}}+{\boldsymbol{\epsilon}})\odot(1-m);\sigma,{\boldsymbol{c}})-{\boldsymbol{x}}\right)\odot(1-m)\right\|_{2}^{2} \qquad (5)$$
$$\mathcal{L}_{\text{mae}}=\mathbb{E}_{({\boldsymbol{x}},{\boldsymbol{c}})\sim\mathcal{D}}\,\mathbb{E}_{{\boldsymbol{\epsilon}}\sim\mathcal{N}(\boldsymbol{0},\sigma(t)^{2}\mathbf{I})}\left\|\left(\bar{F}_{\theta}(({\boldsymbol{x}}+{\boldsymbol{\epsilon}})\odot(1-m);\sigma,{\boldsymbol{c}})-({\boldsymbol{x}}+{\boldsymbol{\epsilon}})\right)\odot m\right\|_{2}^{2} \qquad (6)$$
$$\mathcal{L}=\mathcal{L}_{\text{diff}}+\gamma\,\mathcal{L}_{\text{mae}} \qquad (7)$$

where $\bar{F}_{\theta}$ represents the sequential backbone transformer and decoder module, and $\gamma$ is a hyperparameter to balance the influence of the masked autoencoding loss with respect to the diffusion training loss.
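The two MaskDiT terms can be combined as in the sketch below, assuming $\bar{F}_{\theta}$ returns per-patch predictions for the full sequence after masked positions are filled with the learnable placeholder patches; tensor shapes and names are illustrative, and $\gamma$ is the balancing hyperparameter from Eq. 7.

```python
def maskdit_loss(pred, x_clean, x_noisy, mask, gamma):
    """MaskDiT objective (Eq. 7): diffusion loss on visible patches (Eq. 5) plus a
    masked-autoencoding loss on masked patches (Eq. 6), weighted by gamma.

    pred:    (b, s, d) output of the backbone + decoder for all patch positions.
    x_clean: (b, s, d) clean patch targets x.
    x_noisy: (b, s, d) noisy patch inputs x + eps.
    mask:    (b, s, 1) binary mask m, 1 for masked patches.
    """
    visible = 1.0 - mask
    loss_diff = (((pred - x_clean) * visible) ** 2).mean()
    loss_mae = (((pred - x_noisy) * mask) ** 2).mean()
    return loss_diff + gamma * loss_mae
```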

Figure 2: Compressing patch sequence to reduce computational cost. As the training cost of diffusion transformers is proportional to sequence size, i.e., number of patches, it is desirable to reduce the sequence size without degrading performance. It can be achieved by a) using larger patches, b) naively masking a fraction of patches at random, or c) using MaskDiT (Zheng et al., 2024) that combines naive masking with an additional autoencoding objective. We find all three approaches lead to significant degradation in image generation performance, especially at high masking ratios. To alleviate this issue, we propose a straightforward deferred masking strategy, where we mask patches after they are processed by a patch-mixer. Our approach is analogous to naive masking in all aspects except the use of the patch-mixer. In comparison to MaskDiT, our approach doesn’t require optimizing any surrogate objectives and has nearly identical computational costs.

2.3 Alleviating performance bottleneck in patch masking using deferred masking

To drastically reduce the computational cost, patch masking requires dropping a large fraction of input patches before they are fed into the backbone transformer, thus making the information from masked patches unavailable to the transformer. High masking ratios, e.g., 75% masking, significantly degrade the overall performance of the transformer. Even with MaskDiT, we only observe a marginal improvement over naive masking, as this approach also drops the majority of image patches at the input layer itself.

Deferred masking to retain semantic information of all patches. As high masking ratios remove the majority of valuable learning signals from images, we are motivated to ask whether it is necessary to mask at the input layer. As long as the computational cost is unchanged, it is merely a design choice and not a fundamental constraint. In fact, we uncover a significantly better masking strategy, with nearly identical cost to the existing MaskDiT approach. As the patches are derived from non-overlapping image regions in diffusion transformers (Peebles & Xie, 2023; Dosovitskiy et al., 2020), each patch embedding does not embed any information about other patches in the image. Thus, we aim to preprocess the patch embeddings before masking, such that non-masked patches can embed information about the whole image. We refer to the preprocessing module as patch-mixer.

Training diffusion transformers with a patch-mixer. We consider a patch mixer to be any neural architecture that can fuse the embeddings of individual patches. In transformer models, this objective is naturally achieved with a combination of attention and feedforward layers. So we use a lightweight transformer comprising only a few layers as our patch mixer. We mask the input sequence tokens after they have been processed by the patch mixer (Figure 2e). Assuming a binary mask $m$ for masking, we train our model using the following loss function,

$$\mathcal{L}=\mathbb{E}_{({\boldsymbol{x}},{\boldsymbol{c}})\sim\mathcal{D}}\,\mathbb{E}_{{\boldsymbol{\epsilon}}\sim\mathcal{N}(\boldsymbol{0},\sigma(t)^{2}\mathbf{I})}\left\|F_{\theta}(M_{\phi}({\boldsymbol{x}}+{\boldsymbol{\epsilon}};\sigma(t),{\boldsymbol{c}})\odot(1-m);\sigma(t),{\boldsymbol{c}})-{\boldsymbol{x}}\odot(1-m)\right\|_{2}^{2} \qquad (8)$$

where $M_{\phi}$ is the patch-mixer model and $F_{\theta}$ is the backbone transformer. Note that compared to MaskDiT, our approach also simplifies the overall design by not requiring an additional loss function and the corresponding hyperparameter tuning between two losses during training. We don't mask any patches during inference.
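Putting Eq. 8 together, one training step with deferred masking reduces to: run the patch-mixer on all patches, drop the masked ones, and compute the denoising loss only on the retained positions. A minimal sketch follows, reusing the hypothetical random_patch_mask helper sketched in Section 2.2; the module interfaces are illustrative, and in practice noise is added in latent space before patchification, with positional information handled inside the models.

```python
import torch

def deferred_masking_loss(patch_mixer, backbone, x_patches, sigma, caption_emb,
                          mask_ratio=0.75):
    """Deferred masking objective (Eq. 8).

    x_patches: (b, s, d) clean patch embeddings of the latent image.
    sigma:     per-sample noise level, broadcastable to x_patches.
    """
    noisy = x_patches + torch.randn_like(x_patches) * sigma          # x + eps
    # 1) The patch-mixer M_phi sees *all* patches, so every retained token can
    #    carry semantic information about the whole image.
    mixed = patch_mixer(noisy, sigma, caption_emb)
    # 2) Mask only after mixing: keep a (1 - mask_ratio) fraction of the tokens.
    kept, keep_idx = random_patch_mask(mixed, mask_ratio)
    # 3) The backbone F_theta denoises the retained tokens only.
    pred = backbone(kept, sigma, caption_emb)
    # 4) The loss is computed only on the retained (unmasked) patch positions.
    target = torch.gather(
        x_patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x_patches.shape[-1]))
    return ((pred - target) ** 2).mean()
```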

Unmasked finetuning. As a very high masking ratio can significantly reduce the ability of diffusion models to learn the global structure of images and introduce a train-test distribution shift in sequence size, we consider a small degree of unmasked finetuning after the masked pretraining. The finetuning can also mitigate any undesirable generation artifacts due to the use of patch masking. Thus, it has been crucial in previous work (Zheng et al., 2024) to recover from the sharp performance drop caused by masking, especially when classifier-free guidance is used in sampling. However, we don't find it completely necessary, as even with masked pretraining alone, our approach achieves performance comparable to the baseline unmasked pretraining. We only employ it in large-scale training to mitigate any unknown-unknown generation artifacts from a high degree of patch masking.

Improving backbone transformer architecture with mixture-of-experts (MoE) and layer-wise scaling. We also leverage innovations in transformer architecture design to boost the performance of our model under computational constraints. We use mixture-of-experts layers as they increase the parameters and expressive power of our model without significantly increasing the training cost (Shazeer et al., 2017; Fedus et al., 2022). We use a simplified MoE layer based on expert-choice routing (Zhou et al., 2022), where each expert determines the tokens routed to it, as it doesn't require any additional auxiliary loss functions to balance the load across experts (Shazeer et al., 2017; Zoph et al., 2022). We also consider a layer-wise scaling approach, which has recently been shown to outperform canonical transformers in large language models (Mehta et al., 2024). It linearly increases the width, i.e., the hidden layer dimension in attention and feedforward layers, of transformer blocks. Thus, deeper layers in the network are assigned more parameters than earlier layers. We argue that since deeper layers in visual models tend to learn more complex features (Zeiler & Fergus, 2014), allocating more parameters to deeper layers should lead to better performance. We provide background details on both techniques in Appendix B. We describe the overall architecture of our diffusion transformer in Figure 3.
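A sketch of an expert-choice-routed MoE feedforward layer is shown below; the expert architecture, capacity factor, and gating details are illustrative assumptions rather than our exact configuration (layer-wise scaling, in contrast, only changes the hidden width assigned to each block and requires no routing).

```python
import torch
import torch.nn as nn


class ExpertChoiceMoE(nn.Module):
    """Sketch of a mixture-of-experts feedforward layer with expert-choice routing:
    each expert selects the tokens it processes, so no auxiliary load-balancing
    loss is required."""

    def __init__(self, dim, num_experts=8, capacity_factor=2.0):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.capacity_factor = capacity_factor

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, s, d = x.shape
        tokens = x.reshape(b * s, d)
        scores = self.router(tokens).softmax(dim=-1)        # token-expert affinities
        capacity = int(self.capacity_factor * b * s / len(self.experts))
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Each expert picks its own top-`capacity` tokens by affinity score.
            topk = scores[:, e].topk(capacity).indices
            gate = scores[topk, e].unsqueeze(-1)
            out[topk] += gate * expert(tokens[topk])
        return out.reshape(b, s, d)
```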

Figure 3: Overall architecture of our diffusion transformer. We prepend the backbone transformer model with a lightweight patch-mixer that operates on all patches in the input image before they are masked. Following contemporary works (Betker et al., 2023; Esser et al., 2024), we process the caption embeddings using an attention layer before using them for conditioning. We use sinusoidal embeddings for timesteps. Our model only denoises unmasked patches, thus the diffusion loss (Eq. 3) is calculated only on these patches. We modify the backbone transformer using layer-wise scaling on individual layers and use mixture-of-expert layers in alternate transformer blocks.

3 Experimental Setup

We provide key details in our experimental setup below and additional details in Appendix A.

We use two variants of the Diffusion Transformer (DiT) architecture from Peebles & Xie (2023): DiT-Tiny/2 and DiT-Xl/2, both with a patch size of 2. When training sparse models, we replace every alternate transformer block with an 8-expert mixture-of-experts block. We train all models using the AdamW optimizer with cosine learning rate decay and high weight decay. We provide an exhaustive list of our training hyperparameters in Table 5 in Appendix A. We use a four-channel variational autoencoder (VAE) from the Stable-Diffusion-XL (Podell et al., 2024) model to extract image latents. We also consider the latest 16-channel VAEs to test their performance in large-scale micro-budget training (Esser et al., 2024). We use the EDM framework from Karras et al. (2022) as a unified training setup for all diffusion models. We refer to our diffusion transformer as MicroDiT. We use Fréchet inception distance (FID) (Heusel et al., 2017), both with the original Inception-v3 model (Szegedy et al., 2016) and a CLIP model (Radford et al., 2021), and CLIP score (Hessel et al., 2021) to measure the performance of the image generation models.

Choice of text encoder. To convert natural language captions to higher-dimensional feature embeddings, we use the text encoder from state-of-the-art CLIP models (Ilharco et al., 2021; Fang et al., 2023); specifically, we use the DFN-5B text encoder: https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378. We favor CLIP models over T5-xxl (Raffel et al., 2020; Saharia et al., 2022) for text encoding, despite the better performance of the latter on challenging tasks like text synthesis, as the latter incurs much higher computational and storage costs at scale and is thus not suitable for micro-budget training (Table 6 in Appendix A). We consider the computation of text embeddings for captions and the latent compression of images as a one-time cost that amortizes over multiple training runs of a diffusion model. Thus, we compute them offline and do not account for these costs in our estimation of training costs. We use a $30/hour cost estimate for an 8×H100 GPU node (https://cloud-gpus.com/), equivalent to $12/hour for an 8×A100 GPU node, to measure the economical cost.
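Under this pricing, the headline training cost follows directly from the wall-clock time (the small gap to the reported $1,890 presumably comes from rounding the training time to 2.6 days):

$$2.6 \text{ days}\times 24 \text{ h/day}\times \$30/\text{h}\approx \$1{,}872.$$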

Training datasets. We use the following three real image datasets comprising 22M image-text pairs: Conceptual Captions (CC12M) (Changpinyo et al., 2021), Segment Anything (SA1B) (Kirillov et al., 2023), and TextCaps (Sidorov et al., 2020). As SA1B does not provide real captions, we use synthetic captions generated from an LLaVA model (Liu et al., 2023; Chen et al., 2023). We also add two synthetic image datasets comprising 15M image-text pairs in large-scale training: JourneyDB (Sun et al., 2024) and DiffusionDB (Wang et al., 2022). For small-scale ablations, we construct a text-to-image dataset, named cifar-captions, by subsampling images of the ten CIFAR-10 classes from the larger COYO-700M (Byeon et al., 2022) dataset. Thus, cifar-captions serves as a drop-in replacement for any image-text dataset and simultaneously enables fast convergence due to being closed-domain. We provide additional details on the construction of this dataset in Appendix A and sample images in Figure 28.

4 Evaluating Effectiveness of Deferred Masking

In this section, we validate the effectiveness of our deferred masking approach and investigate the impact of individual design choices on image generation performance. We use the DiT-Tiny/2 model and the cifar-captions dataset (256×256 image resolution) for all experiments in this section. We train each model for 60K optimization steps, with the AdamW optimizer and an exponential moving average with a 0.995 smoothing coefficient enabled for the last 10K steps.

4.1 Out-of-the-box performance: Making high masking ratios feasible with deferred masking

A key limitation of patch masking strategies is poor performance when a large fraction of patches is masked. For example, Zheng et al. (2024) observed large degradation in even MaskDiT performance beyond a 50% masking ratio. Before optimizing the performance of our approach with a rigorous ablation study, we evaluate its out-of-the-box performance with common training parameters for up to 87.5% masking ratios. As a baseline, we train a network with no patch-mixer, i.e., naive masking (Figure 2c), for each masking ratio. For deferred masking, we use a four-transformer-block patch-mixer that comprises less than 10% of the parameters of the backbone transformer.

We use commonly used default hyperparameters to simulate the out-of-the-box performance for both our approach and the baseline. We train both models with the AdamW optimizer with an identical learning rate of $1.6\times 10^{-4}$, $0.01$ weight decay, and a cosine learning rate schedule. We set $(P_{\text{mean}},P_{\text{std}})$ to $(-1.2,1.2)$ following the original work (Karras et al., 2022). We provide our results in Figure 4. Our deferred masking approach achieves much better performance across all three performance metrics. Furthermore, the performance gaps widen with an increase in masking ratio. For example, with a 75% masking ratio, naive masking degrades the FID score to 16.5 while our approach achieves 5.03, much closer to the FID score of 3.79 with no masking.

Figure 4: Out-of-the-box performance of deferred masking. Without any hyperparameter optimization, we compare the performance of our deferred masking with a naive masking strategy. We find that deferred masking, i.e., using a patch-mixer before naive masking, tremendously improves image generation performance, particularly at high masking ratios.

4.2 Ablation study of our training pipeline

We ablate design choices across masking, patch mixer, and training hyperparameters. We summarize the techniques that improved out-of-the-box performance in Table 5(a). We provide supplementary results and discussion of the ablation study in Appendix C.

Comparison with training hyperparameters of LLMs. Since diffusion transformer architectures are very similar to the transformers used in large language models (LLMs), we compare the hyperparameter choices across the two tasks. Similar to common LLM training setups (Touvron et al., 2023; Jiang et al., 2023; Chowdhery et al., 2022), we find that SwiGLU activation (Shazeer, 2020) in the feedforward layer outperforms the GELU (Hendrycks & Gimpel, 2016) activation function. Similarly, higher weight decay leads to better image generation performance. However, we observe better performance when using a higher running average coefficient for the AdamW second moment ($\beta_{2}$), in contrast to large-scale LLMs where $\beta_{2}\approx 0.95$ is preferred. As we use a small number of training steps, we find that increasing the learning rate to the largest value that does not cause training instabilities also significantly improves image generation performance.
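For reference, the SwiGLU feedforward block gates one linear projection with the SiLU of another before projecting back down; a minimal sketch is given below, with the hidden width left as a parameter since it is an architecture-specific choice.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feedforward block: down( SiLU(x W_gate) * (x W_up) )."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```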

Design choices in masking and patch-mixer. We observe a consistent improvement in performance with a larger patch mixer. However, despite the better performance of larger patch-mixers, we choose to use a small patch mixer to lower the computational budget spent by the patch mixer in processing the unmasked input. We also update the noise distribution $(P_{\text{mean}},P_{\text{std}})$ to $(-0.6,1.2)$ as it improves the alignment between captions and generated images. By default, we mask each patch at random. We ablate the masking strategy to retain more continuous regions of patches using block masking (Figure 5(b)). However, we find that increasing the block size leads to a degradation in the performance of our approach, thus we retain the original strategy of masking each patch at random.
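The block-masking variant evaluated in Figure 5(b) only changes how the mask is drawn: patches are dropped in contiguous b×b blocks rather than independently. A sketch of such a mask generator over the patch grid follows; the layout and conventions are illustrative assumptions.

```python
import torch

def block_patch_mask(grid_h, grid_w, mask_ratio=0.75, block_size=1):
    """Binary mask over a (grid_h x grid_w) patch grid, drawn at the granularity
    of block_size x block_size patch blocks (block_size=1 recovers fully random
    per-patch masking). 1 = masked, 0 = kept."""
    bh, bw = grid_h // block_size, grid_w // block_size
    num_blocks = bh * bw
    num_masked = int(num_blocks * mask_ratio)
    block_mask = torch.zeros(num_blocks)
    block_mask[torch.randperm(num_blocks)[:num_masked]] = 1.0
    # Upsample the block-level decisions back to the per-patch grid.
    block_mask = block_mask.reshape(1, 1, bh, bw)
    patch_mask = torch.nn.functional.interpolate(
        block_mask, scale_factor=block_size, mode="nearest")
    return patch_mask.reshape(grid_h, grid_w)
```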

                      FID (↓)   Clip-FID (↓)   Clip-score (↑)
Out-of-box            8.82      5.03           26.72
Higher weight decay   8.38      4.90           27.00
Sigma distribution    8.49      4.93           27.47
Larger patch-mixer    7.40      4.57           27.79
Higher learning rate  7.09      4.10           28.24
(a) Summarizing the factors that led to improved performance during our ablation study. We conduct this ablation study at a 75% masking ratio. Detailed results are provided in Table 11 in Appendix C.
(b) Is block masking of patches more effective? We retain a continuous region of image patches using block masking to test its impact on image generation performance. In this figure, we illustrate the masks for a 75% masking ratio with varying block sizes. By default, we mask each patch randomly, i.e., block size ($b$) = 1.

4.3 Validating improvements in diffusion transformer architecture

We measure the effectiveness of the two modifications in transformer architecture in improving image generation performance.

Table 1: Layer-wise scaling of transformer architecture is a better fit for masked training in diffusion transformers. We validate its effectiveness in the canonical naive masking setting with a 75% masking ratio.
Arch                  FID (↓)   Clip-FID (↓)   Clip-score (↑)
Constant width        19.6      9.9            26.7
Layer-wise scaling    15.9      7.4            27.1

Layer-wise scaling. We investigate the impact of this design choice in a standalone experiment where we train two variants of a DiT-Tiny architecture, one with a constant-width transformer and the other with layer-wise scaling of each transformer block. We use naive masking for both methods. We select the width of the constant-width transformer such that its computational footprint is identical to layer-wise scaling. We train both models for an identical number of training steps and wall-clock time. We find that the layer-wise scaling approach outperforms the baseline constant-width approach across all three performance metrics (Table 1), demonstrating that layer-wise scaling is a better fit for masked training of diffusion transformers.

Mixture-of-experts layers. We train a DiT-Tiny/2 transformer with expert-choice-routing-based mixture-of-experts (MoE) layers in each alternate transformer block. In the small-scale setup, it achieves performance similar to the baseline model trained without MoE blocks. While it slightly improves the Clip-score from 28.11 to 28.66, it degrades the FID score from 6.92 to 6.98. We hypothesize that the lack of improvement is due to the small number of training steps (60K), as using multiple experts effectively reduces the number of samples observed by each expert. In top-2 routing with 8 experts, each expert is effectively trained for one fourth the number of epochs over the dataset. We observe a significant improvement in performance with MoE blocks when training at scale with longer training schedules (Section 6.2).

Figure 6: Comparing the performance of patch masking strategies. Using a lightweight patch-mixer before patch masking in our deferred masking approach significantly improves image generation performance over baseline masking strategies. Our approach incurs nearly identical training cost to the MaskDiT (Zheng et al., 2024) baseline. However, both approaches incur slightly higher cost than naive masking due to the use of an additional lightweight transformer along with the backbone diffusion transformer.

5 Comparing the Effectiveness of Deferred Masking with Baselines

In this section, we provide a concrete comparison with multiple baselines. First, we compare with techniques aimed at reducing computational cost by reducing the transformer patch sequence size. Next, we consider reducing the network size, i.e., downscaling, while keeping the patch sequence intact. Under an isoflops budget, i.e., identical wall-clock training time and FLOPs, we show that our approach achieves better performance than model downscaling.

Comparing different masking strategies. We first compare our approach with the strategy of using larger patches. We increase the patch size from two to four, equivalent to 75% patch masking. However, it underperforms compared to deferred masking and only achieves 9.38, 6.31, and 26.70 FID, Clip-FID, and Clip-score, respectively. In contrast, deferred masking achieves 7.09, 4.10, and 28.24 FID, Clip-FID, and Clip-score, respectively. Next, we compare all three masking strategies (Figure 2). We train each model for 60K steps and report the total training FLOPs for each approach to indicate its computational cost. We find that naive masking significantly degrades model performance, and using additional proxy loss functions (Zheng et al., 2024) only marginally improves performance. We find that simply deferring the masking and using a small transformer as a patch-mixer significantly bridges the gap between masked and unmasked training of diffusion transformers.

IsoFLOPs study - Why bother using masking instead of smaller transformers? Using patch masking is a complementary approach to downscaling transformer size to reduce the computational cost of training. We compare the two approaches in Table 7(a). At each masking ratio, we reduce the size of the unmasked network to match the FLOPs and total wall-clock training time of masked training. We train both networks for 60K training steps. Up to a masking ratio of 75%, we find that deferred masking outperforms network downscaling on at least two out of three metrics. However, at extremely high masking ratios, deferred masking tends to achieve lower performance. This is likely because the information loss from masking is too high at these ratios. For example, for a 32×32 resolution latent and a patch size of two, only 32 patches are retained (out of 256 patches) at an 87.5% masking ratio.

Deferred masking as pretraining + unmasked finetuning. We show that deferred masking also acts as a strong pretraining objective: combining it with an unmasked finetuning schedule achieves better performance than training an unmasked network under an identical computational budget. We first train a network without any masking and another identical network with 75% deferred masking. We finetune the latter with no masking and measure performance as we increase the number of finetuning steps. We mark the isoflops threshold, where the combined cost of masked pretraining and unmasked finetuning equals that of the model trained with no masking. We find that at the isoflops threshold, the finetuned model achieves superior performance across all three performance metrics. The performance of the model also continues to improve with unmasked finetuning beyond the isoflops threshold.
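The isoflops threshold used in this comparison can be written down directly: it is the number of unmasked finetuning steps at which masked pretraining plus unmasked finetuning has consumed the same FLOPs as fully unmasked training. The snippet below is a back-of-the-envelope sketch with hypothetical per-step costs (a backbone with 75% masking is assumed to cost roughly a quarter of an unmasked step, ignoring the small patch-mixer overhead); the numbers are illustrative, not measured values from our runs.

```python
def isoflops_finetune_steps(flops_masked_step: float,
                            flops_unmasked_step: float,
                            pretrain_steps: int,
                            total_unmasked_steps: int) -> int:
    """Unmasked finetuning steps N_ft such that
    pretrain_steps * flops_masked_step + N_ft * flops_unmasked_step
    equals total_unmasked_steps * flops_unmasked_step."""
    budget = total_unmasked_steps * flops_unmasked_step
    spent = pretrain_steps * flops_masked_step
    return max(0, int((budget - spent) / flops_unmasked_step))

# Hypothetical example: per-step cost of masked training is ~0.25x the
# unmasked cost when 75% of the patches are dropped before the backbone.
n_ft = isoflops_finetune_steps(flops_masked_step=0.25,
                               flops_unmasked_step=1.0,
                               pretrain_steps=60_000,
                               total_unmasked_steps=60_000)
print(n_ft)  # 45000 unmasked finetuning steps reach the isoflops threshold
```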

Method | Masking ratio | FID | Clip-FID | Clip-score
Downscaled network | - | 6.60 | 3.85 | 28.49
Deferred masking | 0.5 | 6.74 | 3.45 | 28.86
Downscaled network | - | 7.09 | 4.56 | 27.68
Deferred masking | 0.75 | 6.96 | 4.17 | 28.16
Downscaled network | - | 7.52 | 5.03 | 26.98
Deferred masking | 0.875 | 8.35 | 5.26 | 27.00
(a) IsoFLOPs training. Under identical training setup and wall-clock time, we compare the effectiveness of model downscaling and our deferred masking approaches. We find that except at extremely high masking ratios, deferred masking achieves better performance across at least two performance metrics. Based on this finding, we do not use a masking ratio higher than 75% in our models.
(b) Unmasked finetuning. After pretraining with 75% deferred masking, we increase the number of unmasked finetuning steps and compare its performance with another model trained completely without masking. The shaded region represents steps after the isoflops threshold, where deferred masking pretraining and finetuning have higher computation cost. At the isoflops threshold, the finetuned model achieves better performance than training the diffusion model without any masking.
Figure 7: Deferred masking vs. model downscaling to reduce training cost. Both patch masking and model downscaling, i.e., reducing the size of the model, reduce the computational cost of training and can be complementary to each other. However, it is natural to ask how the two paradigms compare when training diffusion transformers. In figure (a), we show that reducing the model size, and thus the total computational budget of training, achieves inferior performance compared to our approach at masking ratios of up to 75%. Using a patch-mixer in deferred masking effectively enables our model to observe all patches in the input image while reducing the effective number of patches processed per image. This result encourages using a larger model with masking rather than training a smaller model with no masking at an equivalent computational cost. In figure (b), we further highlight the advantage of using deferred masking with finetuning over training a model with no masking for a given computational budget.
Table 2: Computational cost of the two-stage training of our large-scale model. Our total computational cost is 3.45×10^20 FLOPs, amounting to a total cost of $1,890 and 2.6 training days on an 8×H100 GPU machine.
Resolution | Masking ratio | Training steps | Total FLOPs | 8×A100 days | 8×H100 days | Cost ($)
256×256 | 0.75 | 250,000 | 1.47×10^20 | 2.77 | 1.11 | 800
256×256 | 0.00 | 30,000 | 4.53×10^19 | 0.94 | 0.38 | 271
512×512 | 0.75 | 50,000 | 1.18×10^20 | 2.18 | 0.88 | 630
512×512 | 0.00 | 5,000 | 3.48×10^19 | 0.65 | 0.26 | 189

6 Micro-budget Training of Large-scale Models

In this section, we validate the effectiveness of our approach in training open-world text-to-image generative models on a micro-budget. Moving beyond the small-scale cifar-captions dataset, we train the large-scale model on a mixture of real and synthetic image datasets comprising 37M images. We demonstrate the high-quality image generation capability of our model and compare its performance with previous works across multiple quantitative metrics.

Training process. We train a DiT-XL/2 transformer with eight experts in alternate transformer blocks. We provide details on training hyperparameters in Table 5 in Appendix A. We conduct the training in the following two phases. We refer to any diffusion transformer trained using our micro-budget training as MicroDiT.

  • Phase-1: In this phase, we pretrain the model on 256×256 resolution images. We train for 250K optimization steps with 75% patch masking and finetune for another 30K steps without any patch masking.

  • Phase-2: In this phase, we finetune the Phase-1 model on 512×512 resolution images. We first finetune the model for 50K steps with 75% patch masking, followed by another 5K optimization steps with no patch masking.

Computational cost. We provide a breakdown of the computational cost, including both training FLOPs and economical cost, across each training phase in Table 2. Our Phase-1 and Phase-2 training consumes 56% and 44% of the total computational cost, respectively. The total wall-clock training time of our model is 2.6 days on an 8×H100 GPU cluster, equivalent to 6.6 days on an 8×A100 GPU cluster.
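As a quick consistency check on Table 2 above, the per-phase rows can be summed directly; the short snippet below reproduces the reported totals from the table entries (the dollar amounts are taken from the table as reported, not recomputed from a GPU price).

```python
# (resolution, masking ratio, steps, total FLOPs, 8xA100 days, 8xH100 days, cost in $)
phases = [
    ("256x256", 0.75, 250_000, 1.47e20, 2.77, 1.11, 800),
    ("256x256", 0.00,  30_000, 4.53e19, 0.94, 0.38, 271),
    ("512x512", 0.75,  50_000, 1.18e20, 2.18, 0.88, 630),
    ("512x512", 0.00,   5_000, 3.48e19, 0.65, 0.26, 189),
]

total_flops = sum(p[3] for p in phases)      # ~3.45e20 FLOPs
total_h100_days = sum(p[5] for p in phases)  # ~2.6 days on an 8xH100 machine
total_a100_days = sum(p[4] for p in phases)  # ~6.5 days on 8xA100 (reported as 6.6 in the text)
total_cost = sum(p[6] for p in phases)       # $1,890
print(total_flops, total_h100_days, total_a100_days, total_cost)
```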

(a) DrawBench
(b) PartiPrompts
Figure 8: Assessing image quality with GPT-4o. We supply the following prompt to the GPT-4o model: Given the prompt '{prompt}', which image do you prefer, Image A or Image B, considering factors like image details, quality, realism, and aesthetics? Respond with 'A' or 'B' or 'none' if neither is preferred. For each comparison, we also shuffle the order of the images to remove any ordering bias. We evaluate the performance on two prompt databases: DrawBench (Saharia et al., 2022) and PartiPrompts (Yu et al., 2022). The y-axis in the bar plots indicates the percentage of comparisons in which the image from a model is preferred. We break down the comparison across individual image categories in each prompt database.
Figure 9: Synthesized images from a model trained only on real data (on left) and combined real and synthetic data (on right). Real data includes SA1B, CC12M, and TextCaps datasets while synthetic data includes JourneyDB and DiffusionDB datasets.

6.1 Benefit of additional synthetic data in training

Instead of training only on real images, we expand the training data to include 40% additional synthetic images. We compare the performance of two MicroDiT models, one trained only on real images and the other on combined real and synthetic images, using identical training processes.

Under canonical performance metrics, both models apparently achieve similar performance. For example, the model trained on real-only data achieved an FID score of 12.72 and a CLIP score of 26.67, while the model trained on both real and synthetic data achieved an FID score of 12.66 and a CLIP score of 28.14. Even on GenEval (Ghosh et al., 2024), a benchmark that evaluates the ability to generate multiple objects and to model object dynamics in images, both models achieved an identical score of 0.46. These results seemingly suggest that incorporating a large number of synthetic samples did not yield any meaningful improvement in image generation capabilities.

However, we argue that this observation is heavily influenced by the limitations of our existing evaluation metrics. In a qualitative evaluation, we found that the model trained on the combined dataset achieved much better image quality (Figure 9). The real-data model often fails to adhere to the prompt, frequently hallucinating key details and often failing to synthesize the correct object. Metrics such as FID fail to capture this difference because they predominantly measure distribution similarity (Pernias et al., 2024). Thus, we focus on using human visual preference as an evaluation metric. To automate the process, we use GPT-4o (OpenAI, 2024), a state-of-the-art multimodal model, as a proxy for human preference. We supply the following prompt to the model: Given the prompt '{prompt}', which image do you prefer, Image A or Image B, considering factors like image details, quality, realism, and aesthetics? Respond with 'A' or 'B' or 'none' if neither is preferred. For each comparison, we also shuffle the order of the images to remove any ordering bias. We generate samples using DrawBench (Saharia et al., 2022) and PartiPrompts (P2) (Yu et al., 2022), two commonly used prompt databases (Figure 8). On the P2 dataset, images from the combined-data model are preferred in 63% of comparisons while images from the real-data model are preferred only 21% of the time (16% of comparisons resulted in no preference). For the DrawBench dataset, the combined-data model is preferred in 40% of comparisons while the real-data model is preferred in only 21% of comparisons. Overall, using a human-preference-centric metric clearly demonstrates the benefit of additional synthetic data in improving overall image quality.
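A minimal sketch of this pairwise GPT-4o evaluation is given below, assuming the official OpenAI Python client, locally stored PNG files, and base64 image encoding; the helper names and file paths are illustrative assumptions rather than our exact evaluation script. The two details carried over from the protocol above are the fixed prompt template and the random shuffling of image order before each query.

```python
import base64
import random
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

PROMPT = ("Given the prompt '{prompt}', which image do you prefer, Image A or "
          "Image B, considering factors like image details, quality, realism, "
          "and aesthetics? Respond with 'A' or 'B' or 'none' if neither is preferred.")

def as_image_part(path: str) -> dict:
    # Encode a local PNG as a base64 data URL for the vision input.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def compare(prompt: str, img_real_only: str, img_combined: str) -> str:
    # Shuffle which model is shown as Image A to remove ordering bias.
    pair = [("real_only", img_real_only), ("combined", img_combined)]
    random.shuffle(pair)
    content = [{"type": "text", "text": PROMPT.format(prompt=prompt)},
               as_image_part(pair[0][1]), as_image_part(pair[1][1])]
    reply = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    ).choices[0].message.content.strip().lower()
    if reply.startswith("a"):
        return pair[0][0]
    if reply.startswith("b"):
        return pair[1][0]
    return "none"
```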

(a) Measuring fidelity and prompt alignment of generated images on the COCO dataset.
Channels | FID-30K (↓) | Clip-FID-30K (↓) | Clip-score (↑)
4 | 12.65 | 5.96 | 28.14
16 | 13.04 | 6.84 | 25.63
(b) Measuring performance on the GenEval benchmark.
Channels | Overall | Single object | Two objects | Counting | Colors | Position | Color attribution
4 | 0.46 | 0.97 | 0.47 | 0.33 | 0.78 | 0.09 | 0.20
16 | 0.40 | 0.96 | 0.36 | 0.27 | 0.72 | 0.07 | 0.09
Figure 10: We ablate the dimensionality of the latent space by training our MicroDiT models in four- and sixteen-channel latent spaces, respectively. Even though training in a higher-dimensional latent space is being adopted across large-scale models (Esser et al., 2024; Dai et al., 2023), we find that it underperforms when training on a micro-budget.
Model | Params (↓) | Sampling steps (↓) | Open-source | Training images (↓) | 8×A100 GPU days (↓) | FID-30K (↓)
CogView2 (Ding et al., 2022) | 6.00B | - | - | - | - | 24.0
Dall-E (Ramesh et al., 2021) | 12.0B | 256 | - | - | - | 17.89
Glide (Nichol et al., 2021) | 3.50B | 250 | - | - | - | 12.24
Parti-750M (Yu et al., 2022) | 0.75B | 1024 | - | 3690M | - | 10.71
Dall-E.2 (Ramesh et al., 2022) | 6.50B | - | - | 650M | 5208.3 | 10.39
Make-a-Scene (Gafni et al., 2022) | 4.00B | 1024 | - | - | - | 11.84
GigaGAN (Kang et al., 2023) | 1.01B | 1 | - | 980M | 597.8 | 9.09
ImageN (Saharia et al., 2022) | 3.00B | - | - | 860M | 891.5 | 7.27
Parti-20B (Yu et al., 2022) | 20.0B | 1024 | - | 3690M | - | 7.23
eDiff-I (Balaji et al., 2022) | 9.10B | 25 | - | 11470M | - | 6.95
Stable-Diffusion-2.1 (a) (Rombach et al., 2022) | 0.86B | 50 | ✓ | 3900M | 1041.6 | 9.12
Stable-Diffusion-1.5 (Rombach et al., 2022) | 0.86B | 50 | ✓ | 4800M | 781.2 | 11.18
Würstchen (Pernias et al., 2024) | 0.99B | 60 | ✓ | 1420M | 128.1 | 23.60
PixArt-α (Chen et al., 2023) | 0.61B | 20 | ✓ | 25M (b) | 94.1 (c) | 7.32
MicroDiT (our work) | 1.16B | 30 | ✓ | 37M | 6.6 | 12.66
  • (a) As the FID scores for the stable diffusion models are not officially reported (Rombach et al., 2022), we calculate them using the official release of each model. We achieve slightly better FID scores compared to the scores reported in Würstchen (Pernias et al., 2024). We use our FID scores to represent the best performance of these models.
  • (b) Includes 10M proprietary high-quality images.
  • (c) PixArt-α training takes 85 days on an 8×A100 machine when training only up to 512×512 resolution.

Table 3: Zero-shot FID on the COCO2014 validation split (Lin et al., 2014). We report total training time in terms of the number of days required to train the model on a machine with eight A100 GPUs. We observe a 2.5× reduction in training time when using H100 GPUs. Our micro-budget training takes 14.2× less training time than the state-of-the-art low-cost training approach while simultaneously achieving competitive FID compared to some open-source models.
Table 4: Comparing performance on compositional image generation using the GenEval (Ghosh et al., 2024) benchmark. Table adopted from Esser et al. (2024). Higher performance is better.
Model | Open-source | Overall | Single object | Two objects | Counting | Colors | Position | Color attribution
DaLL-E.2 (Ramesh et al., 2022) | - | 0.52 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19
DaLL-E.3 (Betker et al., 2023) | - | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45
minDALL-E (Kim et al., 2021) | ✓ | 0.23 | 0.73 | 0.11 | 0.12 | 0.37 | 0.02 | 0.01
Stable-Diffusion-1.5 (Rombach et al., 2022) | ✓ | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06
PixArt-α (Chen et al., 2023) | ✓ | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07
Stable-Diffusion-2.1 (Rombach et al., 2022) | ✓ | 0.50 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17
Stable-Diffusion-XL (Podell et al., 2024) | ✓ | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23
Stable-Diffusion-XL-Turbo (Sauer et al., 2023) | ✓ | 0.55 | 1.00 | 0.72 | 0.49 | 0.80 | 0.10 | 0.18
IF-XL | ✓ | 0.61 | 0.97 | 0.74 | 0.66 | 0.81 | 0.13 | 0.35
Stable-Diffusion-3 (Esser et al., 2024) | ✓ | 0.68 | 0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43
MicroDiT (our work) | ✓ | 0.46 | 0.97 | 0.47 | 0.33 | 0.78 | 0.09 | 0.20

6.2 Ablating design choices on scale

We first examine the effect of using a higher-dimensional latent space in micro-budget training. We replace the default four-channel autoencoder with one that has sixteen channels, resulting in a 4× higher-dimensional latent space. Recent large-scale models have adopted high-dimensional latent spaces as they provide significant improvements in image generation abilities (Dai et al., 2023; Esser et al., 2024). Note that the autoencoder with more channels itself has superior image reconstruction capabilities, which further contributes to overall success. Intriguingly, we find that using a higher-dimensional latent space in micro-budget training hurts performance. For two MicroDiT models trained with identical computational budgets and training hyperparameters, we find that the model trained in the four-channel latent space achieves better FID, Clip-score, and GenEval scores (Table 10). We hypothesize that even though an increase in latent dimensionality allows better modeling of the data distribution, it also simultaneously increases the training budget required to train higher-quality models. We also ablate the choice of mixture-of-experts (MoE) layers by training another MicroDiT model without them under an identical training setup. We find that using MoE layers improves zero-shot FID-30K from 13.7 to 12.7 on the COCO dataset. Due to the better performance of sparse transformers with MoE layers, we use MoE layers by default in large-scale training.

6.3 Comparison with previous works

Comparing zero-shot image generation on COCO dataset (Table 3). We follow the evaluation methodology in previous work (Saharia et al., 2022; Yu et al., 2022; Balaji et al., 2022) and sample 30K random images and corresponding captions from the COCO 2014 validation set. We generate 30K synthetic images using the captions and measure the distribution similarity between real and synthetic samples using FID (referred to as FID-30K in Table 3). Our model achieves a competitive 12.66 FID-30K, while trained with 14× lower computational cost than the state-of-the-art low-cost training method and without any proprietary or humongous dataset to boost performance. Our approach also outperforms Würstchen (Pernias et al., 2024), a recent low-cost training approach for diffusion models, while requiring 19× lower computational cost.
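A sketch of this FID-30K evaluation loop is shown below, using torchmetrics' FrechetInceptionDistance as one possible implementation (clean-fid, Parmar et al., 2022, is another common choice; the text above does not prescribe a specific library). The generate_images callable and the COCO caption/image loading are placeholders for the model's own sampling and data pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_30k(real_batches, captions, generate_images, device="cuda"):
    """real_batches: iterable of uint8 tensors (B, 3, H, W) from the COCO-2014 val split.
    captions: the 30K captions paired with those images.
    generate_images: callable mapping a list of captions to uint8 image tensors."""
    fid = FrechetInceptionDistance(feature=2048, normalize=False).to(device)

    for batch in real_batches:                      # 30K real reference images
        fid.update(batch.to(device), real=True)

    for i in range(0, len(captions), 64):           # 30K synthetic images
        fake = generate_images(captions[i:i + 64])  # uint8 (B, 3, H, W)
        fid.update(fake.to(device), real=False)

    return fid.compute().item()
```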

Fine-grained image generation comparison (Table 4). We use the GenEval (Ghosh et al., 2024) framework to evaluate compositional image generation properties, such as object positions, co-occurrence, count, and color. Similar to most other models, our model achieves near-perfect accuracy on single-object generation. Across the other tasks, our model performs similarly to early Stable-Diffusion variants, notably outperforming Stable-Diffusion-1.5. On the color attribution task, our model also outperforms the Stable-Diffusion-XL-Turbo and PixArt-α models.

7 Related work

The landscape of diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020a; b) has rapidly evolved in the last few years, with modern models trained on web-scale datasets (Betker et al., 2023; Esser et al., 2024; Rombach et al., 2022; Saharia et al., 2022). In contrast to training in pixel space (Saharia et al., 2022; Nichol et al., 2021), the majority of large-scale diffusion models are trained in a compressed latent space, thus referred to as latent diffusion models (Rombach et al., 2022). Similar to auto-regressive models, transformer architectures (Vaswani et al., 2017) have also been recently adopted for diffusion-based image synthesis. While earlier models commonly adopted a fully convolutional or hybrid UNet network architecture (Dhariwal & Nichol, 2021; Nichol et al., 2021; Rombach et al., 2022; Saharia et al., 2022), recent works have demonstrated that diffusion transformers (Peebles & Xie, 2023) achieve better performance (Esser et al., 2024; Chen et al., 2023; Betker et al., 2023). Thus, we also use diffusion transformers for modeling latent diffusion models.

Since the image captions in web-scale datasets are often noisy and of poor quality (Byeon et al., 2022; Schuhmann et al., 2022), recent works have started to recaption them using vision-language models (Liu et al., 2023; Wang et al., 2023; Li et al., 2024). Using synthetic captions leads to significant improvements in the diffusion models’ image generation capabilities (Betker et al., 2023; Esser et al., 2024; Gokaslan et al., 2024; Chen et al., 2023). While text-to-image generation models are the most common application of diffusion models, they can also support a wide range of other conditioning mechanisms, such as segmentation maps, sketches, or even audio descriptions (Zhang et al., 2023; Yariv et al., 2023). Sampling from diffusion models is also an active area of research, with multiple novel solvers for ODE/SDE sampling formulations to reduce the number of iterations required in sampling without degrading performance (Song et al., 2020a; Jolicoeur-Martineau et al., 2021; Liu et al., 2022; Karras et al., 2022). Furthermore, the latest approaches enable single-step sampling from diffusion models using distillation-based training strategies (Song et al., 2023; Sauer et al., 2023). The sampling process in diffusion models also employs an additional guidance signal to improve prompt alignment, either based on an external classifier (Dhariwal & Nichol, 2021; Nichol et al., 2021; Sehwag et al., 2022) or self-guidance (Ho & Salimans, 2022; Karras et al., 2024). The latter classifier-free guidance approach is widely adopted in large-scale diffusion models and has been further extended to large-language models (Zhao et al., 2024; Sanchez et al., 2023).

Since the training cost of early large-scale diffusion models was noticeably high (Rombach et al., 2022; Ramesh et al., 2022), multiple previous works focused on bringing down this cost. Gokaslan et al. (2024) showed that using common tricks from efficient deep learning can bring the cost of stable-diffusion-2 models under $50K. Chen et al. (2023) also reduced this cost by training a diffusion transformer model on a mixture of openly accessible and proprietary image datasets. Cascaded training of diffusion models is also used by some previous works (Saharia et al., 2022; Pernias et al., 2024; Guo et al., 2024), where multiple diffusion models are employed to sequentially upsample the low-resolution generations from the base diffusion model. A key limitation of cascaded training is the strong influence of the low-resolution base model on overall image fidelity and prompt alignment. Most recently, Pernias et al. (2024) adopted the cascaded training approach (Würstchen) while training the base model in a 42× compressed latent space. Though Würstchen achieves low training cost due to extreme image compression, it also achieves significantly lower image generation performance on the FID evaluation metric. Alternatively, patch masking has been recently adopted as a means to reduce the computational cost of training diffusion transformers (Zheng et al., 2024; Gao et al., 2023), taking inspiration from the success of patch masking in contrastive models (He et al., 2022). Patch masking is straightforward to implement in transformers, and the diffusion transformer successfully generalizes to unmasked images in inference, even when patches for each image were masked randomly during training.

8 Discussion and Future Work

As the overarching objective of our work is to reduce the overhead of training large-scale diffusion models, we not only aim to reduce the training cost but also align design choices in the training setup with this objective. For example, we deliberately choose datasets that are openly accessible and show that using only these datasets suffices to generate high-quality samples, thus omitting the need for proprietary or enormous datasets. We also favor CLIP-based text embeddings which, although underperforming compared to T5 model embeddings (especially in text generation), are much faster to compute and require six times less disk storage.

While we observe competitive image generation performance with micro-budget training, it inherits the limitations of existing text-to-image generative models, especially in text rendering. It is surprising that the model learns to successfully adhere to complex prompts, demonstrating strong generalization to novel concepts, while simultaneously failing to render even simple words, despite the presence of a dedicated optical character recognition dataset in training. Similar to a majority of other open-source models, our model struggles with controlling object positions, count, and color attribution, as benchmarked on the GenEval dataset.

We believe that the lower overhead of large-scale training of diffusion generative models can dramatically enhance our understanding of training dynamics and accelerate research progress in alleviating the shortcomings of these models. It can enable a host of explorations at scale, in directions such as understanding end-to-end training dynamics (Agarwal et al., 2022; Choromanska et al., 2015; Jiang et al., 2020), measuring the impact of dataset choice on performance (Xie et al., 2024), data attribution for generated samples (Zheng et al., 2023; Dai & Gifford, 2023; Park et al., 2023; Koh & Liang, 2017), and adversarial interventions in the training pipeline (Chou et al., 2023; Shafahi et al., 2018). Future work can also focus on alleviating the current limitations of micro-budget training, such as poor text rendering and difficulty capturing intricate relationships between objects in detailed scenes. While we focus mainly on algorithmic efficiency and dataset choices, the training cost can be significantly reduced by further optimizing the software stack, e.g., by adopting native 8-bit precision training (Mellempudi et al., 2019), using dedicated attention kernels for H100 GPUs (Shah et al., 2024), and optimizing data loading speed.

9 Conclusion

In this work, we focus on patch masking strategies to reduce the computational cost of training diffusion transformers. We propose a deferred masking strategy to alleviate the shortcomings of existing masking approaches and demonstrate that it provides significant improvements in performance across all masking ratios. With a deferred masking ratio of 75%, we conduct large-scale training on commonly used real image datasets, combined with additional synthetic images. Despite being trained at an order of magnitude lower cost than the current state-of-the-art, our model achieves competitive zero-shot image generation performance. We hope that our low-cost training mechanism will empower more researchers to participate in the training and development of large-scale diffusion models.

10 Acknowledgements

We would like to thank Peter Stone, Felancarlo Garcia, Yihao Zhan, and Naohiro Adachi for their valuable feedback and support during the project. We also thank Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jiabo Huang, Nidham Gazagnadou, and Vivek Sharma for their constructive feedback throughout the progress of this project.

References

  • Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
  • Agarwal et al. (2022) Chirag Agarwal, Daniel D’souza, and Sara Hooker. Estimating example difficulty using variance of gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10368–10378, 2022.
  • Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv preprint arXiv:2211.01324, 2022. doi: 10.48550/arxiv.2211.01324.
  • Bansal et al. (2024) Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. Advances in Neural Information Processing Systems, 36, 2024.
  • Betker et al. (2023) James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Byeon et al. (2022) Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  • Changpinyo et al. (2021) Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  3558–3568, 2021.
  • Chen et al. (2023) Jun Song Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. arXiv preprint arXiv:2310.00426, 2023. doi: 10.48550/arxiv.2310.00426.
  • Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pp.  192–204. PMLR, 2015.
  • Chou et al. (2023) Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4015–4024, 2023.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arxiv 2022. arXiv preprint arXiv:2204.02311, 10, 2022.
  • Dai et al. (2023) Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  • Dai & Gifford (2023) Zheng Dai and David K Gifford. Training data attribution for diffusion models. arXiv preprint arXiv:2306.02174, 2023.
  • Dehghani et al. (2023) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp.  7480–7512. PMLR, 2023.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Ding et al. (2022) Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  • Fang et al. (2023) Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • Fourier (1822) Jean Baptiste Joseph Fourier. Théorie analytique de la chaleur, volume 504. Didot Paris, 1822.
  • Gafni et al. (2022) Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pp.  89–106. Springer, 2022.
  • Gao et al. (2023) Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.
  • Ghosh et al. (2024) Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • Gokaslan et al. (2024) Aaron Gokaslan, A Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, and Volodymyr Kuleshov. Commoncanvas: Open diffusion models trained on creative-commons images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8250–8260, 2024.
  • Guo et al. (2024) Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, and Bihan Wen. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation, 2024. URL https://arxiv.org/abs/2402.10491.
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  16000–16009, 2022.
  • Hendrycks & Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ilharco et al. (2021) Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Jiang et al. (2020) Yiding Jiang, Pierre Foret, Scott Yak, Daniel M Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, and Behnam Neyshabur. Neurips 2020 competition: Predicting generalization in deep learning. arXiv preprint arXiv:2012.07976, 2020.
  • Jolicoeur-Martineau et al. (2021) Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021.
  • Kang et al. (2023) Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10124–10134, 2023.
  • Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022.
  • Karras et al. (2024) Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. arXiv preprint arXiv:2406.02507, 2024.
  • Kim et al. (2021) Saehoon Kim, Sanghun Cho, Chiheon Kim, Doyup Lee, and Woonhyuk Baek. mindall-e on conceptual captions. https://github.com/kakaobrain/minDALL-E, 2021.
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4015–4026, 2023.
  • Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp.  1885–1894. PMLR, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. (2024) Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, et al. What if we recaption billions of web images with llama-3? arXiv preprint arXiv:2406.08478, 2024.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.  740–755. Springer, 2014.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  • Mehta et al. (2024) Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, et al. Openelm: An efficient language model family with open-source training and inference framework. arXiv preprint arXiv:2404.14619, 2024.
  • Mellempudi et al. (2019) Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul. Mixed precision training with 8-bit floating point. arXiv preprint arXiv:1905.12334, 2019.
  • Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • OpenAI (2024) OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024.
  • Park et al. (2023) Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186, 2023.
  • Parmar et al. (2022) Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11410–11420, 2022.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  • Pernias et al. (2024) Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville. Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models. ICLR, 2024.
  • Podell et al. (2024) Dustin Podell, Zion English, Karen Lacey, Andreas Blattmann, Tim Dockhorn, Judith Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. ICLR, 2024. doi: 10.48550/arxiv.2307.01952.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pp.  8821–8831. Pmlr, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  3505–3506, 2020.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Sanchez et al. (2023) Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. Stay on topic with classifier-free guidance. arXiv preprint arXiv:2306.17806, 2023.
  • Sauer et al. (2023) Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Sehwag et al. (2022) Vikash Sehwag, Caner Hazirbas, Albert Gordo, Firat Ozgenel, and Cristian Canton. Generating high fidelity data from low-density regions using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11492–11501, 2022.
  • Shafahi et al. (2018) Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! targeted clean-label poisoning attacks on neural networks. Advances in neural information processing systems, 31, 2018.
  • Shah et al. (2024) Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608.
  • Shazeer (2020) Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Sidorov et al. (2020) Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp.  742–758. Springer, 2020.
  • Smith (2015) Leslie N Smith. Cyclical learning rates for training neural networks. arXiv preprint at https://arxiv.org/abs/1506.01186, 2015.
  • Smith (2022) Leslie N Smith. General cyclical training of neural networks. arXiv preprint arXiv:2202.08835, 2022.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  • Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ICLR, 2021.
  • Song et al. (2023) Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • Sun et al. (2024) Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • Sutawika et al. (2024) Lintang Sutawika, Aran Komatsuzaki, and Colin Raffel. Pile-t5, 2024. URL https://blog.eleuther.ai/pile-t5/. Blog post.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Team (2022) Mosaic ML Team. streaming. <https://github.com/mosaicml/streaming/>, 2022.
  • Teng et al. (2023) Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Valyaeva (2023) Alina Valyaeva. AI Has Already Created As Many Images As Photographers Have Taken in 150 Years. Statistics for 2023. https://journal.everypixel.com/ai-image-statistics, 2023. [Online; accessed 11-July-2024].
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2023) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  • Wang et al. (2022) Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.14896, 2022.
  • Xie et al. (2024) Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
  • Yariv et al. (2023) Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, and Idan Schwartz. Audiotoken: Adaptation of text-conditioned diffusion models for audio-to-image generation. arXiv preprint arXiv:2305.13050, 2023.
  • Yu et al. (2022) Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.
  • Zeiler & Fergus (2014) Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp.  818–833. Springer, 2014.
  • Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3836–3847, 2023.
  • Zhao et al. (2024) Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. Mitigating object hallucination in large vision-language models via classifier-free guidance. arXiv preprint arXiv:2402.08680, 2024.
  • Zheng et al. (2024) Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=vTBjBtGioE.
  • Zheng et al. (2023) Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, and Min Lin. Intriguing properties of data attribution on diffusion models. arXiv preprint arXiv:2311.00500, 2023.
  • Zhou et al. (2022) Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.
  • Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Appendix A Additional details on experimental setup

We use the DiT-Tiny/2 and DiT-XL/2 diffusion transformer architectures for the small- and large-scale training setups, respectively, with four and six transformer blocks in the patch-mixer. The patch-mixer comprises approximately 13% of the parameters of the backbone diffusion transformer. We use half-precision layer normalization and SwiGLU activations in the feedforward layers of all transformers. Initially, half-precision layer normalization led to training instabilities after 100K steps of training; we therefore additionally apply layer normalization to the query and key embeddings in the attention layers (Dehghani et al., 2023), which stabilizes training. We halve the learning rate for expert layers, as each expert now processes only a fraction of all patches. We provide an exhaustive list of our training hyperparameters in Table 5. We use the default configuration ($\sigma_{\text{max}}=80$, $\sigma_{\text{min}}=0.002$, $S_{\text{max}}=\infty$, $S_{\text{noise}}=1$, $S_{\text{min}}=0$, $S_{\text{churn}}=0$), except that we increase $\sigma_{\text{data}}$ to match the standard deviation of our image datasets in latent space. We generate images using deterministic sampling with Heun's 2nd-order method (Karras et al., 2022) and 30 sampling steps. Unless specified otherwise, we use a classifier-free guidance scale of 3 and sample 30K images for quantitative evaluation on the cifar-captions dataset. We reduce the guidance scale to 1.5 for large-scale models. We find that these guidance values achieve the best FID score under the FID-Clip-score tradeoff. For all qualitative generations, we recommend a guidance scale of 5 for better photorealism and prompt adherence. By default, we use 512×512 pixel resolution when generating synthetic images from the large-scale models and 256×256 pixel resolution with the small-scale model trained on the cifar-captions dataset.
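For concreteness, the snippet below is a minimal sketch of the query-key normalization described above: per-head layer normalization applied to queries and keys before attention. The module and its dimensions are illustrative, not our released model code.

```python
# Minimal sketch of QK-normalized attention used to stabilize bf16 training;
# module name and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # LayerNorm over the head dimension of queries and keys (Dehghani et al., 2023)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, N, head_dim)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # normalization step that stabilizes training
        out = F.scaled_dot_product_attention(q, k, v)  # uses flash-attention-2 kernel when available
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```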

We conduct all experiments on an 8×H100 GPU node. We train and infer all models using bfloat16 mixed-precision mode. We did not observe a significant speedup when using FP8 precision with the Transformer Engine library (https://github.com/NVIDIA/TransformerEngine). We use PyTorch 2.3.0 (Paszke et al., 2019) with the PyTorch-native flash-attention-2 implementation. We use the DeepSpeed (Rasley et al., 2020) FLOPs profiler to estimate total FLOPs in training. We also use just-in-time compilation (torch.compile) to achieve a 10-15% speedup in training time. We use StreamingDataset (Team, 2022) to enable fast data loading.
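The following sketch wires together the pieces mentioned above (StreamingDataset for data loading, torch.compile, and bfloat16 autocast). The dataset path and the toy model are placeholders rather than our actual training pipeline.

```python
# Minimal sketch of the data-loading and compilation setup; the dataset path and
# the toy model below are placeholders, not the released training pipeline.
import torch
import torch.nn as nn
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# StreamingDataset reads MDS shards and handles shuffling/resumption across workers.
dataset = StreamingDataset(local="/data/mds/cifar-captions", shuffle=True, batch_size=256)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

model = nn.Linear(1024, 1024).cuda()   # stand-in for the diffusion transformer
model = torch.compile(model)           # just-in-time compilation: ~10-15% faster steps

x = torch.randn(256, 1024, device="cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):  # bf16 mixed-precision training/inference
    y = model(x)
```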

Table 5: Hyperparameters across both phases of our large-scale training setup.

| Resolution          | 256×256 (Phase-I) | 256×256 (Phase-I) | 512×512 (Phase-II) | 512×512 (Phase-II) |
| Masking ratio       | 0.75              | 0                 | 0.75               | 0                  |
| Training steps      | 250,000           | 30,000            | 50,000             | 5,000              |
| Batch size          | 2048              | 2048              | 2048               | 2048               |
| Learning rate       | 2.4×10⁻⁴          | 8×10⁻⁵            | 8×10⁻⁵             | 8×10⁻⁵             |
| Weight decay        | 0.1               | 0.1               | 0.1                | 0.1                |
| Optimizer           | AdamW             | AdamW             | AdamW              | AdamW              |
| Momentum            | β₁=0.9, β₂=0.999  | β₁=0.9, β₂=0.999  | β₁=0.9, β₂=0.999   | β₁=0.9, β₂=0.999   |
| Optimizer epsilon   | 1×10⁻⁸            | 1×10⁻⁸            | 1×10⁻⁸             | 1×10⁻⁸             |
| LR scheduler        | Cosine            | Constant          | Constant           | Constant           |
| Warmup steps        | 2500              | 0                 | 500                | 0                  |
| Gradient clip norm  | 0.25              | 0.25              | 0.5                | 0.5                |
| EMA                 | no EMA            | no EMA            | 0.99975            | 0.9975             |
| Hflip               | False             | False             | False              | False              |
| Precision           | bf16              | bf16              | bf16               | bf16               |
| Layernorm precision | bf16              | bf16              | bf16               | bf16               |
| QK-normalization    | True              | True              | True               | True               |
| (P_mean, P_std)     | (−0.6, 1.2)       | (−0.6, 1.2)       | (0, 0.6)           | (0, 0.6)           |
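For reference, the snippet below is a minimal sketch of the Phase-I optimizer and learning-rate schedule from Table 5 (AdamW with cosine decay after warmup, and gradient clipping). The model is a stand-in, and the linear warmup shape is our assumption.

```python
# Sketch of the Phase-I optimizer/schedule in Table 5; `model` is a placeholder
# and the linear warmup shape is an assumption.
import math
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the DiT backbone
opt = torch.optim.AdamW(model.parameters(), lr=2.4e-4, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=0.1)

warmup, total_steps = 2500, 250_000
def lr_lambda(step: int) -> float:
    if step < warmup:                            # linear warmup
        return step / max(1, warmup)
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * t))     # cosine decay to zero
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# one illustrative update step
loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)  # gradient clip norm
opt.step(); sched.step(); opt.zero_grad()
```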

Choice of text encoder. To convert discrete token sequences in captions into high-dimensional feature embeddings, two common choices of text encoder are CLIP (Radford et al., 2021; Ilharco et al., 2021) and T5 (Raffel et al., 2020), with T5-xxl (https://huggingface.co/DeepFloyd/t5-v1_1-xxl) embeddings narrowly outperforming equivalent CLIP embeddings (Saharia et al., 2022). However, T5-xxl embeddings pose both compute and storage challenges: computing them takes an order of magnitude more time than a CLIP ViT-H/14 model, and pre-computed embeddings require 6× more disk space. Overall, using T5-xxl embeddings is more demanding (Table 6). Following the observation that large text encoders trained on higher-quality data tend to perform better (Saharia et al., 2022), we use a state-of-the-art CLIP model as our text encoder (Fang et al., 2023), specifically the DFN-5B text encoder (https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378).
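As a minimal sketch of the corresponding pre-computation step, the snippet below embeds captions with an open_clip DFN CLIP text encoder and caches them in float16. The exact model/pretrained tags depend on the open_clip version, and the training pipeline caches the full per-token feature sequence rather than the pooled vector shown here.

```python
# Sketch of pre-computing caption embeddings with a DFN CLIP text encoder via
# open_clip; model/pretrained tags are assumptions that may differ by version.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-H-14-quickgelu", pretrained="dfn5b")
tokenizer = open_clip.get_tokenizer("ViT-H-14-quickgelu")
model = model.eval().cuda()

captions = ["A rabbit magician pulling a hat.", "A frog on a lotus leaf."]
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    tokens = tokenizer(captions).cuda()     # CLIP context length of 77 tokens
    text_emb = model.encode_text(tokens)    # pooled (B, 1024) embedding for ViT-H-14
torch.save(text_emb.to(torch.float16).cpu(), "text_embeddings.pt")  # float16 on disk
```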

Cost analysis. We translate the wall-clock time of training into financial cost using a $30/hour budget for an 8×H100 GPU cluster. Since the cost of H100 GPUs fluctuates significantly across vendors, we base our estimate on the commonly used cost estimate for A100 GPUs (Chen et al., 2023), in particular $1.5/A100/hour (https://cloud-gpus.com/). We observe a 2.5× reduction in wall-clock time on H100 GPUs relative to A100 GPUs, and thus assume a cost of $3.75/H100/hour. We also report wall-clock time, to benchmark the training cost independently of fluctuating GPU prices in the AI economy.
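The arithmetic behind these rates is summarized in the short calculation below; the wall-clock time used here is only an illustrative placeholder, not a reported timing.

```python
# Worked example of the cost conversion above; rates are taken from the text,
# and `wall_clock_hours` is an illustrative placeholder.
a100_per_hour = 1.50                          # $/A100/hour (cloud estimate)
h100_speedup = 2.5                            # observed wall-clock reduction vs. A100
h100_per_hour = a100_per_hour * h100_speedup  # -> $3.75/H100/hour
node_per_hour = 8 * h100_per_hour             # 8xH100 node -> $30/hour

wall_clock_hours = 10.0                       # placeholder value
print(f"estimated training cost: ${wall_clock_hours * node_per_hour:,.2f}")
```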

Datasets. We use the following five datasets to train our final models:

  • Conceptual Captions (real). This dataset was released by Google and includes 12 million pairs of image URLs and captions (Changpinyo et al., 2021). Our downloaded version of the dataset includes 10.9M image-caption pairs.

  • Segment Anything-1B (real). Segment Anything comprises 11.1M high-resolution images originally released by Meta for segmentation tasks (Kirillov et al., 2023). Since the images do not have corresponding real captions, we use synthetic captions generated by the LLaVA model (Chen et al., 2023; Liu et al., 2023).

  • TextCaps (real). This dataset comprises 28K images containing natural text (Sidorov et al., 2020). Each image has five associated descriptions, which we combine into a single caption using the Phi-3 language model (Abdin et al., 2024).

  • JourneyDB (synthetic). JourneyDB is a synthetic image dataset comprising 4.4M high-resolution Midjourney image-caption pairs (Sun et al., 2024). We use the train subset (4.2M samples) of this dataset.

  • DiffusionDB (synthetic). DiffusionDB is a collection of 14M synthetic images generated primarily by Stable Diffusion models (Wang et al., 2022). We filter out poor-quality images from this dataset, resulting in 10.7M samples, and use this dataset only in the first phase of pretraining.

CIFAR-Captions. We construct a text-to-image dataset, named cifar-captions, that closely resembles existing web-curated datasets and serves as a drop-in replacement for them in our setup. cifar-captions is closed-domain and only includes images of ten classes (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks), imitating the widely used CIFAR-10 classification dataset (Krizhevsky et al., 2009). In contrast to other small-scale text-to-image datasets that are open-world, such as subsets of Conceptual Captions (Changpinyo et al., 2021), we observe fast convergence of diffusion models on this dataset. We create this dataset by first downloading 120M images from the coyo-700M dataset (Byeon et al., 2022), observing a download success rate of approximately 60%. Next, we use a ViT-H-14 CLIP model (Fang et al., 2023) to measure the CLIP score, averaged over the eighteen prompt templates (such as 'a photo of a {}.') used in the original CLIP model (Radford et al., 2021). We select images with CLIP scores higher than 0.25 (a 1.25% acceptance rate), resulting in a total of 1.3M images. As the real captions are highly noisy and uninformative, we replace them with synthetic captions generated with a LLaVA-1.5 model (Liu et al., 2023).
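A minimal sketch of this template-averaged CLIP-score filter is shown below; the open_clip model tags, the two example templates, and the per-class thresholding are illustrative assumptions rather than our exact filtering script.

```python
# Sketch of template-averaged CLIP-score filtering for cifar-captions; model tags,
# the template subset, and per-class thresholding are assumptions.
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-H-14-quickgelu", pretrained="dfn5b")
tokenizer = open_clip.get_tokenizer("ViT-H-14-quickgelu")
model = model.eval()

classes = ["airplane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]  # illustrative subset of the 18 templates

img = preprocess(Image.open("candidate.jpg")).unsqueeze(0)
with torch.no_grad():
    img_feat = F.normalize(model.encode_image(img), dim=-1)
    keep = False
    for c in classes:
        txt = tokenizer([t.format(c) for t in templates])
        txt_feat = F.normalize(model.encode_text(txt), dim=-1)
        score = (img_feat @ txt_feat.T).mean().item()  # CLIP score averaged over templates
        keep |= score > 0.25                           # ~1.25% of downloaded images pass
print("keep image:", keep)
```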

Table 6: Comparing the computational cost and storage overhead of CLIP (Radford et al., 2021; Fang et al., 2023) and T5 (Raffel et al., 2020; Sutawika et al., 2024) text encoders. We use the state-of-the-art CLIP model from Fang et al. (2023). We report the compute and storage cost for 1M image captions. Even though T5 embeddings achieve better generation quality, especially for text generation, computing them is an order of magnitude slower than CLIP embeddings, and even precomputing them for our dataset (37M images) poses high storage overheads (36.4 TB). We use one H100 GPU to estimate the time to process 1M image captions. We save the embeddings in float16 precision.

| Text encoder    | Sequence length | Embedding size | Time (min:sec) | Storage (GB) |
| CLIP (ViT-H-14) | 77              | 1024           | 3:20           | 157          |
| T5 (T5-xxl)     | 120             | 4096           | 33:04          | 983          |
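The storage column follows directly from the embedding shapes: sequence length × embedding size × 2 bytes (float16) per caption, times 1M captions. A quick check:

```python
# Worked check of the storage column in Table 6 (per-token float16 embeddings).
def storage_gb(seq_len: int, emb_size: int, n_captions: int = 1_000_000) -> float:
    return seq_len * emb_size * 2 * n_captions / 1e9  # float16 = 2 bytes per value

print(storage_gb(77, 1024))   # ~157.7 GB for CLIP ViT-H-14
print(storage_gb(120, 4096))  # ~983.0 GB for T5-xxl
```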

Evaluation metrics. We use the following evaluation metrics to assess the quality of synthetic images generated by the text-to-image models.

  • FID. Fréchet Inception Distance (FID) measures the 2-Wasserstein distance between real and synthetic data distributions in the feature space of a pretrained vision model. We use the clean-fid implementation (Parmar et al., 2022; https://github.com/GaParmar/clean-fid) for a robust evaluation of the FID score.

  • Clip-FID. Unlike FID, which uses an Inception-v3 model (Szegedy et al., 2016) trained on ImageNet (Deng et al., 2009), Clip-FID uses a CLIP model (Radford et al., 2021), trained on a much larger dataset than ImageNet, as the image feature extractor. We use the default ViT-B/32 CLIP model from the clean-fid library to measure Clip-FID.

  • Clip-score. It measures the similarity between a caption and the generated image corresponding to the caption. In particular, it measures the cosine similarity between normalized caption and image embeddings calculated using a CLIP text and image encoder, respectively.
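As an illustration of how the first two metrics are computed in practice, the snippet below is a minimal sketch using the folder-to-folder interface of the clean-fid package; directory names are placeholders.

```python
# Sketch of FID / Clip-FID evaluation with the clean-fid package; the two image
# directories are placeholders.
from cleanfid import fid

# FID with the default Inception-v3 feature extractor
fid_score = fid.compute_fid("generated_images/", "reference_images/")

# Clip-FID: the same distance computed in CLIP ViT-B/32 feature space
clip_fid_score = fid.compute_fid("generated_images/", "reference_images/",
                                 model_name="clip_vit_b_32")

print(f"FID: {fid_score:.2f}, Clip-FID: {clip_fid_score:.2f}")
```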

(a) Ablating the choice of β₂ in the Adam optimizer. Unlike LLM training, where β₂ is often set to 0.95 (Touvron et al., 2023; Brown et al., 2020), we find that image generation quality consistently degrades as we reduce β₂.

| β₂    | FID (↓) | Clip-FID (↓) | Clip-score (↑) |
| 0.999 | 8.53    | 4.85         | 26.88          |
| 0.99  | 8.63    | 4.94         | 26.75          |
| 0.95  | 8.71    | 5.02         | 26.71          |
| 0.9   | 8.81    | 5.13         | 26.61          |
(b) Ablating the choice of weight decay in the AdamW optimizer. In resonance with transformer training in LLMs, we observe an improvement in performance as weight-decay regularization increases.

| Weight decay | FID (↓) | Clip-FID (↓) | Clip-score (↑) |
| 0.00         | 8.77    | 5.03         | 26.82          |
| 0.01         | 8.73    | 5.03         | 26.82          |
| 0.03         | 8.53    | 4.85         | 26.88          |
| 0.10         | 8.38    | 4.90         | 27.00          |
(c) Ablating the parameters of the noise distribution. We use a patch-mixer with β₂ = 0.999 and weight decay 0.1. We observe a tradeoff between FID and Clip-score for the first two choices and set (m, s) to (−0.6, 1.2) in all follow-up experiments.

| (m, s)       | FID (↓) | Clip-FID (↓) | Clip-score (↑) |
| (−1.2, 1.2)  | 8.38    | 4.90         | 27.00          |
| (−0.6, 1.2)  | 8.49    | 4.93         | 27.47          |
| (−0.6, 0.6)  | 9.05    | 6.72         | 26.95          |
| (−0.25, 0.6) | 10.44   | 7.51         | 27.46          |
| (0.0, 0.6)   | 12.76   | 9.00         | 27.40          |
(d) Ablating the block size in block sampling. We observe consistent performance degradation with block masking. At a block size of 8 (latent image with 16×16 = 256 patches), block sampling collapses to sampling a quadrant, so we instead sample a single contiguous square patch.

| Block size | FID (↓) |