
MaskGIT: Masked Generative Image Transformer

Huiwen Chang Han Zhang Lu Jiang Ce Liu  William T. Freeman

Google Research
Abstract

Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.

Figure 1: Example generation by MaskGIT on image synthesis and manipulation tasks. We show that MaskGIT is a flexible model that can generate high-quality samples on (a) class-conditional synthesis, (b) class-conditional image manipulation, e.g. replacing selected objects in the bounding box with ones from the given classes, and (c) image extrapolation. Examples shown here have resolutions 512×512, 512×512, and 512×2560 in the three columns, respectively. Zoom in to see the details.
Footnote: Currently affiliated with Microsoft Azure AI.

1 Introduction

Figure 2: Comparison between sequential decoding and MaskGIT’s scheduled parallel decoding. Rows 1 and 3 are the input latent masks at each iteration, and rows 2 and 4 are samples generated by each model at that iteration. Our decoding starts with all unknown codes (marked in lighter gray), and gradually fills up the latent representation with more and more scattered predictions in parallel (marked in darker gray), where the number of predicted tokens increases sharply over iterations. MaskGIT finishes its decoding in 8 iterations compared to the 256 rounds the sequential method takes.

Deep image synthesis as a field has seen a lot of progress in recent years. Currently holding state-of-the-art results are Generative Adversarial Networks (GANs), which are capable of synthesizing high-fidelity images at blazing speeds. However, they suffer from well-known issues including training instability and mode collapse, which lead to a lack of sample diversity. Addressing these issues remains an open research problem.

Inspired by the success of Transformer [48] and GPT [5] in NLP, generative transformer models have received growing interest in image synthesis [7, 15, 37]. Generally, these approaches aim at modeling an image like a sequence and leveraging existing autoregressive models to generate images. Images are generated in two stages: the first stage quantizes an image into a sequence of discrete tokens (or visual words). In the second stage, an autoregressive model (e.g., a transformer) is learned to generate image tokens sequentially based on the previously generated result (i.e. autoregressive decoding). Unlike the subtle min-max optimization used in GANs, these models are learned by maximum likelihood estimation. Because of the design differences, existing works have demonstrated their advantages over GANs in offering stabilized training and improved distribution coverage or diversity.

Existing works on generative transformers mostly focus on the first stage, i.e. how to quantize images such that information loss is minimized, and share the same second stage borrowed from NLP. Consequently, even the state-of-the-art generative transformers [15, 35] still treat an image naively as a sequence, where an image is flattened into a 1D sequence of tokens following a raster scan ordering, i.e. from left to right line-by-line (cf. Figure 2). We find this representation neither optimal nor efficient for images. Unlike text, images are not sequential. Imagine how an artwork is created. A painter starts with a sketch and then progressively refines it by filling or tweaking the details, which is in clear contrast to the line-by-line printing used in previous work [7, 15]. Additionally, treating an image as a flat sequence means that the autoregressive sequence length grows quadratically with resolution, easily forming an extremely long sequence, longer than any natural language sentence. This not only poses challenges for modeling long-term correlation but also renders decoding intractable. For example, it takes a considerable 30 seconds to generate a single image autoregressively on a GPU with 32×32 tokens.

This paper introduces a new bidirectional transformer for image synthesis called Masked Generative Image Transformer (MaskGIT). During training, MaskGIT is trained on a proxy task similar to the mask prediction in BERT [11]. At inference time, MaskGIT adopts a novel non-autoregressive decoding method to synthesize an image in a constant number of steps. Specifically, at each iteration, the model predicts all tokens simultaneously in parallel but only keeps the most confident ones. The remaining tokens are masked out and will be re-predicted in the next iteration. The mask ratio is decreased until all tokens are generated within a few iterations of refinement. As illustrated in Figure 2, MaskGIT's decoding is an order of magnitude faster than autoregressive decoding as it only takes 8 steps, instead of 256, to generate an image, and the predictions within each step are parallelizable. Moreover, instead of conditioning only on previous tokens in raster scan order, bidirectional self-attention allows the model to generate new tokens from generated tokens in all directions. We find that the mask scheduling (i.e. the fraction of the image masked at each iteration) significantly affects generation quality. We propose to use the cosine schedule and substantiate its efficacy in the ablation study.

On the ImageNet benchmark, we empirically demonstrate that MaskGIT is both significantly faster (by up to 64x) and capable of generating higher-quality samples than the state-of-the-art autoregressive transformer, i.e. VQGAN, on class-conditional generation at 256×256 and 512×512 resolution. Even compared with the leading GAN model, i.e. BigGAN, and diffusion model, i.e. ADM [12], MaskGIT offers comparable sample quality while yielding more favorable diversity. Notably, our model establishes a new state of the art on classification accuracy score (CAS) [36] and on FID [23] for synthesizing 512×512 images. To our knowledge, this paper provides the first evidence demonstrating the efficacy of masked modeling for image generation on the common ImageNet benchmark.

Furthermore, MaskGIT's multidirectional nature makes it readily extendable to image manipulation tasks that are otherwise difficult for autoregressive models. Fig. 1 shows a new application of class-conditional image editing in which MaskGIT re-generates content inside the bounding box based on the given class while keeping the context (outside of the box) unchanged. This task, which is either infeasible for autoregressive models or difficult for GAN models, is trivial for ours. Quantitatively, we demonstrate this flexibility by applying MaskGIT to image inpainting and to image extrapolation in arbitrary directions. Even though our model is not designed for such tasks, it obtains performance comparable to dedicated models on each task.

2 Related Work

2.1 Image Synthesis

Deep generative models [29, 45, 17, 53, 41, 12, 46, 34] have achieved great success in image synthesis tasks. GAN-based methods demonstrate an amazing capability for yielding high-fidelity samples [17, 4, 27, 53, 44]. In contrast, likelihood-based methods, such as Variational Autoencoders (VAEs) [29, 45], Diffusion Models [41, 12, 24] and Autoregressive Models [46, 34], offer distribution coverage and hence can generate more diverse samples [41, 45, 46].

However, maximizing likelihood directly in pixel space can be challenging. Instead, VQVAE [47, 37] proposes to generate images in latent space in two stages. The first stage, known as tokenization, compresses images into a discrete latent space and primarily consists of three components:

  • an encoder $E$ that learns to tokenize images $x\in\mathbb{R}^{H\times W\times 3}$ into latent embeddings $E(x)$,

  • a codebook $\mathbf{e}_{k}\in\mathbb{R}^{D},\ k\in\{1,2,\cdots,K\}$ that serves as a nearest-neighbor lookup used to quantize the embeddings into visual tokens, and

  • a decoder $G$ which predicts the reconstructed image $\hat{x}$ from the visual tokens $\mathbf{e}$.

In the second stage, it first predicts the latent priors of the visual tokens using a deep autoregressive model, and then uses the decoder from the first stage to map the token sequence back to image pixels. Several approaches have followed this paradigm due to the efficacy of the two-stage approach. DALL-E [35] uses Transformers [48] to improve token prediction in the second stage. VQGAN [15] adds an adversarial loss and a perceptual loss [26, 54] in the first stage to improve image fidelity. A contemporary work to ours, VIM [51], proposes to use a ViT backbone [13] to further improve the tokenization stage. Since these approaches still employ an autoregressive model, the decoding time in the second stage scales with the token sequence length.
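For concreteness, the nearest-neighbor quantization in the first stage can be sketched as below (Python/PyTorch). The shapes, names (`codebook`, `quantize`, `lookup`), and the random codebook are illustrative assumptions rather than the trained VQGAN tokenizer used in this paper; the encoder $E$ and decoder $G$ themselves are omitted.

```python
import torch

# Illustrative sketch of the codebook lookup; K, D and the random codebook are
# assumptions, not the trained tokenizer weights.
K, D = 1024, 256                       # codebook size and embedding dimension
codebook = torch.randn(K, D)           # e_k in R^D, k = 1, ..., K (learned in practice)

def quantize(z):
    """Map encoder outputs E(x), shaped (h*w, D), to discrete visual tokens."""
    dist = torch.cdist(z, codebook)    # (h*w, K) pairwise Euclidean distances
    return dist.argmin(dim=1)          # nearest-neighbor token indices in [0, K)

def lookup(tokens):
    """Retrieve codebook entries; the decoder G maps these back to pixels."""
    return codebook[tokens]            # (h*w, D)
```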

2.2 Masked Modeling with Bi-directional Transformers

The transformer architecture [48], was first proposed in NLP, and has recently extended its reach to computer vision [13, 6]. Transformer consists of multiple self-attention layers, allowing interactions between all pairs of elements in the sequence to be captured. In particular, BERT [11] introduces the masked language modeling (MLM) task for language representation learning. The bi-directional self-attention used in BERT [11] allows the masked tokens in MLM to be predicted utilizing context from both directions. In vision, the masked modeling in BERT [11] has been extended to image representation learning [21, 2] with images quantized to discrete tokens. However, few works have successfully applied the same masked modeling to image generation [56] because of the difficulty in performing autoregressive decoding using bi-directional attentions. To our knowledge, this paper provides the first evidence demonstrating the efficacy of masked modeling for image generation on the common ImageNet benchmark. Our work is inspired by bi-directional machine translation [16, 20, 19] in NLP, and our novelty lies in the proposed new masking strategy and decoding algorithm which, as substantiated by our experiments, are essential for image generation.

Figure 3: Pipeline Overview. MaskGIT follows a two-stage design, with 1) a tokenizer that tokenizes images into visual tokens, and 2) a bidirectional transformer model that performs MVTM, i.e. learns to predict visual tokens masked at random.

3 Method

Our goal is to design a new image synthesis paradigm utilizing parallel decoding and bi-directional generation.

We follow the two-stage recipe discussed in 2.1, as illustrated in Figure 3. Since our goal is to improve the second stage, we employ the same setup for the first stage as in the VQGAN model [15], and leave potential improvements to the tokenization step to future work.

For the second stage, we propose to learn a bidirectional transformer by Masked Visual Token Modeling (MVTM). We introduce MVTM training in 3.1 and the sampling procedure in 3.2. We then discuss the key technique of masking design in 3.3.

3.1 MVTM in Training

Let $\mathbf{Y}=[y_{i}]_{i=1}^{N}$ denote the latent tokens obtained by inputting the image to the VQ-encoder, where $N$ is the length of the reshaped token matrix, and $\mathbf{M}=[m_{i}]_{i=1}^{N}$ the corresponding binary mask. During training, we sample a subset of tokens and replace them with a special [MASK] token. The token $y_{i}$ is replaced with [MASK] if $m_{i}=1$; otherwise, when $m_{i}=0$, $y_{i}$ is left intact.

The sampling procedure is parameterized by a mask scheduling function $\gamma(r)\in(0,1]$, and executes as follows: we first sample a ratio $r$ from $0$ to $1$, then uniformly select $\lceil\gamma(r)\cdot N\rceil$ tokens in $\mathbf{Y}$ to place masks, where $N$ is the length. The mask scheduling significantly affects the quality of image generation and will be discussed in 3.3.

Denote by $Y_{\overline{\mathbf{M}}}$ the result after applying mask $\mathbf{M}$ to $\mathbf{Y}$. The training objective is to minimize the negative log-likelihood of the masked tokens:

$$\mathcal{L}_{\text{mask}}=-\mathop{\mathbb{E}}_{\mathbf{Y}\in\mathcal{D}}\Big[\sum_{\forall i\in[1,N],\,m_{i}=1}\log p(y_{i}\,|\,Y_{\overline{\mathbf{M}}})\Big]. \tag{1}$$

Concretely, we feed the masked $Y_{\overline{\mathbf{M}}}$ into a multi-layer bidirectional transformer to predict the probabilities $P(y_{i}|Y_{\overline{\mathbf{M}}})$ for each masked token, where the negative log-likelihood is computed as the cross-entropy between the ground-truth one-hot token and the predicted token. Notice the key difference to autoregressive modeling: the conditional dependency in MVTM goes in both directions, which allows image generation to utilize richer contexts by attending to all tokens in the image.
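As a rough illustration, a single MVTM training step could look like the sketch below (PyTorch). The `transformer` is assumed to be any bidirectional model mapping a (B, N) token sequence to (B, N, K) logits over the codebook, and `MASK_ID` is the assumed id of the special [MASK] token; neither name comes from the released implementation, and the cosine choice for $\gamma$ anticipates 3.3.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 1024   # assumed id of the special [MASK] token (codebook indices are 0..1023)

def gamma(r):
    """Cosine mask scheduling function, gamma(r) in (0, 1] for r in [0, 1)."""
    return torch.cos(math.pi / 2.0 * r)

def mvtm_training_step(transformer, tokens):
    """One MVTM step; tokens is a (B, N) LongTensor of visual tokens from the VQ-encoder."""
    B, N = tokens.shape
    r = torch.rand(B)                                    # ratio r sampled from [0, 1)
    num_mask = torch.ceil(gamma(r) * N).long()           # ceil(gamma(r) * N) tokens per sample
    # Uniformly choose which positions to mask: rank random scores, mask the lowest ones.
    scores = torch.rand(B, N)
    cutoff = scores.sort(dim=1).values.gather(1, (num_mask - 1).unsqueeze(1))
    mask = scores <= cutoff                              # (B, N) boolean, True means masked
    inputs = tokens.masked_fill(mask, MASK_ID)           # the masked sequence Y_M-bar
    logits = transformer(inputs)                         # (B, N, K) bidirectional predictions
    # Negative log-likelihood of Eq. (1): cross-entropy on the masked positions only.
    return F.cross_entropy(logits[mask], tokens[mask])
```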

3.2 Iterative Decoding

In autoregressive decoding, tokens are generated sequentially based on previously generated output. This process is not parallelizable and is thus very slow for images, because the image token length, e.g. 256 or 1024, is typically much larger than that of language. We introduce a novel decoding method in which all tokens in the image are generated simultaneously in parallel. This is feasible due to the bidirectional self-attention of MVTM.

In theory, our model is able to infer all tokens and generate the entire image in a single pass. We find this challenging due to inconsistency with the training task. Below, the proposed iterative decoding is introduced. To generate an image at inference time, we start from a blank canvas with all the tokens masked out, i.e. $Y_{\mathbf{M}}^{(0)}$. For iteration $t$, our algorithm runs as follows:

  1. Predict. Given the masked tokens $Y_{\mathbf{M}}^{(t)}$ at the current iteration, our model predicts the probabilities, denoted as $p^{(t)}\in\mathbb{R}^{N\times K}$, for all the masked locations in parallel.

  2. Sample. At each masked location $i$, we sample a token $y_{i}^{(t)}$ based on its prediction probabilities $p_{i}^{(t)}\in\mathbb{R}^{K}$ over all possible tokens in the codebook. After a token $y_{i}^{(t)}$ is sampled, its corresponding prediction score is used as a “confidence” score indicating the model's belief in this prediction. For the unmasked positions in $Y_{\mathbf{M}}^{(t)}$, we simply set the confidence score to $1.0$.

  3. Mask Schedule. We compute the number of tokens to mask according to the mask scheduling function $\gamma$ by $n=\lceil\gamma(\frac{t}{T})N\rceil$, where $N$ is the input length and $T$ is the total number of iterations.

  4. Mask. We obtain $Y_{\mathbf{M}}^{(t+1)}$ by masking $n$ tokens in $Y_{\mathbf{M}}^{(t)}$. The mask $\mathbf{M}^{(t+1)}$ for iteration $t+1$ is calculated from:

$$m_{i}^{(t+1)}=\begin{cases}1,&\text{if }c_{i}<\text{sorted}_{j}(c_{j})[n],\\ 0,&\text{otherwise,}\end{cases}$$

  where $c_{i}$ is the confidence score for the $i$-th token.

The decoding algorithm synthesizes an image in $T$ steps. At each iteration, the model predicts all tokens simultaneously but only keeps the most confident ones. The remaining tokens are masked out and re-predicted in the next iteration. The mask ratio decreases until all tokens are generated within $T$ iterations. In practice, the masked tokens are randomly sampled with temperature annealing to encourage more diversity, and we discuss its effect in the ablation study. Figure 2 illustrates an example of our decoding process. It generates an image in $T=8$ iterations, where the unmasked tokens at each iteration are highlighted in the grid, e.g. when $t=1$ we only keep 1 token and mask out the rest.
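A minimal sketch of this iterative decoding loop is given below, reusing the assumed `transformer` interface and `MASK_ID` from the training sketch and the cosine schedule. Temperature annealing on the masking step and class conditioning are omitted, and evaluating $\gamma$ at $(t+1)/T$ (so that the last iteration leaves nothing masked) is an indexing assumption rather than the exact released implementation.

```python
import math
import torch

def maskgit_decode(transformer, N=256, T=8, mask_id=1024):
    """Generate N visual tokens in T parallel refinement steps (batch size 1)."""
    tokens = torch.full((1, N), mask_id, dtype=torch.long)        # blank canvas Y_M^(0)
    is_masked = torch.ones(1, N, dtype=torch.bool)
    for t in range(T):
        # 1. Predict: token distributions for every position, in parallel.
        probs = transformer(tokens).softmax(dim=-1)                    # (1, N, K)
        # 2. Sample a token at each location and take its probability as confidence.
        sampled = torch.distributions.Categorical(probs=probs).sample()  # (1, N)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)     # (1, N)
        tokens = torch.where(is_masked, sampled, tokens)               # keep known tokens
        conf = torch.where(is_masked, conf, torch.ones_like(conf))     # known tokens: conf 1.0
        # 3. Mask schedule: how many tokens remain masked after this iteration.
        n = math.ceil(math.cos(math.pi / 2.0 * (t + 1) / T) * N)
        if n == 0:
            break                                                      # everything finalized
        # 4. Mask: re-mask the n least confident tokens for the next iteration.
        cutoff = conf.sort(dim=-1).values[:, n - 1]
        is_masked = conf <= cutoff.unsqueeze(-1)
        tokens = tokens.masked_fill(is_masked, mask_id)
    return tokens
```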

3.3 Masking Design

We find that the quality of image generation is significantly affected by the masking design. We model the masking procedure by a mask scheduling function $\gamma(\cdot)$ that computes the mask ratio for the given latent tokens. As discussed, the function $\gamma$ is used in both training and inference. During inference, it takes as input $0/T,1/T,\cdots,(T-1)/T$, indicating the progress in decoding. In training, we randomly sample a ratio $r$ in $[0,1)$ to simulate the various decoding scenarios.

BERT uses a fixed mask ratio of 15% [11], i.e., it always masks 15% of the tokens, which is unsuitable for our task since our decoder needs to generate images from scratch. A new masking schedule is thus needed. Before discussing specific schemes, we first examine the properties of the mask scheduling function. First, $\gamma(r)$ needs to be a continuous function bounded between $0$ and $1$ for $r\in[0,1]$. Second, $\gamma(r)$ should be (monotonically) decreasing with respect to $r$, and it holds that $\gamma(0)\rightarrow 1$ and $\gamma(1)\rightarrow 0$. The second property ensures the convergence of our decoding algorithm.

This paper considers common functions and makes simple transformations so that they satisfy the properties. Figure 8 visualizes these functions which are divided into three groups:

  • Linear function is a straightforward solution, which masks an equal number of tokens each time.

  • Concave function captures the intuition that image generation follows a less-to-more information flow. In the beginning, most tokens are masked, so the model only needs to make a few correct predictions about which it feels confident. Towards the end, the mask ratio drops sharply, forcing the model to make many more correct predictions. The effective information increases throughout this process. The concave family includes cosine, square, cubic, and exponential.

  • Convex function, conversely, implements a more-to-less process. The model needs to finalize a vast majority of tokens within the first couple of iterations. This family includes square root and logarithmic.

We empirically compare the above mask scheduling functions in the ablation study (4.4) and find that the cosine function works the best in all of our experiments.
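For reference, simple transformed forms of these candidates that satisfy the two properties above can be written as below; the exact expressions (and the omitted exponential and logarithmic variants) are illustrative assumptions rather than the precise definitions used in the ablation.

```python
import math

# Candidate mask scheduling functions gamma(r), with r = t/T in [0, 1].
# These transformed forms satisfy gamma(0) -> 1, gamma(1) -> 0 and are
# monotonically decreasing; the exponential and logarithmic variants from
# Table 3 are omitted.
SCHEDULES = {
    "linear":      lambda r: 1.0 - r,                     # masks an equal amount each step
    "cosine":      lambda r: math.cos(math.pi / 2 * r),   # concave (our default)
    "square":      lambda r: 1.0 - r ** 2,                # concave
    "cubic":       lambda r: 1.0 - r ** 3,                # concave
    "square_root": lambda r: 1.0 - math.sqrt(r),          # convex
}

def num_masked(name, t, T, N):
    """Number of tokens still masked after iteration t of T, out of N tokens."""
    return math.ceil(SCHEDULES[name](t / T) * N)
```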

4 Experiments

In this section, we empirically evaluate MaskGIT on image generation in terms of quality, efficiency and flexibility. In 4.2, we evaluate MaskGIT on the standard class-conditional image generation task on ImageNet [10] at 256×256 and 512×512. In 4.3, we show MaskGIT's versatility by demonstrating its performance on three image editing tasks: image inpainting, outpainting, and class-conditional editing. In 4.4, we verify the necessity of our mask scheduling design. We will release the code and model for reproducible research.

4.1 Experimental Setup

For each dataset, we train only a single autoencoder, decoder, and codebook with 1024 tokens on cropped 256×256 images for all the experiments. The image is always compressed by a fixed factor of 16, i.e. from $H\times W$ to a grid of tokens of size $h\times w$, where $h=H/16$ and $w=W/16$. We find that this autoencoder, together with the codebook, can be reused to synthesize 512×512 images.

All models in this work have the same configuration: 24 layers, 8 attention heads, 768 embedding dimensions and 3072 hidden dimensions. Our models use learnable positional embeddings [48], LayerNorm [1], and truncated normal initialization (stddev $=0.02$). We employ the following training hyperparameters: label smoothing $=0.1$, dropout rate $=0.1$, Adam optimizer [28] with $\beta_{1}=0.9$ and $\beta_{2}=0.96$. We use RandomResizeAndCrop for data augmentation. All models are trained on 4x4 TPU devices with a batch size of 256. ImageNet models are trained for 300 epochs while the Places2 model is trained for 200 epochs.
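Collected as a plain dictionary, the setup above reads as follows; the field names are our own shorthand, not the released configuration schema.

```python
# A sketch of the shared model/training configuration described above;
# field names are illustrative shorthand.
MASKGIT_CONFIG = dict(
    num_layers=24,
    num_heads=8,
    embed_dim=768,
    hidden_dim=3072,
    codebook_size=1024,
    downsample_factor=16,        # H x W image -> (H/16) x (W/16) token grid
    init_stddev=0.02,            # truncated normal initialization
    label_smoothing=0.1,
    dropout=0.1,
    optimizer=dict(name="adam", beta1=0.9, beta2=0.96),
    augmentation="RandomResizeAndCrop",
    batch_size=256,
    epochs=dict(imagenet=300, places2=200),
)
```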

4.2 Class-conditional Image Synthesis

We evaluate the performance of our model on class-conditional image synthesis on ImageNet 256×256 and 512×512. Our main results are summarized in Table 1.

Quality. On ImageNet 256×256, without any special sampling strategies such as beam search, top-k or nucleus sampling heuristics [25] or classifier guidance [37], we significantly outperform VQGAN [15] in both Fréchet Inception Distance (FID) [23] (6.18 vs 15.78) and Inception Score (IS) (182.1 vs 78.3). We also report results with classifier-based rejection sampling in Appendix B.

We also train a VQGAN baseline with the same tokenizer and hyperparameters as MaskGIT’s in order to further highlight the difference between bi-directional and uni-directional transformers, and find that on both resolutions, MaskGIT still outperforms our implemented baseline by a significant margin.

Furthermore, MaskGIT improves over BigGAN's FIDs at both resolutions, achieving a new state of the art on 512×512 with an FID of 7.32.

Figure 4: Transformer wall-clock runtime comparison between VQGAN[15] and ours. All results are run on a single GPU.
| Model | FID ↓ | IS ↑ | Prec ↑ | Rec ↑ | # params | # steps | CAS×100 Top-1 (76.6) ↑ | CAS×100 Top-5 (93.1) ↑ |
|---|---|---|---|---|---|---|---|---|
| **ImageNet 256×256** | | | | | | | | |
| DCTransformer [32] | 36.51 | n/a | 0.36 | 0.67 | 738M | >1024 | | |
| BigGAN-deep [4] | 6.95 | 198.2 | 0.87 | 0.28 | 160M | 1 | 43.99 | 67.89 |
| Improved DDPM [33] | 12.26 | n/a | 0.70 | 0.62 | 280M | 250 | | |
| ADM [12] | 10.94 | 101.0 | 0.69 | 0.63 | 554M | 250 | | |
| VQVAE-2 [37] | 31.11 | ~45 | 0.36 | 0.57 | 13.5B | 5120 | 54.83 | 77.59 |
| VQGAN [15] | 15.78 | 78.3 | n/a | n/a | 1.4B | 256 | | |
| VQGAN† | 18.65 | 80.4 | 0.78 | 0.26 | 227M | 256 | 53.10 | 76.18 |
| MaskGIT (Ours) | 6.18 | 182.1 | 0.80 | 0.51 | 227M | 8 | 63.14 | 84.45 |
| **ImageNet 512×512** | | | | | | | | |
| BigGAN-deep [4] | 8.43 | 232.5 | 0.88 | 0.29 | 160M | 1 | 44.02 | 68.22 |
| ADM [12] | 23.24 | 58.06 | 0.73 | 0.60 | 559M | 250 | | |
| VQGAN† | 26.52 | 66.8 | 0.73 | 0.31 | 227M | 1024 | 51.29 | 74.24 |
| MaskGIT (Ours) | 7.32 | 156.0 | 0.78 | 0.50 | 227M | 12 | 63.43 | 84.79 |
Table 1: Quantitative comparison with state-of-the-art generative models on ImageNet 256×256 and 512×512. “# steps” refers to the number of neural network runs needed to generate a sample. † denotes the model we train with the same architecture and setup as ours; some values are taken from prior publications; the VQVAE-2 IS (~45) is estimated based on the PyTorch implementation [39].
BigGAN-deep (FID=6.95)    MaskGIT (FID=6.18)    Training Set
Figure 5: Sample Diversity Comparison between our proposed method MaskGIT and BigGAN-deep [4] on ImageNet 256×256. The class ids of the samples from top to bottom are 009, 980 and 993, respectively. Please refer to the appendix for more comparisons.
Figure 6: Class-conditional image editing. Given input images on the left of each pair, and a target class “tiger cat”, MaskGIT replaces the bounding-boxed regions with tiger cats, suggesting the composition ability of our model.
Input    —— MaskGIT (Our Samples) ——
Figure 7: Inpainting and outpainting. Given a single input image, MaskGIT synthesizes diverse results for inpainting (first row) and outpainting in different directions (last two rows).
| Task | Model | FID ↓ | IS ↑ |
|---|---|---|---|
| Outpainting (right 50%) | Boundless [43] | 35.02 | 6.15 |
| | In&Out [8] | 23.57 | 7.18 |
| | InfinityGAN [31] | 10.60 | 5.57 |
| | Boundless [43] TF | 7.80 | 5.99 |
| | MaskGIT (Ours)^512 | 6.78 | 11.69 |
| Inpainting (center 50%×50%) | DeepFill [52] | 11.51 | 22.55 |
| | ICT [49] | 13.63 | 17.70 |
| | HiFill [50]^512 | 16.60 | 19.93 |
| | CoModGAN [57]^512 | 7.13 | 21.82 |
| | MaskGIT (Ours)^512 | 7.92 | 22.95 |
Table 2: Quantitative comparisons for inpainting and outpainting on Places2. ^512 marks models evaluated on 512×512 samples, while the others are evaluated on the corresponding 256×256 samples, consistent with their training. Some baseline scores are taken from prior work or evaluated using released models, one of which is trained on a subset of Places2; Boundless TF is evaluated using the TFHub model [18].

Speed. We evaluate model speed by assessing the number of steps, i.e. forward passes, each model requires to generate a sample. As shown in Table 1, MaskGIT requires the fewest steps among all non-GAN-based models on both resolutions.

To further substantiate the speed difference between MaskGIT and autoregressive models, we perform a runtime comparison between MaskGIT's and VQGAN's decoding processes. As illustrated in Figure 4, MaskGIT significantly accelerates VQGAN by 30-64x, with a speedup that gets more pronounced as the image resolution (and thus the input token length) grows.

Diversity. We consider Classification Accuracy Score (CAS) [36] and Precision/Recall [30] as two metrics for evaluating sample diversity, in addition to sample quality.

CAS involves first training a ResNet-50 classifier [22] solely on the samples generated by the candidate model, and then measuring the classifier's classification accuracy on the ImageNet validation set. The last two columns in Table 1 present the CAS results, where the scores of the classifier trained on real ImageNet training data are included for reference (76.6% and 93.1% for top-1 and top-5 accuracy). For image resolution 256×256, we follow the common practice of using RandAugment data augmentation [9], and report the scores trained without augmentation in Appendix B. We find that MaskGIT significantly outperforms prior work VQVAE-2 and VQGAN, establishing a new state of the art in CAS on the ImageNet benchmark at both resolutions.

The Precision/Recall results in Table 1 show that MaskGIT achieves better coverage (Recall) compared to BigGAN, and better sample quality (Precision) compared to likelihood-based models such as VQVAE-2 and diffusion models. Compared to our baseline VQGAN, we improve the diversity as measured by recall while slightly boosting its precision.

In contrast to BigGAN's samples, MaskGIT's samples are more diverse, with more varied lighting, poses, scales and context, as shown in Figure 5. More comparisons are available in Appendix B.

4.3 Image Editing Applications

In this subsection, we present direct applications of MaskGIT to three image editing tasks: class-conditional image editing, image inpainting, and outpainting. All three tasks can be almost trivially translated into ones that MaskGIT can handle if we view each task as a constraint on the initial binary mask $\mathbf{M}$ that MaskGIT uses in its iterative decoding, as discussed in 3.2. We show that without modifications to the architecture or any task-specific training, MaskGIT is capable of generating very compelling results on all three applications. Furthermore, MaskGIT obtains performance comparable to dedicated models on both inpainting and outpainting, even though it is not designed specifically for either task.

Class-conditional Image Editing. We define a new class-conditional image editing task to showcase MaskGIT's flexibility. In this task, the model regenerates the content inside a bounding box conditioned on the given class while preserving the context, i.e. the content outside of the box. The task is infeasible for autoregressive methods because it violates their prediction order.

For MaskGIT, however, this is a trivial task if we treat the bounding box region as the initial mask input to the iterative decoding algorithm. Figure 6 shows a few example results. More can be found in Appendix C.
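Concretely, the only task-specific code is the construction of the initial canvas and mask; a sketch is given below for a 512×512 image (a 32×32 token grid), after which the decoding loop from 3.2 runs unchanged except that it only ever re-masks positions inside the box. Inpainting and outpainting use the same construction with the hole or the extrapolated region as the masked set. Function and argument names are illustrative, not the released interface.

```python
import torch

def editing_init(tokens, box, mask_id=1024, grid=32):
    """Build the initial canvas for class-conditional editing.

    tokens: (1, grid*grid) visual tokens of the input image.
    box:    (top, left, bottom, right) in token-grid coordinates; the region to
            regenerate for the target class. Everything outside stays fixed.
    """
    is_masked = torch.zeros(1, grid, grid, dtype=torch.bool)
    top, left, bottom, right = box
    is_masked[:, top:bottom, left:right] = True       # only the boxed region is unknown
    is_masked = is_masked.reshape(1, -1)
    canvas = tokens.masked_fill(is_masked, mask_id)   # known context + [MASK] inside the box
    return canvas, is_masked
```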

In these examples, we observe that MaskGIT can reasonably replace the selected object while preserving, or to some extent even completing, the context in the background. Furthermore, we find that MaskGIT seems to be capable of synthesizing unnatural yet plausible combinations unseen in the ImageNet training set, e.g. a flying cat, a cat in a soup bowl, and a cat in a flower. This suggests that MaskGIT has incidentally learned useful representations for composition, which may be further exploited in related tasks in future work.

Image Inpainting. Image inpainting, or image completion, is a fundamental image editing task that synthesizes content in missing regions so that the completion looks visually realistic. Traditional patch-based methods [3] work well on texture regions, while deep learning based methods [52, 50, 57, 38, 14] have been demonstrated to synthesize images requiring better semantic coherence. Both approaches have been extensively studied in computer vision.

We extend MaskGIT to this problem by tokenizing the masked image and interpreting the inpainting mask as the initial mask in our iterative decoding. We then composite the output image by linearly blending it with the input based on the masking boundary following [8]. To match the training of our baselines, we train MaskGIT on the 512×512 center-cropped images from the Places2 [58] dataset. All hyperparameters are kept the same as for the MaskGIT model trained on ImageNet.

We compare MaskGIT against common GAN-based baselines, including DeepFillv2 [52] and HiFill [50], on inpainting with a central 50%×50% mask, evaluated on the Places2 validation set. Table 2 summarizes the quantitative comparisons. MaskGIT beats both DeepFill and HiFill in FID and IS by a significant margin, while achieving scores close to the state-of-the-art inpainting approach CoModGAN [57]. We show more qualitative comparisons with CoModGAN in Appendix E.

Image Outpainting. Outpainting, or image extrapolation, is an image editing task that has received increased attention recently. It is seen as a more challenging task than inpainting due to the fewer constraints from surrounding pixels and thus more uncertainty in the predicted regions. Our adaptation of the problem and the model used in the following evaluation are the same as for inpainting.

We compare against common GAN-based baselines, including Boundless [43], In&Out [8], InfinityGAN[31], and CoModGAN[57] on extrapolating rightward with a 50% ratio. We evaluate on the image set generously provided by the authors of InfinityGAN[31] and In&Out[8].

Table 2 summarizes the quantitative comparisons. MaskGIT beats all baselines and achieves state-of-the-art FID and IS. As the examples in Figure 7 illustrate, MaskGIT is also capable of synthesizing diverse results given the same input with different seeds. We observe that MaskGIT completes objects and global structures particularly well, and hypothesize that this is thanks to the model learning useful representations with the global attentions in the transformer.

4.4 Ablation Studies

| γ | T | FID ↓ | IS ↑ | NLL |
|---|---|---|---|---|
| Exponential | 8 | 7.89 | 156.3 | 4.83 |
| Cubic | 9 | 7.26 | 165.2 | 4.63 |
| Square | 10 | 6.35 | 179.9 | 4.38 |
| Cosine | 10 | 6.06 | 181.5 | 4.22 |
| Linear | 16 | 7.51 | 113.2 | 3.75 |
| Square Root | 32 | 12.33 | 99.0 | 3.34 |
| Logarithmic | 60 | 29.17 | 47.9 | 3.08 |
Table 3: Ablation results on the mask scheduling functions. We report the best FID, IS, and Negative Log-Likelihood loss for each candidate scheduling function.

We conduct ablation experiments using the default setting on ImageNet 256×256.

Mask scheduling. A key design of MaskGIT is the mask scheduling function used in both training and iterative decoding. We compare the functions discussed in  3.3, visualize them in Figure 8, and summarize the results in Table 3.

We observe that concave functions generally obtain better FID and IS than linear, followed by the convex functions. While cosine and square perform similarly relative to other functions, cosine slightly edges out square in all scores, making cosine the default in our model.

We hypothesize that concave functions perform favorably because they 1) challenge training with more difficult cases (i.e. encouraging larger mask ratios), and 2) appropriately prioritize the less-to-more prediction throughout the decoding. That said, over-prioritization seems to be costly as well, as shown by the cubic function being worse than square, and exponential being much worse than all other concave functions.

Figure 8: Choices of mask scheduling functions $\gamma(\frac{t}{T})$ and number of iterations $T$. On the left, we visualize the seven functions we consider for $\gamma$. On the right, we show line graphs of models' FID scores against the number of decoding iterations $T$. Among the candidates, we find that cosine achieves the best FID.

Iteration number. We study the effect of the number of iterations ($T$) on our model by running all candidate masking functions with different values of $T$. As shown in Figure 8, under the same setting, more iterations are not necessarily better: as $T$ increases, aside from the logarithmic function which performs poorly throughout, all other functions hit a “sweet spot” where the model's performance peaks before it worsens again. The sweet spot also gets “delayed” as functions get less concave. As shown, among the functions that achieve strong FIDs (i.e. cosine, square, and linear), cosine not only has the strongest overall score, but also the earliest sweet spot at a total of 8 to 12 iterations. We hypothesize that such sweet spots exist because too many iterations may discourage the model from keeping less confident predictions, which worsens token diversity. We think further study of the masking design would be interesting future work.

5 Conclusion

In this paper, we propose MaskGIT, a novel image synthesis paradigm using a bidirectional transformer decoder. Trained on Masked Visual Token Modeling, MaskGIT learns to generate samples using an iterative decoding process within a constant number of iterations. Experimental results show that MaskGIT significantly outperforms the state-of-the-art transformer model on conditional image generation, and our model is readily extendable to various image manipulation tasks. As MaskGIT achieves competitive performance with state-of-the-art GANs, applying our approach to other synthesis tasks is a promising direction for future work. Please see the appendix F for the limitations and future work.

Acknowledgement The authors would like to thank Xiang Kong for inspiring related works and anonymous reviewers for helpful comments.

References

  • [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
  • [2] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEit: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.
  • [3] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), Aug. 2009.
  • [4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
  • [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
  • [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  • [7] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
  • [8] Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, and Ming-Hsuan Yang. In&out: Diverse image outpainting via gan inversion. arXiv preprint arXiv:2104.00675, 2021.
  • [9] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space, 2019.
  • [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, NAACL-HLT, 2019.
  • [12] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [14] Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis, 2021.
  • [15] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  • [16] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models, 2019.
  • [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [18] Google. Tfhub model of boundless. https://tfhub.dev/google/boundless/half/1, 2021.
  • [19] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.
  • [20] Jiatao Gu and Xiang Kong. Fully non-autoregressive neural machine translation: Tricks of the trade, 2020.
  • [21] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [24] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021.
  • [25] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration, 2019.
  • [26] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [27] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
  • [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [29] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [30] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.
  • [31] Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, and Ming-Hsuan Yang. InfinityGAN: Towards infinite-resolution image synthesis. arXiv preprint arXiv:2104.03963, 2021.
  • [32] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations, 2021.
  • [33] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672, 2021.
  • [34] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
  • [35] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
  • [36] Suman V. Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. In NeurIPS, 2019.
  • [37] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, 2019.
  • [38] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models, 2021.
  • [39] Kim Seonghyeon. Implementation of "Generating diverse high-fidelity images with VQ-VAE-2" in PyTorch. https://github.com/rosinality/vq-vae-2-pytorch, 2020.
  • [40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [41] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
  • [42] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [43] Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, and William T. Freeman. Boundless: Generative adversarial networks for image extension. In ICCV, 2019.
  • [44] Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and Weilong Yang. Regularizing generative adversarial networks under limited data. In CVPR, 2021.
  • [45] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
  • [46] Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with PixelCNN decoders. In NeurIPS, 2016.
  • [47] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
  • [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [49] Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao. High-fidelity pluralistic image completion with transformers. arXiv preprint arXiv:2103.14031, 2021.
  • [50] Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, and Zhan Xu. Contextual residual aggregation for ultra high-resolution image inpainting. In CVPR, 2020.
  • [51] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. arXiv preprint arXiv:2110.04627, 2021.
  • [52] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. In ICCV, 2019.
  • [53] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
  • [54] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [55] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [56] Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, and Hongxia Yang. UFC-BERT: Unifying multi-modal controls for conditional image synthesis. In NeurIPS, 2021.
  • [57] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In ICLR, 2021.
  • [58] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[Figure 9 image grid, two examples: row 1 shows the original images; rows 2, 3, and 4 show the example input mask, a reconstruction sample, and the median of 100 samples, for mask ratios of 95%, 90%, 85%, and 75%.]
Figure 9: Examples of MaskGIT on Image Reconstruction. MaskGIT takes in masked tokens extracted from original images (row one) using random input masks (row two, with unknown tokens marked in light gray), and outputs reconstructed images (row three). We then randomly sample 100 masks with the same mask ratio, and illustrate the median of the 100 reconstructed samples in row four.

Appendix A Discussion on Image Reconstruction

In 4.2, we primarily evaluate MaskGIT on class-conditional image generation tasks. Here we offer more discussion of its performance on image reconstruction. We set up the experiment by first randomly sampling an input mask M with a mask ratio r of the visual tokens masked out, and then running MaskGIT's iterative decoding algorithm to reconstruct the images. Figure 10 shows the PSNR and LPIPS [55] of the reconstructed samples as functions of r, whereas Figure 9 visualizes two examples of this process with r ranging from 95% to 75%.
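For concreteness, below is a minimal Python sketch of this reconstruction protocol. The tokenizer and iterative decoder (tokenizer.encode, tokenizer.decode, maskgit.iterative_decode) are hypothetical stand-ins for the VQ encoder/decoder and our decoding algorithm; only the random token mask and the PSNR/LPIPS [55] metrics are concrete.

import torch
import lpips  # pip install lpips; LPIPS metric of Zhang et al. [55]

lpips_fn = lpips.LPIPS(net='vgg')  # LPIPS with VGG features

def random_token_mask(num_tokens, mask_ratio):
    # Boolean mask M with `mask_ratio` of the visual tokens masked out (True = masked).
    num_masked = int(round(mask_ratio * num_tokens))
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

def psnr(x, y):
    # PSNR between two images scaled to [-1, 1] (peak-to-peak range 2.0).
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(4.0 / mse)

def reconstruct_and_score(image, tokenizer, maskgit, mask_ratio=0.9):
    tokens = tokenizer.encode(image)                        # flattened visual tokens
    mask = random_token_mask(tokens.numel(), mask_ratio)
    recon_tokens = maskgit.iterative_decode(tokens, mask)   # fill in the masked tokens
    recon = tokenizer.decode(recon_tokens)                  # back to pixels in [-1, 1]
    return psnr(image, recon), lpips_fn(image, recon)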

We observe that MaskGIT reconstructs holistic information (e.g. the pose and shape of the foreground objects) even with a very high percentage (e.g. 95%) of tokens masked out. More importantly, there appears to be an inflection point around 90%: both reconstruction quality and consistency improve drastically as the mask ratio decreases toward 90%, but below 90% further improvements slow down. This observation is corroborated by the large jump in visual similarity between the reconstructed samples and the original images from 95% to 90% in Figure 9; e.g. the fence in front of the tiger and the car's color are consistently captured once the mask ratio is below 90%, but not at 95%.

In other words, we find that visual tokens are highly redundant. For a holistic reconstruction, only a very small portion (e.g. 10%) of the tokens is essential; the remaining ones merely improve the recovery of finer appearance and details. This echoes the intuition behind the masking design laid out in 3.3: the prediction of the first few tokens is key to image generation. Similar observations on the spatial redundancy of images are discussed in the concurrent work MAE [21], which finds that masking a high proportion of the input image yields a nontrivial and meaningful self-supervisory task for image representation learning.

Figure 10: Reconstruction quality and diversity measured by PSNR and LPIPS[55].

Appendix B Additional Class-conditional Image Generation Results

In this section, we report additional results on class-conditional image generation.

We follow prior transformer-based methods [37, 15] in employing classifier-based rejection sampling to improve the sample quality scores. Specifically, we use a pre-trained ResNet classifier [22] to score output samples based on the predicted probability and keep samples with an acceptance rate of 0.05, as in VQGAN [15]. As shown in Table 4, MaskGIT demonstrates consistent improvement over VQGAN, and is comparable with ADM with classifier guidance [12]. More importantly, by adding the rejection sampling, MaskGIT achieves state-of-the-art Inception Scores (355.6 on 256×256 and 342.0 on 512×512).
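A minimal sketch of this rejection-sampling step is shown below, assuming a pre-trained torchvision ResNet-50 in place of the ResNet classifier [22] and a hypothetical generate_samples function that returns ImageNet-normalized images for a given class.

import torch
import torch.nn.functional as F
from torchvision.models import resnet50

classifier = resnet50(pretrained=True).eval()  # stand-in for the pre-trained ResNet [22]

@torch.no_grad()
def rejection_sample(generate_samples, class_id, num_keep, acceptance_rate=0.05):
    # Generate 1 / acceptance_rate times more samples than we keep, score them by the
    # classifier's predicted probability of the target class, and keep only the top fraction.
    num_candidates = int(num_keep / acceptance_rate)
    images = generate_samples(class_id, num_candidates)         # (N, 3, H, W), ImageNet-normalized
    probs = F.softmax(classifier(images), dim=-1)[:, class_id]  # P(class_id | image)
    keep = probs.argsort(descending=True)[:num_keep]            # top 5% by predicted probability
    return images[keep]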

In Table 5, we report Precision and Recall scores calculated using Inception features [42]. In contrast to the VGG [40] feature-based scores, which we report in Table 1 for a more direct comparison with prior work [30, 12], we find that the Inception feature-based scores are more consistent with our qualitative observation that VQGAN's samples are more diverse than BigGAN's. Under both measures, MaskGIT's recall scores outperform those of BigGAN and VQGAN. We also report CAS evaluated on classifiers trained without augmentation from RandAugment [9]. Consistent with our main results, MaskGIT outperforms BigGAN and our baseline VQGAN by a large margin.
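As a reference point, the improved precision/recall metric [30] can be sketched as below; real_feats and fake_feats are assumed to be precomputed feature embeddings (Inception [42] or VGG [40] features), and k = 3 follows the neighborhood size used in [30].

import numpy as np

def kth_nn_radii(feats, k=3):
    # Distance from each feature to its k-th nearest neighbor within the same set
    # (column 0 of the sorted distances is the distance to itself).
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]

def manifold_coverage(query, ref, radii):
    # Fraction of query features that fall inside at least one reference hypersphere.
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats, fake_feats, k=3):
    precision = manifold_coverage(fake_feats, real_feats, kth_nn_radii(real_feats, k))
    recall = manifold_coverage(real_feats, fake_feats, kth_nn_radii(fake_feats, k))
    return precision, recall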

Finally, we show a few comparisons of the class-conditional samples generated by MaskGIT with samples generated by BigGAN-deep and VQVAE-2 in Figures 11, 12, and 13.

Dataset          | Model      | Classifier guidance  | FID  | IS
ImageNet 256×256 | ADM [12]   | 1.0 guidance         | 4.59 | 186.70
                 | VQGAN [15] | 0.05 acceptance rate | 5.88 | 304.8
                 | MaskGIT    | 0.05 acceptance rate | 4.02 | 355.6
ImageNet 512×512 | ADM [12]   | 1.0 guidance         | 7.72 | 172.71
                 | MaskGIT    | 0.05 acceptance rate | 4.46 | 342.0
Table 4: Class-conditional image synthesis on ImageNet for methods with classifier guidance.
Model           | Prec ↑ | Rec ↑ | CAS Top-1 ×100 ↑ (73.1) | CAS Top-5 ×100 ↑ (91.5)
BigGAN-deep [4] | 0.82   | 0.27  | 42.65                   | 65.92
VQGAN†          | 0.61   | 0.47  | 47.50                   | 68.90
MaskGIT (Ours)  | 0.78   | 0.50  | 58.20                   | 79.65
Table 5: More quantitative comparison with BigGAN-deep and our baseline VQGAN on ImageNet 256×256. † denotes the model we train with the same architecture and setup as ours.
[Figure 11 image grid: sample sheets from BigGAN-deep (FID=6.95), VQVAE-2* (FID=31), and MaskGIT (FID=6.18).]
Figure 11: More diversity comparisons between BigGAN-deep with truncation 1.0, VQVAE-2 [37], and our proposed method MaskGIT on ImageNet. * represents extracted samples from the paper.
[Figure 12 image grid: sample sheets from BigGAN-deep (FID=6.95), VQVAE-2* (FID=31), and MaskGIT (FID=6.18).]
Figure 12: More diversity comparisons between BigGAN-deep with truncation 1.0, VQVAE-2 [37], and our proposed method MaskGIT on ImageNet. * represents extracted samples from the paper.
[Figure 13 image grid: sample sheets from BigGAN-deep (FID=6.95), VQVAE-2* (FID=31), and MaskGIT (FID=6.18).]
Figure 13: More diversity comparisons between BigGAN-deep with truncation 1.0, VQVAE-2 [37], and our proposed method MaskGIT on ImageNet. * represents extracted samples from the paper.

Appendix C Additional Examples of Class-conditional Image Editing Applications

[Figure 14 image grid: the top row shows the input images; the rows below show edited samples for the classes Goldfish [001], Ice Bear [296], Agaric [992], Lorikeet [90], Train [829], and Tiger [292].]
Figure 14: More Examples of Class-conditional Image Editing. In each column, the bottom images are synthesized using the image on the top, ImageNet class labels on the left, and a bounding box of the main object downsampled into latent space (as shown in the second row).
[Figure 15 image pairs, six examples: Input (left) and the panorama extrapolated by MaskGIT (Ours) (right).]
Figure 15: More Samples of Horizontal Image Extrapolation (from 512×256 to 512×2304). The synthesized "panoramas" are created by repeatedly applying MaskGIT's outpainting abilities horizontally in both directions.

We show more examples of class-conditional image editing in Figure 14, and examples of image-conditional panorama synthesis in Figure 15.
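The panorama synthesis in Figure 15 amounts to a sliding-window loop over the token grid; a minimal sketch (extending to the right only) is given below, where maskgit.iterative_decode and the MASK_ID placeholder are hypothetical stand-ins for our decoder and the unknown-token id.

import torch

MASK_ID = -1  # hypothetical placeholder id for an unknown token

def extend_right(tokens, maskgit, steps):
    # tokens: (H, W) grid of visual tokens; each step appends W // 2 new columns.
    h, w = tokens.shape
    for _ in range(steps):
        context = tokens[:, -w // 2:]                                   # rightmost half as context
        unknown = torch.full((h, w // 2), MASK_ID, dtype=tokens.dtype)  # region to synthesize
        window = torch.cat([context, unknown], dim=1)                   # one (H, W) decoding window
        window = maskgit.iterative_decode(window, window == MASK_ID)
        tokens = torch.cat([tokens, window[:, w // 2:]], dim=1)         # keep only the new columns
    return tokens

# Extending to the left mirrors this loop with the leftmost half as context.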

Appendix D Image Outpainting Comparisons with SOTA Transformer-based Approaches

In Figures 16 and 17, we show a few outpainting comparisons among MaskGIT, ImageGPT [7], and VQGAN [15]. In each set of images, we show the groundtruth (left), extrapolated samples using only the top half of the groundtruth (middle), and extrapolated samples using only the bottom half of the groundtruth (right).

MaskGIT and VQGAN can both operate at higher resolutions by taking advantage of tokenization, and thus achieve higher sample fidelity than ImageGPT, which runs at a maximum resolution of 192×192. At the same time, MaskGIT demonstrates stronger flexibility than ImageGPT and VQGAN in that it can outpaint in arbitrary directions (e.g. both upward and downward), while ImageGPT and VQGAN can only handle outpainting in one direction with a single model due to their autoregressive nature.
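To make the directionality point concrete, the sketch below builds the token masks for the two outpainting directions used in Figures 16 and 17; a bidirectional model can consume either mask, whereas a raster-order autoregressive model can only condition on the top half.

import torch

def half_mask(grid_h, grid_w, direction):
    # Boolean mask over a (grid_h, grid_w) token grid; True marks tokens to predict.
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    if direction == "down":      # top half is known, synthesize the bottom half
        mask[grid_h // 2:, :] = True
    elif direction == "up":      # bottom half is known, synthesize the top half
        mask[:grid_h // 2, :] = True
    else:
        raise ValueError(direction)
    return mask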

[Figure 16 image grid, two examples. Columns: Groundtruth, Outpaint bottom 50%, Outpaint top 50%. Rows: ImageGPT [7], VQGAN [15], MaskGIT (Ours).]
Figure 16: Outpainting comparisons with the pixel-based approach ImageGPT[7] and the transformer-based approach VQGAN[15].
[Figure 17 image grid, two examples. Columns: Groundtruth, Outpaint bottom 50%, Outpaint top 50%. Rows: ImageGPT [7], VQGAN [15], MaskGIT (Ours).]
Figure 17: Outpainting comparisons with the pixel-based approach ImageGPT[7] and the transformer-based approach VQGAN[15].

Appendix E Image Inpainting and Outpainting Comparisons with SOTA GAN-based Approaches

In this section, we show more qualitative comparisons with state-of-the-art GAN-based image completion methods in Figure 18 and Figure 19. Quantitative results have been discussed in 4.3.

We find that, compared to prior GAN-based methods, MaskGIT demonstrates a stronger capability of completing structures coherently, and its samples contain fewer artifacts. In Figure 19, MaskGIT completes the bridge in row two and the building in the second-to-last row, both of which the GAN methods struggle to do.

[Figure 18 image grid, seven examples. Columns: Input, DeepFillv2 [52], HiFill [50], CoModGAN [57], MaskGIT (Ours), Groundtruth.]
Figure 18: More visual comparisons on image inpainting on Places2[58] with state-of-the-art GAN methods.

In addition, we compare with CoModGAN on image completion tasks with large masking ratios, i.e. conditioning on the center 50%×50% and the center 31.25%×31.25% regions respectively, which are challenging cases for traditional GANs. Examples are shown in Figure 20.
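This large-mask setup can be expressed as a center-keep mask over the token grid, as in the minimal sketch below (keep_frac = 0.5 corresponds to conditioning on the center 50%×50% region, and keep_frac = 0.3125 to the center 31.25%×31.25% region).

import torch

def center_keep_mask(grid_h, grid_w, keep_frac):
    # Boolean mask over the token grid; True marks tokens to predict (outside the center).
    kh, kw = int(round(grid_h * keep_frac)), int(round(grid_w * keep_frac))
    top, left = (grid_h - kh) // 2, (grid_w - kw) // 2
    mask = torch.ones(grid_h, grid_w, dtype=torch.bool)
    mask[top:top + kh, left:left + kw] = False  # center tokens are kept as context
    return mask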

[Figure 19 image grid, three examples. Columns: Input, Boundless [43], InfinityGAN [31]✳, CoModGAN, MaskGIT (Ours), Groundtruth.]
Figure 19: More visual comparisons on image outpainting with state-of-the-art GAN methods. ✳ samples are graciously provided by the authors.
[Figure 20 image grid, six examples. Columns: Input, CoModGAN, MaskGIT (Ours).]
Figure 20: Visual comparisons of outpainting with CoModGAN [57] on large outpainting masks.

Appendix F Limitations and Failure Cases

In Figure 21, we show several limitations and failure cases of our approach. (A) and (B) are examples of semantic and color shifts in MaskGIT's outpainting results. Due to its limited attention size, MaskGIT may "forget" the synthesized semantics or color at one end while it is outpainting the other end. (C) and (D) show cases where our approach may sometimes ignore or modify objects on the boundary when applied to outpainting and inpainting. (E) showcases MaskGIT's failure mode in which it oversmooths or creates undesired artifacts on complex structures such as human faces, text, and symmetric objects. Improving these cases remains future work.

[Figure 21 image grid. (A), (B): Input | Our Outpainting Samples. (C): Input | Our Outpainting Samples | Groundtruth. (D): Input | Our Inpainting Samples | Groundtruth. (E): Our Class-conditional Samples.]
Figure 21: Limitations and Failure Cases.