StyleSwin: Transformer-based GAN for High-resolution Image Generation
Bowen Zhang$^{1*}$  Shuyang Gu$^{1}$  Bo Zhang$^{2\dagger}$  Jianmin Bao$^{2}$  Dong Chen$^{2}$  Fang Wen$^{2}$  Yong Wang$^{1}$  Baining Guo$^{2}$

$^{1}$University of Science and Technology of China  $^{2}$Microsoft Research Asia
Figure 1. Image samples generated by our StyleSwin on FFHQ $1024 \times 1024$ and LSUN Church $256 \times 256$, respectively.
Abstract
Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated on-par ability with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts the Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and the fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., $1024 \times 1024$. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the promise of using transformers for high-resolution image generation. The code and pretrained models are available at https://github.com/microsoft/StyleSwin.
1. Introduction
The state of image generative modeling has seen dramatic advancement in recent years, among which generative adversarial networks (GANs) [14, 41] offer arguably the most compelling quality on synthesizing high-resolution images. While early attempts focus on stabilizing the training dynamics via proper regularization [15, 16, 36, 46, 47] or adversarial loss designs [2, 25, 39, 45], remarkable performance leaps in recent prominent works mainly attribute to architectural modifications that aim for stronger modeling capacity, such as adopting self-attention [66], aggressive model scaling [4], or style-based generators [29, 30]. Recently, drawn by the broad success of transformers in discriminative models [11, 32, 43], a few works [24, 37, 62, 67] attempt to use pure transformers to build generative networks in the hope that the increased expressivity and the ability to model long-range dependencies can benefit the generation of complex images, yet high-quality image generation, especially at high resolutions, remains challenging.
This paper aims to explore the key ingredients when using transformers to constitute a competitive GAN for high-resolution image generation. The first obstacle is to tame the quadratic computational cost so that the network is scalable
to high resolutions, e.g., $1024 \times 1024$. We propose to leverage Swin transformers [43] as the basic building block since the window-based local attention strikes a balance between computational efficiency and modeling capacity. As such, we could take advantage of the increased expressivity to characterize all the image scales, as opposed to reducing to point-wise multi-layer perceptrons (MLPs) for higher scales [67], and the synthesis is scalable to high resolution, e.g., $1024 \times 1024$, with delicate details. Besides, the local attention introduces a locality inductive bias, so there is no need for the generator to relearn the regularity of images from scratch. These merits make a simple transformer network substantially outperform the convolutional baseline.
In order to compete with the state of the art, we further propose three instrumental architectural adaptations. First, we strengthen the generative model capacity by employing the local attention in a style-based architecture [29], during which we empirically compare various style injection approaches for our transformer GAN. Second, we propose double attention in order to enlarge the limited receptive field brought by the local attention, where each layer attends to both the local and the shifted windows, effectively improving the generator capacity without much computational overhead. Moreover, we notice that Conv-based GANs implicitly utilize zero padding to infer the absolute positions, a crucial clue for generation, yet such a feature is missing in window-based transformers. We propose to fix this by introducing sinusoidal positional encoding [52] to each layer such that absolute positions can be leveraged for image synthesis. Equipped with the above techniques, the proposed network, dubbed StyleSwin, starts to show advantageous generation quality at $256 \times 256$ resolution.
Nonetheless, we observe blocking artifacts when synthesizing high-resolution images. We conjecture that these disturbing artifacts arise because computing the attention independently in a block-wise manner breaks the spatial coherency. That is, while proven successful in discriminative tasks [43, 56], the block-wise attention requires special treatment when applied in synthesis networks. To tackle these blocking artifacts, we empirically investigate various solutions, among which we find that a wavelet discriminator [13] examining the artifacts in the spectral domain could effectively suppress the artifacts, making our transformer-based GAN yield visually pleasing outputs.
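To make the spectral check concrete, the sketch below shows a one-level Haar wavelet decomposition of an image into one low-frequency and three high-frequency sub-bands; a discriminator that also sees the high-frequency bands can penalize the excess high-frequency energy that blocking artifacts leave along window boundaries. This is a minimal illustration under our own naming, not the exact wavelet discriminator of [13] nor our released implementation.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One level of the 2D Haar wavelet transform.

    x: (B, C, H, W) image batch with even H and W.
    Returns four sub-bands of shape (B, C, H/2, W/2): a low-frequency
    approximation and three high-frequency details, where blocking
    artifacts show up as spurious energy aligned with window borders.
    """
    a = x[:, :, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 4  # low-pass in both directions
    lh = (a - b + c - d) / 4  # horizontal detail
    hl = (a + b - c - d) / 4  # vertical detail
    hh = (a - b - c + d) / 4  # diagonal detail
    return ll, lh, hl, hh
```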
The proposed StyleSwin achieves state-of-the-art quality on multiple established benchmarks, e.g., FFHQ, CelebA-HQ, and LSUN Church. In particular, our approach shows compelling quality on high-resolution image synthesis (Figure 1), achieving competitive quantitative performance relative to the leading ConvNet-based methods without complex training strategies. On CelebA-HQ 1024, our approach achieves an FID of 4.43, outperforming all the prior works including StyleGAN [29]; whereas on FFHQ-1024, we
obtain an FID of 5.07, approaching the performance of StyleGAN2 [30].
2. Related Work
High-resolution image generation. Image generative modeling has improved rapidly in the past decade [14, 19, 34, 35, 41, 55]. Among various solutions, generative adversarial networks (GANs) offer competitive generation quality. While early methods [2, 47, 49] focus on stabilizing the adversarial training, recent prominent works [4, 28-30] rely on designing architectures with enhanced capacity, which considerably improves generation quality. However, contemporary GAN-based methods adopt convolutional backbones, which are now deemed inferior to transformers in terms of modeling capacity. In this paper, we are interested in applying the emerging vision transformers to GANs for high-resolution image generation.
Vision transformers. The recent success of transformers [5, 57] in NLP tasks has inspired the research of vision transformers. The seminal work ViT [11] proposes a pure transformer-based architecture for image classification and demonstrates the great potential of transformers for vision tasks. Later, transformers dominate the benchmarks in a broad range of discriminative tasks [10, 17, 43, 53, 56, 59, 60, 64]. However, the self-attention in transformer blocks brings quadratic computational complexity, which limits its application to high-resolution inputs. A few recent works [10, 43, 56] tackle this problem by proposing to compute self-attention in local windows, so that linear computational complexity can be achieved. Moreover, the hierarchical architecture makes them suitable to serve as general-purpose backbones.
Transformer-based GANs. Recently, the research community has begun to explore using transformers for generative tasks in the hope that the increased expressivity can benefit the generation of complex images. One natural way is to use transformers to synthesize pixels in an auto-regressive manner [6, 12], but the slow inference speed limits their practical usage. Recently, a few works [24, 37, 62, 67] attempt to propose transformer-based GANs, yet most of these methods only support synthesis up to $256 \times 256$ resolution. Notably, HiT [67] successfully generates $1024 \times 1024$ images at the cost of reducing to MLPs in its high-resolution stages, and is hence unable to synthesize high-fidelity details like the Conv-based counterpart [29]. In comparison, our StyleSwin can synthesize fine structures using transformers, leading to quality comparable to the leading ConvNets on high-resolution synthesis.
3. Method
3.1. Transformer-based GAN architecture
We start from a simple generator architecture, as shown in Figure 2(a), which receives a latent variable $\boldsymbol{z} \sim \mathcal{N}(0, \boldsymbol{I})$ and progressively upsamples the feature maps through a cascade of transformer blocks.
Due to the quadratic computational complexity, it is unaffordable to compute full attention on high-resolution feature maps. We believe that local attention is a good way to achieve a trade-off between computational efficiency and modeling capacity. We adopt the Swin transformer [43] as the basic building block, which computes multi-head self-attention (MSA) [57] locally in non-overlapping windows. To advocate the information interaction across adjacent windows, the Swin transformer uses a shifted window partition in alternating blocks. Specifically, given the input feature map $\boldsymbol{x}^{l} \in \mathbb{R}^{H \times W \times C}$ of layer $l$, the consecutive Swin blocks operate as follows:

$$\hat{\boldsymbol{x}}^{l} = \text{W-MSA}\left(\text{LN}\left(\boldsymbol{x}^{l}\right)\right) + \boldsymbol{x}^{l}, \qquad \boldsymbol{x}^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{\boldsymbol{x}}^{l}\right)\right) + \hat{\boldsymbol{x}}^{l},$$
$$\hat{\boldsymbol{x}}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(\boldsymbol{x}^{l+1}\right)\right) + \boldsymbol{x}^{l+1}, \qquad \boldsymbol{x}^{l+2} = \text{MLP}\left(\text{LN}\left(\hat{\boldsymbol{x}}^{l+1}\right)\right) + \hat{\boldsymbol{x}}^{l+1}, \tag{1}$$
where W-MSA and SW-MSA denote the window-based
multi-head self-attention under the regular and shifted window partitioning, respectively, and LN stands for layer normalization. Since such block-wise attention induces linear computational complexity relative to the image size, the network is scalable to high-resolution generation, where the fine structures can be modeled by these capable transformers as well.
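To make the block-wise computation concrete, the following PyTorch-style sketch partitions a feature map into non-overlapping $\kappa \times \kappa$ windows, runs multi-head self-attention inside each window, and merges the windows back; cyclically shifting the map before partitioning gives the SW-MSA variant. The helper names are illustrative rather than the actual repository API, and the attention mask that full Swin applies after shifting is omitted for brevity.

```python
import torch
import torch.nn as nn

def window_partition(x, k):
    """(B, H, W, C) -> (num_windows * B, k * k, C): split into k x k windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // k, k, W // k, k, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, C)

def window_reverse(windows, k, H, W):
    """(num_windows * B, k * k, C) -> (B, H, W, C): undo window_partition."""
    B = windows.shape[0] // (H * W // (k * k))
    x = windows.view(B, H // k, W // k, k, k, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowMSA(nn.Module):
    """Multi-head self-attention restricted to local (optionally shifted) windows."""
    def __init__(self, dim, window_size=8, num_heads=4, shift=0):
        super().__init__()
        self.k, self.shift = window_size, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.shift:                         # SW-MSA: cyclic shift before partitioning
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        w = window_partition(x, self.k)
        w, _ = self.attn(w, w, w)              # attention never crosses window borders
        x = window_reverse(w, self.k, H, W)
        if self.shift:                         # undo the cyclic shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return x
```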
Since the discriminator severely affects the stability of adversarial training, we opt to use a Conv-based discriminator directly from [29]. In our experiments, we find that simply replacing the convolutions with transformer blocks under this baseline architecture already yields more stable training, thanks to the improved model capacity. However, such a naive architecture cannot make our transformer-based GAN compete with the state of the art, so we conduct further studies, which we introduce as follows.
Table 1. Comparison of different style injection methods on FFHQ-256. The style injection methods considerably improve the FID, among which AdaIN leads to the best generation quality.

Style injection. We first strengthen the model capability by adapting the generator to a style-based architecture [29, 30], as shown in Figure 2(b). We learn a non-linear mapping $f: \mathcal{Z} \rightarrow \mathcal{W}$ to map the latent code $\boldsymbol{z}$ from the $\mathcal{Z}$ space to the $\mathcal{W}$ space, which specifies the styles that are injected into the main synthesis network. We investigate the following style injection approaches:
AdaNorm modulates the statistics (i.e., mean and variance) of feature maps after normalization. We study multiple normalization variants, including instance normalization (IN) [54], batch normalization (BN) [21], layer normalization (LN) [3], and the recently proposed RMSNorm [65]. Since RMSNorm removes the mean-centering of LN, we only predict the variance from the $\mathcal{W}$ code.
Modulated MLP: Instead of modulating feature maps, one can also modulate the weights of linear layers. Specifically, we rescale the channel-wise weight magnitude of the feed-forward network (FFN) within transformer blocks. According to [30], such style injection admits a faster speed than AdaNorm.
Cross-attention: Motivated by the decoder transformer [57], we explore a transformer-specific style injection in which the transformers additionally attend to the style tokens derived from the $\mathcal{W}$ space. The effectiveness of this cross-attention is also validated in [67].
Table 1 shows that all the above style injection methods significantly boost the generative modeling capacity, except that training with AdaBN does not converge because the batch size is compromised for high-resolution synthesis. In comparison, AdaNorm brings more sufficient style injection, possibly because the network could take advantage of the style information twice, in both the attention block and the FFN, whereas the modulated MLP and cross-attention make use of the style information only once. We did not further study the hybrid of modulated MLP and cross-attention due to efficiency considerations. Furthermore, compared to AdaBN and AdaLN, AdaIN offers finer and more sufficient feature modulation as feature maps are normalized and modulated independently, so we choose AdaIN by default for our following experiments.
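As a concrete reference, below is a minimal sketch of AdaIN-style injection for transformer tokens: the $\mathcal{W}$ code predicts a per-channel scale and bias that modulate the instance-normalized features. In a style-based block this replaces the plain layer normalization before both the attention and the FFN, so the style is used twice per block. The class and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization driven by a style code w."""
    def __init__(self, dim, w_dim=512):
        super().__init__()
        self.style = nn.Linear(w_dim, 2 * dim)   # predicts per-channel (scale, bias)

    def forward(self, x, w):
        # x: (B, N, C) tokens of one resolution, w: (B, w_dim) mapped latent code
        scale, bias = self.style(w).chunk(2, dim=-1)             # (B, C) each
        mean = x.mean(dim=1, keepdim=True)                       # per-sample, per-channel stats
        std = x.std(dim=1, keepdim=True) + 1e-8                  # over the spatial tokens
        x = (x - mean) / std                                     # instance normalization
        return x * (1 + scale.unsqueeze(1)) + bias.unsqueeze(1)  # style modulation
```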
Double attention. Using local attention, nonetheless, sacrifices the ability to model long-range dependencies, which is pivotal to capturing geometry [4, 66]. Let the window size
used by the Swin block be $\kappa \times \kappa$; then, due to the shifted window strategy, the receptive field increases by $\kappa$ in each dimension with one more Swin block. Suppose we use Swin blocks to process a $64 \times 64$ feature map and we by default choose $\kappa = 8$; then it takes $64 / \kappa = 8$ transformer blocks to span the entire feature map. To enlarge the receptive field more efficiently, we propose double attention, in which half of the attention heads attend to the regular (local) windows and the other half to the shifted windows, and the outputs of all heads are concatenated:
" Double-Attention "=" Concat "(head_(1),dots," head "_(h))W^(O)\text { Double-Attention }=\text { Concat }\left(\operatorname{head}_{1}, \ldots, \text { head }_{h}\right) \boldsymbol{W}^{O}
where $\boldsymbol{W}^{O} \in \mathbb{R}^{C \times C}$ is the projection matrix used to mix the heads into the output. The attention heads in Equation 2 can be computed as:

$$\text{head}_{i} = \begin{cases} \text{Attention}\left(\boldsymbol{x}_{w} \boldsymbol{W}_{i}^{Q}, \boldsymbol{x}_{w} \boldsymbol{W}_{i}^{K}, \boldsymbol{x}_{w} \boldsymbol{W}_{i}^{V}\right), & i \leq h / 2 \\ \text{Attention}\left(\boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{Q}, \boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{K}, \boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{V}\right), & \text{otherwise} \end{cases} \tag{3}$$
where $\boldsymbol{x}_{w}$ and $\boldsymbol{x}_{sw}$ denote the features partitioned under the regular and the shifted window configuration, and $\boldsymbol{W}_{i}^{Q}, \boldsymbol{W}_{i}^{K}, \boldsymbol{W}_{i}^{V} \in \mathbb{R}^{C \times (C / h)}$ are the query, key, and value projection matrices for the $i$-th head, respectively. One can derive that the receptive field in each dimension increases by $2.5\kappa$ with one additional double attention block, which allows capturing larger context more efficiently. Still, for a $64 \times 64$ input, it now takes 4 transformer blocks to cover the entire feature map.
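The sketch below illustrates one plausible realization of double attention on top of the window attention sketched earlier: the heads are split into two groups, one operating on the regular window partition and the other on the shifted partition, and the concatenated outputs are mixed by a projection playing the role of $\boldsymbol{W}^{O}$. For brevity the channels are split between the two branches rather than letting every head project from the full $C$ dimensions as in Equation 3; all names are illustrative.

```python
import torch
import torch.nn as nn

class DoubleAttention(nn.Module):
    """Half of the heads attend within regular windows, half within shifted windows."""
    def __init__(self, dim, window_size=8, num_heads=8):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        # WindowMSA is the window attention module from the earlier sketch.
        self.regular = WindowMSA(dim // 2, window_size, num_heads // 2, shift=0)
        self.shifted = WindowMSA(dim // 2, window_size, num_heads // 2,
                                 shift=window_size // 2)
        self.proj = nn.Linear(dim, dim)        # plays the role of W^O

    def forward(self, x):                      # x: (B, H, W, C)
        x_w, x_sw = x.chunk(2, dim=-1)         # features for the two branches
        out = torch.cat([self.regular(x_w), self.shifted(x_sw)], dim=-1)
        return self.proj(out)                  # each pixel now mixes both context types
```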
Local-global positional encoding. The relative positional encoding (RPE) adopted by the default Swin blocks encodes the relative positions of pixels and has proven crucial for discriminative tasks [9, 43]. Theoretically, a multi-head local attention layer with RPE can express any convolutional layer with window-sized kernels [8, 38]. However, when substituting the convolutional layers with transformers that use RPE, one thing is rarely noticed: ConvNets could infer the absolute positions by leveraging the clues from the zero paddings [22, 31], yet such a feature is missing in Swin blocks using RPE. On the other hand, it is essential to let the generator be aware of the absolute positions because the synthesis of a specific component, e.g., the mouth, highly depends on its spatial coordinate [1, 40].
Figure 3. Blocking artifacts become obvious at $1024 \times 1024$ resolution. These artifacts correlate with the window size of the local attention.

In view of this, we propose to introduce sinusoidal positional encoding [7, 57, 61] (SPE) on each scale, as shown in Figure 2(b). Specifically, after the scale upsampling, the
feature maps are added with the following encoding:

$$\operatorname{SPE}(i, j) = \big[\sin(\omega_{0} i), \cos(\omega_{0} i), \sin(\omega_{1} i), \cos(\omega_{1} i), \ldots, \sin(\omega_{0} j), \cos(\omega_{0} j), \sin(\omega_{1} j), \cos(\omega_{1} j), \ldots\big]$$
where $\omega_{k} = 1 / 10000^{2k}$ and $(i, j)$ denotes the 2D location. We use SPE rather than learnable absolute positional encoding [11] because SPE admits translation invariance [58]. In practice, we make the best of RPE and SPE by employing them together: the RPE applied within each transformer block offers the relative positions within the local context, whereas the SPE introduced on each scale informs the global position.
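For reference, a minimal sketch of constructing such a 2D sinusoidal encoding is given below: the first half of the channels encodes the row coordinate $i$ and the second half the column coordinate $j$, with interleaved sine and cosine terms. The frequency schedule here is normalized by the channel count following the standard Transformer convention, which is an assumption for illustration; the resulting map is simply added to the feature maps after each upsampling step.

```python
import torch

def sinusoidal_pe_2d(height, width, dim):
    """Build a (height, width, dim) 2D sinusoidal positional encoding.

    Half of the channels encode the row index i, the other half the column
    index j, each as concatenated sin/cos terms over a geometric frequency
    schedule (an illustrative layout, not necessarily the exact paper one).
    """
    assert dim % 4 == 0
    d = dim // 2                                                   # channels per axis
    omega = 1.0 / 10000 ** (2 * torch.arange(d // 2).float() / d)  # frequencies omega_k

    def encode(coord):                                             # (L,) -> (L, d)
        angles = coord[:, None] * omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    i = torch.arange(height).float()                               # row coordinates
    j = torch.arange(width).float()                                # column coordinates
    pe_i = encode(i)[:, None, :].expand(height, width, d)          # varies along rows
    pe_j = encode(j)[None, :, :].expand(height, width, d)          # varies along columns
    return torch.cat([pe_i, pe_j], dim=-1)                         # (height, width, dim)
```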