StyleSwin: Transformer-based GAN for High-resolution Image Generation
Bowen Zhang$^{1*}$  Shuyang Gu$^{1}$  Bo Zhang$^{2\dagger}$  Jianmin Bao$^{2}$  Dong Chen$^{2}$  Fang Wen$^{2}$  Yong Wang$^{1}$  Baining Guo$^{2}$

$^{1}$University of Science and Technology of China  $^{2}$Microsoft Research Asia
Figure 1. Image samples generated by our StyleSwin on FFHQ $1024 \times 1024$ and LSUN Church $256 \times 256$, respectively.
Abstract
Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated on-par ability with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts the Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and the fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., $1024 \times 1024$. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the promise of using transformers for high-resolution image generation. The code and pretrained models are available at https://github.com/microsoft/StyleSwin.
1. Introduction
The state of image generative modeling has seen dramatic advancement in recent years, among which generative adversarial networks (GANs) [14, 41] offer arguably the most compelling quality on synthesizing high-resolution images. While early attempts focus on stabilizing the training dynamics via proper regularization [15, 16, 36, 46, 47] or adversarial loss designs [2, 25, 39, 45], remarkable performance leaps in recent prominent works mainly attribute to architectural modifications that aim for stronger modeling capacity, such as adopting self-attention [66], aggressive model scaling [4], or style-based generators [29, 30]. Recently, drawn by the broad success of transformers in discriminative models [11, 32, 43], a few works [24, 37, 62, 67] attempt to use pure transformers to build generative networks in the hope that the increased expressivity and the ability to model long-range dependencies can benefit the generation of complex images, yet high-quality image generation, especially at high resolutions, remains challenging.
This paper aims to explore the key ingredients when using transformers to constitute a competitive GAN for high-resolution image generation. The first obstacle is to tame the quadratic computational cost so that the network is scalable
to high resolutions, e.g., $1024 \times 1024$. We propose to leverage Swin transformers [43] as the basic building block since the window-based local attention strikes a balance between computational efficiency and modeling capacity. As such, we could take advantage of the increased expressivity to characterize all the image scales, as opposed to reducing to point-wise multi-layer perceptrons (MLPs) for higher scales [67], and the synthesis is scalable to high resolution, e.g., $1024 \times 1024$, with delicate details. Besides, the local attention introduces a locality inductive bias, so there is no need for the generator to relearn the regularity of images from scratch. These merits make a simple transformer network substantially outperform the convolutional baseline.
In order to compete with the state of the art, we further propose three instrumental architectural adaptations. First, we strengthen the generative model capacity by employing the local attention in a style-based architecture [29], during which we empirically compare various style injection approaches for our transformer GAN. Second, we propose double attention in order to enlarge the limited receptive field brought by the local attention, where each layer attends to both the local and the shifted windows, effectively improving the generator capacity without much computational overhead. Moreover, we notice that Conv-based GANs implicitly utilize zero padding to infer the absolute positions, a crucial clue for generation, yet such a feature is missing in window-based transformers. We propose to fix this by introducing sinusoidal positional encoding [52] to each layer such that absolute positions can be leveraged for image synthesis. Equipped with the above techniques, the proposed network, dubbed StyleSwin, starts to show advantageous generation quality at $256 \times 256$ resolution.
Nonetheless, we observe blocking artifacts when synthesizing high-resolution images. We conjecture that these disturbing artifacts arise because computing the attention independently in a block-wise manner breaks the spatial coherency. That is, while proven successful in discriminative tasks [43, 56], the block-wise attention requires special treatment when applied in synthesis networks. To tackle these blocking artifacts, we empirically investigate various solutions, among which we find that a wavelet discriminator [13] examining the artifacts in the spectral domain could effectively suppress the artifacts, making our transformer-based GAN yield visually pleasing outputs.
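To make the spectral check concrete, the sketch below shows a one-level Haar wavelet decomposition of an image into one low-frequency and three high-frequency sub-bands; a discriminator that also sees the high-frequency bands can penalize the excess high-frequency energy that blocking artifacts leave along window boundaries. This is a minimal illustration under our own naming, not the exact wavelet discriminator of [13] nor our released implementation.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One level of the 2D Haar wavelet transform.

    x: (B, C, H, W) image batch with even H and W.
    Returns four sub-bands of shape (B, C, H/2, W/2): a low-frequency
    approximation and three high-frequency details, where blocking
    artifacts show up as spurious energy aligned with window borders.
    """
    a = x[:, :, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 4  # low-pass in both directions
    lh = (a - b + c - d) / 4  # horizontal detail
    hl = (a + b - c - d) / 4  # vertical detail
    hh = (a - b - c + d) / 4  # diagonal detail
    return ll, lh, hl, hh
```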
The proposed StyleSwin achieves state-of-the-art quality on multiple established benchmarks, e.g., FFHQ, CelebA-HQ, and LSUN Church. In particular, our approach shows compelling quality on high-resolution image synthesis (Figure 1), achieving competitive quantitative performance relative to the leading ConvNet-based methods without complex training strategies. On CelebA-HQ 1024, our approach achieves an FID of 4.43, outperforming all the prior works including StyleGAN [29]; whereas on FFHQ-1024, we
obtain an FID of 5.07, approaching the performance of StyleGAN2 [30].
2. Related Work
High-resolution image generation. Image generative modeling has improved rapidly in the past decade [14, 19, 34, 35, 41, 55]. Among various solutions, generative adversarial networks (GANs) offer competitive generation quality. While early methods [2, 47, 49] focus on stabilizing the adversarial training, recent prominent works [4, 28-30] rely on designing architectures with enhanced capacity, which considerably improves generation quality. However, contemporary GAN-based methods adopt convolutional backbones, which are now deemed inferior to transformers in terms of modeling capacity. In this paper, we are interested in applying the emerging vision transformers to GANs for high-resolution image generation.
Vision transformers. The recent success of transformers [5, 57] in NLP tasks has inspired the research of vision transformers. The seminal work ViT [11] proposes a pure transformer-based architecture for image classification and demonstrates the great potential of transformers for vision tasks. Later, transformers dominate the benchmarks in a broad range of discriminative tasks [10, 17, 43, 53, 56, 59, 60, 64]. However, the self-attention in transformer blocks brings quadratic computational complexity, which limits its application to high-resolution inputs. A few recent works [10, 43, 56] tackle this problem by proposing to compute self-attention in local windows, so that linear computational complexity can be achieved. Moreover, the hierarchical architecture makes them suitable to serve as general-purpose backbones.
Transformer-based GANs. Recently, the research community has begun to explore using transformers for generative tasks in the hope that the increased expressivity can benefit the generation of complex images. One natural way is to use transformers to synthesize pixels in an auto-regressive manner [6, 12], but the slow inference speed limits their practical usage. Recently, a few works [24, 37, 62, 67] attempt to propose transformer-based GANs, yet most of these methods only support synthesis up to $256 \times 256$ resolution. Notably, HiT [67] successfully generates $1024 \times 1024$ images at the cost of reducing to MLPs in its high-resolution stages, and is hence unable to synthesize high-fidelity details like the Conv-based counterpart [29]. In comparison, our StyleSwin can synthesize fine structures using transformers, leading to quality comparable to the leading ConvNets on high-resolution synthesis.
3. Method
3.1. Transformer-based GAN architecture
We start from a simple generator architecture, as shown in Figure 2(a), which receives a latent variable $\boldsymbol{z} \sim \mathcal{N}(0, \boldsymbol{I})$ and progressively upsamples the feature maps through a cascade of transformer blocks.
Due to the quadratic computational complexity, it is unaffordable to compute full attention on high-resolution feature maps. We believe that local attention is a good way to achieve a trade-off between computational efficiency and modeling capacity. We adopt the Swin transformer [43] as the basic building block, which computes multi-head self-attention (MSA) [57] locally in non-overlapping windows. To advocate the information interaction across adjacent windows, the Swin transformer uses a shifted window partition in alternating blocks. Specifically, given the input feature map $\boldsymbol{x}^{l} \in \mathbb{R}^{H \times W \times C}$ of layer $l$, the consecutive Swin blocks operate as follows:

$$\hat{\boldsymbol{x}}^{l} = \text{W-MSA}\left(\text{LN}\left(\boldsymbol{x}^{l}\right)\right) + \boldsymbol{x}^{l}, \qquad \boldsymbol{x}^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{\boldsymbol{x}}^{l}\right)\right) + \hat{\boldsymbol{x}}^{l},$$
$$\hat{\boldsymbol{x}}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(\boldsymbol{x}^{l+1}\right)\right) + \boldsymbol{x}^{l+1}, \qquad \boldsymbol{x}^{l+2} = \text{MLP}\left(\text{LN}\left(\hat{\boldsymbol{x}}^{l+1}\right)\right) + \hat{\boldsymbol{x}}^{l+1}, \tag{1}$$
where W-MSA and SW-MSA denote the window-based
multi-head self-attention under the regular and shifted window partitioning, respectively, and LN stands for layer normalization. Since such block-wise attention induces linear computational complexity relative to the image size, the network is scalable to high-resolution generation, where the fine structures can be modeled by these capable transformers as well.
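To make the block-wise computation concrete, the following PyTorch-style sketch partitions a feature map into non-overlapping $\kappa \times \kappa$ windows, runs multi-head self-attention inside each window, and merges the windows back; cyclically shifting the map before partitioning gives the SW-MSA variant. The helper names are illustrative rather than the actual repository API, and the attention mask that full Swin applies after shifting is omitted for brevity.

```python
import torch
import torch.nn as nn

def window_partition(x, k):
    """(B, H, W, C) -> (num_windows * B, k * k, C): split into k x k windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // k, k, W // k, k, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, C)

def window_reverse(windows, k, H, W):
    """(num_windows * B, k * k, C) -> (B, H, W, C): undo window_partition."""
    B = windows.shape[0] // (H * W // (k * k))
    x = windows.view(B, H // k, W // k, k, k, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowMSA(nn.Module):
    """Multi-head self-attention restricted to local (optionally shifted) windows."""
    def __init__(self, dim, window_size=8, num_heads=4, shift=0):
        super().__init__()
        self.k, self.shift = window_size, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.shift:                         # SW-MSA: cyclic shift before partitioning
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        w = window_partition(x, self.k)
        w, _ = self.attn(w, w, w)              # attention never crosses window borders
        x = window_reverse(w, self.k, H, W)
        if self.shift:                         # undo the cyclic shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return x
```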
Since the discriminator severely affects the stability of adversarial training, we opt to use a Conv-based discriminator directly from [29]. In our experiments, we find that simply replacing the convolutions with transformer blocks under this baseline architecture already yields more stable training, thanks to the improved model capacity. However, such a naive architecture cannot make our transformer-based GAN compete with the state of the art, so we conduct further studies, which we introduce as follows.
Table 1. Comparison of different style injection methods on FFHQ-256. The style injection methods considerably improve the FID, among which AdaIN leads to the best generation quality.

Style injection. We first strengthen the model capability by adapting the generator to a style-based architecture [29, 30], as shown in Figure 2(b). We learn a non-linear mapping $f: \mathcal{Z} \rightarrow \mathcal{W}$ to map the latent code $\boldsymbol{z}$ from the $\mathcal{Z}$ space to the $\mathcal{W}$ space, which specifies the styles that are injected into the main synthesis network. We investigate the following style injection approaches:
AdaNorm modulates the statistics (i.e., mean and variance) of feature maps after normalization. We study multiple normalization variants, including instance normalization (IN) [54], batch normalization (BN) [21], layer normalization (LN) [3], and the recently proposed RMSNorm [65]. Since RMSNorm removes the mean-centering of LN, we only predict the variance from the $\mathcal{W}$ code.
Modulated MLP: Instead of modulating feature maps, one can also modulate the weights of linear layers. Specifically, we rescale the channel-wise weight magnitude of the feed-forward network (FFN) within transformer blocks. According to [30], such style injection admits a faster speed than AdaNorm.
Cross-attention: Motivated by the decoder transformer [57], we explore a transformer-specific style injection in which the transformers additionally attend to the style tokens derived from the $\mathcal{W}$ space. The effectiveness of this cross-attention is also validated in [67].
Table 1 shows that all the above style injection methods significantly boost the generative modeling capacity, except that training with AdaBN does not converge because the batch size is compromised for high-resolution synthesis. In comparison, AdaNorm brings more sufficient style injection, possibly because the network could take advantage of the style information twice, in both the attention block and the FFN, whereas the modulated MLP and cross-attention make use of the style information only once. We did not further study the hybrid of modulated MLP and cross-attention due to efficiency considerations. Furthermore, compared to AdaBN and AdaLN, AdaIN offers finer and more sufficient feature modulation as feature maps are normalized and modulated independently, so we choose AdaIN by default for our following experiments.
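As a concrete reference, below is a minimal sketch of AdaIN-style injection for transformer tokens: the $\mathcal{W}$ code predicts a per-channel scale and bias that modulate the instance-normalized features. In a style-based block this replaces the plain layer normalization before both the attention and the FFN, so the style is used twice per block. The class and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization driven by a style code w."""
    def __init__(self, dim, w_dim=512):
        super().__init__()
        self.style = nn.Linear(w_dim, 2 * dim)   # predicts per-channel (scale, bias)

    def forward(self, x, w):
        # x: (B, N, C) tokens of one resolution, w: (B, w_dim) mapped latent code
        scale, bias = self.style(w).chunk(2, dim=-1)             # (B, C) each
        mean = x.mean(dim=1, keepdim=True)                       # per-sample, per-channel stats
        std = x.std(dim=1, keepdim=True) + 1e-8                  # over the spatial tokens
        x = (x - mean) / std                                     # instance normalization
        return x * (1 + scale.unsqueeze(1)) + bias.unsqueeze(1)  # style modulation
```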
Double attention. Using local attention, nonetheless, sacrifices the ability to model long-range dependencies, which is pivotal to capturing geometry [4, 66]. Let the window size
used by the Swin block be $\kappa \times \kappa$; then, due to the shifted window strategy, the receptive field increases by $\kappa$ in each dimension with one more Swin block. Suppose we use Swin blocks to process a $64 \times 64$ feature map and we by default choose $\kappa = 8$; then it takes $64 / \kappa = 8$ transformer blocks to span the entire feature map. To enlarge the receptive field more efficiently, we propose double attention, in which half of the attention heads attend to the regular (local) windows and the other half to the shifted windows, and the outputs of all heads are concatenated:
" Double-Attention "=" Concat "(head_(1),dots," head "_(h))W^(O)\text { Double-Attention }=\text { Concat }\left(\operatorname{head}_{1}, \ldots, \text { head }_{h}\right) \boldsymbol{W}^{O}
where $\boldsymbol{W}^{O} \in \mathbb{R}^{C \times C}$ is the projection matrix used to mix the heads into the output. The attention heads in Equation 2 can be computed as:

$$\text{head}_{i} = \begin{cases} \text{Attention}\left(\boldsymbol{x}_{w} \boldsymbol{W}_{i}^{Q}, \boldsymbol{x}_{w} \boldsymbol{W}_{i}^{K}, \boldsymbol{x}_{w} \boldsymbol{W}_{i}^{V}\right), & i \leq h / 2 \\ \text{Attention}\left(\boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{Q}, \boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{K}, \boldsymbol{x}_{sw} \boldsymbol{W}_{i}^{V}\right), & \text{otherwise} \end{cases} \tag{3}$$
where $\boldsymbol{x}_{w}$ and $\boldsymbol{x}_{sw}$ denote the features partitioned under the regular and the shifted window configuration, and $\boldsymbol{W}_{i}^{Q}, \boldsymbol{W}_{i}^{K}, \boldsymbol{W}_{i}^{V} \in \mathbb{R}^{C \times (C / h)}$ are the query, key, and value projection matrices for the $i$-th head, respectively. One can derive that the receptive field in each dimension increases by $2.5\kappa$ with one additional double attention block, which allows capturing larger context more efficiently. Still, for a $64 \times 64$ input, it now takes 4 transformer blocks to cover the entire feature map.
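The sketch below illustrates one plausible realization of double attention on top of the window attention sketched earlier: the heads are split into two groups, one operating on the regular window partition and the other on the shifted partition, and the concatenated outputs are mixed by a projection playing the role of $\boldsymbol{W}^{O}$. For brevity the channels are split between the two branches rather than letting every head project from the full $C$ dimensions as in Equation 3; all names are illustrative.

```python
import torch
import torch.nn as nn

class DoubleAttention(nn.Module):
    """Half of the heads attend within regular windows, half within shifted windows."""
    def __init__(self, dim, window_size=8, num_heads=8):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        # WindowMSA is the window attention module from the earlier sketch.
        self.regular = WindowMSA(dim // 2, window_size, num_heads // 2, shift=0)
        self.shifted = WindowMSA(dim // 2, window_size, num_heads // 2,
                                 shift=window_size // 2)
        self.proj = nn.Linear(dim, dim)        # plays the role of W^O

    def forward(self, x):                      # x: (B, H, W, C)
        x_w, x_sw = x.chunk(2, dim=-1)         # features for the two branches
        out = torch.cat([self.regular(x_w), self.shifted(x_sw)], dim=-1)
        return self.proj(out)                  # each pixel now mixes both context types
```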
Local-global positional encoding. The relative positional encoding (RPE) adopted by the default Swin blocks encodes the relative positions of pixels and has proven crucial for discriminative tasks [9, 43]. Theoretically, a multi-head local attention layer with RPE can express any convolutional layer with window-sized kernels [8, 38]. However, when substituting the convolutional layers with transformers that use RPE, one thing is rarely noticed: ConvNets could infer the absolute positions by leveraging the clues from the zero paddings [22, 31], yet such a feature is missing in Swin blocks using RPE. On the other hand, it is essential to let the generator be aware of the absolute positions because the synthesis of a specific component, e.g., the mouth, highly depends on its spatial coordinate [1, 40].
Figure 3. Blocking artifacts become obvious at $1024 \times 1024$ resolution. These artifacts correlate with the window size of the local attention.

In view of this, we propose to introduce sinusoidal positional encoding [7, 57, 61] (SPE) on each scale, as shown in Figure 2(b). Specifically, after the scale upsampling, the
feature maps are added with the following encoding:

$$\operatorname{SPE}(i, j) = \big[\sin(\omega_{0} i), \cos(\omega_{0} i), \sin(\omega_{1} i), \cos(\omega_{1} i), \ldots, \sin(\omega_{0} j), \cos(\omega_{0} j), \sin(\omega_{1} j), \cos(\omega_{1} j), \ldots\big]$$
where $\omega_{k} = 1 / 10000^{2k}$ and $(i, j)$ denotes the 2D location. We use SPE rather than learnable absolute positional encoding [11] because SPE admits translation invariance [58]. In practice, we make the best of RPE and SPE by employing them together: the RPE applied within each transformer block offers the relative positions within the local context, whereas the SPE introduced on each scale informs the global position.
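For reference, a minimal sketch of constructing such a 2D sinusoidal encoding is given below: the first half of the channels encodes the row coordinate $i$ and the second half the column coordinate $j$, with interleaved sine and cosine terms. The frequency schedule here is normalized by the channel count following the standard Transformer convention, which is an assumption for illustration; the resulting map is simply added to the feature maps after each upsampling step.

```python
import torch

def sinusoidal_pe_2d(height, width, dim):
    """Build a (height, width, dim) 2D sinusoidal positional encoding.

    Half of the channels encode the row index i, the other half the column
    index j, each as concatenated sin/cos terms over a geometric frequency
    schedule (an illustrative layout, not necessarily the exact paper one).
    """
    assert dim % 4 == 0
    d = dim // 2                                                   # channels per axis
    omega = 1.0 / 10000 ** (2 * torch.arange(d // 2).float() / d)  # frequencies omega_k

    def encode(coord):                                             # (L,) -> (L, d)
        angles = coord[:, None] * omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    i = torch.arange(height).float()                               # row coordinates
    j = torch.arange(width).float()                                # column coordinates
    pe_i = encode(i)[:, None, :].expand(height, width, d)          # varies along rows
    pe_j = encode(j)[None, :, :].expand(height, width, d)          # varies along columns
    return torch.cat([pe_i, pe_j], dim=-1)                         # (height, width, dim)
```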