T-former: An Efficient Transformer for Image Inpainting
Ye Deng, Xi'an Jiaotong University, Xi'an, Shaanxi, China, dengye@stu.xjtu.edu.cn
Siqi Hui, Xi'an Jiaotong University, Xi'an, Shaanxi, China, huisiqi@stu.xjtu.edu.cn
Sanping Zhou*, Xi'an Jiaotong University, Xi'an, Shaanxi, China, spzhou@xjtu.edu.cn
Deyu Meng, Xi'an Jiaotong University, Xi'an, Shaanxi, China, dymeng@mail.xjtu.edu.cn
Jinjun Wang, Xi'an Jiaotong University, Xi'an, Shaanxi, China, jinjun@mail.xjtu.edu.cn
Figure 1: Image inpainting outputs by our proposed T-former. In each group, the input image is shown on the left, with gray pixels representing the missing areas. (Best viewed in color and zoomed in.)
Abstract
Benefiting from powerful convolutional neural networks (CNNs), learning-based image inpainting methods have made significant breakthroughs over the years. However, some inherent properties of CNNs (e.g., the local prior and spatially shared parameters) limit their performance on broken images with diverse and complex forms. Recently, a class of attention-based network architectures, called transformers, has shown significant performance on natural language processing and high-level vision tasks. Compared with CNNs, attention operators are better at long-range modeling and have dynamic weights, but their computational complexity is quadratic in the spatial resolution, which makes them less suitable for applications involving higher-resolution images, such as image inpainting. In this paper, we design a novel attention whose complexity is linear in the resolution, derived from a Taylor expansion. Based on this attention, a network called T-former is designed for image inpainting. Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and low computational complexity. The code can be found at github.com/dengyecode/Tformer_image_inpainting
Ye Deng, Siqi Hui, Sanping Zhou, Deyu Meng, and Jinjun Wang. 2022. T-former: An Efficient Transformer for Image Inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (MM '22), October 10-14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3503161.3548446
1 INTRODUCTION
Image inpainting (or completion) [4] is the process of filling in corrupted or missing parts of an image, as shown in Figure 1. It is an important task in the fields of computer vision and image processing, and it benefits users in a wide range of applications, such as removing unwanted objects in image editing. The key challenge in image inpainting is to make the filled pixels blend in with the non-missing parts.
Before deep learning, non-learning inpainting algorithms could be roughly divided into two categories: diffusion-based approaches [1] and exemplar-based approaches [2, 5, 28, 58]. Diffusion-based approaches smoothly propagate information from observed boundaries to the interior of damaged areas. However, since they do not consider the global image structure, they are only effective for filling small holes and less effective for large-scale breakage. To address these drawbacks, exemplar-based approaches search for valid information in the known regions of the entire image and copy or relocate this information to the missing locations. Although exemplar-based algorithms perform well on simple patterns broken over larger areas, they do not perform well when filling images with complex patterns, because they do not understand the semantic information of the image.
With the development of convolutional neural networks (CNNs), learning-based approaches have reached the state of the art in image inpainting. These inpainting models [24, 30, 42, 66] formulate inpainting as a conditional image generation problem and customize a CNN-based encoder-decoder as the corresponding conditional image generator. By training on sufficiently large-scale datasets, CNNs show their strength in learning rich patterns and image semantics, and fill the target regions with such learned knowledge. In addition, the sparse connectivity and spatially shared parameters of CNNs make them computationally efficient. However, some basic characteristics of CNNs may limit them in handling the inpainting task. (a) Locality. CNNs are good at acquiring local relationships but not at capturing long-range dependencies. Although the importance of locality for images has long been demonstrated in various vision tasks, for image inpainting, attending to non-local features (the whole image) is more likely to find the appropriate information for broken regions. (b) Spatially shared, static parameters. The same convolution kernel operates on features across all spatial locations, and the parameters of the kernel are static at inference time. This is somewhat inflexible for inpainting, where images mix broken and unbroken pixels and the damaged regions are variable.
Recently, (self-)attention [51], popular in the field of natural language processing, has been introduced to vision tasks. Compared to CNNs, attention operators, whose weights dynamically adjust with the input, are better able to capture long-range dependencies through explicit interaction with global features. As a well-explored architecture in language tasks, the transformer model, built on attention, is emerging in high-level vision tasks. Although the attention operator has advantages over CNNs in some respects, its computational complexity grows quadratically with spatial resolution, and it is therefore not particularly suitable for high-resolution images, a situation that occurs frequently in low-level vision tasks including image inpainting. Recently, some designs that reduce the computational complexity of attention operators have been transferred to inpainting [12] or other low-level vision tasks [31, 56]. These methods either apply attention to a sequence of patches unfolded from the image [12], or divide the image into non-overlapping parts and compute attention for each part independently. However, limiting the spatial extent of attention somewhat defeats the original purpose of capturing long-range dependence between pixels.
Specifically, to address the computational load caused by the dot product and softmax in the attention operator, we use Taylor's formula to approximate the exponential function, and then reduce the computational complexity by swapping the order of the matrix multiplications. In addition, to mitigate the performance loss caused by the error of the Taylor approximation, we introduce a gating mechanism [10] for the proposed attention operator. Previous work [61] showed that a gating mechanism on the convolutions in inpainting can be seen as controlling which features should flow forward. The gating mechanism we impose on the attention is equivalent to an adjustment of the "inaccurate" attention, allowing the subsequent layers of the network to focus on the information that helps the inpainting, thus producing high-quality completion results.
In this paper, based on our designed linear attention, we propose a U-net [46] style network, called T-former, for image inpainting. Compared with a convolution-based encoder-decoder, in T-former we replace the convolutions with transformer modules built on the proposed linear attention. Our proposed T-former combines the texture-pattern learning capability of CNNs with the ability of attention to capture long-range dependencies, and the complexity of this attention is linear rather than quadratic in the resolution. T-former achieves performance comparable to other advanced models while maintaining a small complexity compared to those models.
2 RELATED WORK
2.1 Vision Transformer
The transformer model [51] is a neural network centered on (self-)attention that plays an important role in natural language processing, and Carion et al. [6] were the first to introduce it into the vision field, for object detection. Dosovitskiy et al. [13] then designed a transformer structure more suitable for the visual domain based on the characteristics of images. Touvron et al. [49] reduced the data requirements of visual transformers with the help of knowledge distillation. Wang et al. [55] then introduced the feature pyramid idea, commonly used to build CNNs, into transformer construction, which improved the performance of transformer networks. Next, Vaswani et al. [50] reduced the computational demands of the model by limiting the range of attention so that self-attention acts only on a local window. Subsequently, Liu et al. [37] extended the use and performance of the transformer model through a more subtle design of window attention. These works demonstrated the potential of the transformer for high-level vision tasks; and because its core self-attention excels at long-range modeling, it also meets the needs
of low-level tasks such as inpainting. However, the computational complexity of the attention in the transformer grows quadratically with spatial resolution, making it inappropriate for direct use in low-level vision tasks that require higher-resolution outputs. Therefore, one class of models chooses to process only low-resolution features of the image with the transformer. In VQGAN [15], an autoregressive transformer is utilized to learn an effective codebook. ImageBART [14] improves the quality of image synthesis by replacing the autoregressive model in VQGAN with a diffusion process model. MaskGIT [7], in contrast to VQGAN, abandons the autoregressive generation paradigm and introduces a mask, determining the inferred tokens by the probability values of the mask instead of synthesizing them sequentially as in autoregressive models. ICT [53] is a two-stage inpainting method whose first stage obtains a coarse result with a transformer and then feeds this result into a CNN to refine the details. BAT [62] improves on the first stage of ICT by introducing a bidirectional autoregressive transformer to improve the capability of the model. TFill [67] introduces a restricted CNN head on the transformer of ICT to mitigate the proximity influence. These approaches allow the models to obtain more compact image encodings, but still do not remove the limitation that they cannot be applied to high-resolution images. Subsequently, different strategies to reduce the complexity were adopted. For example, Zamir et al. [63] propose replacing spatial attention with inter-channel attention, and [12, 64] replace inter-pixel attention with inter-patch attention. Window attention is also used to reduce computational complexity by directly limiting the spatial range over which attention acts, in a similar way to [37]. Our T-former, which does not avoid attention between full-space pixels, learns long-range dependencies without imposing excessive complexity.
2.2 Image Inpainting
Before deep learning, non-learning methods could only fill pixels based on the content surrounding the missing regions or on all observed regions, because they could not understand the semantics of the image. These methods tend to be more effective for small missing holes or simple background filling, and have limited effect on images with complex patterns. To enable models to output semantically plausible results, Pathak et al. [42] introduced the generative adversarial network (GAN) [17] framework to train a conditional image generation model with the help of convolutional neural networks (CNNs). Then, in response to the shared, static parameters of the convolution, some researchers modified the convolution so that it can manually [34] or automatically [57, 61] adjust the features according to the image breakage. Next, since it is not easy for a model to recover complex patterns directly, some researchers chose to guide the model to complete the image with the help of extra image information (e.g., edges [40], structure [18, 29, 35, 45], semantics [32, 33]). To improve on this, researchers designed a class of attention operators called contextual attention [36, 54, 59, 60, 65]. Specifically, with the help of the attention module, they explicitly search the entire image for appropriate content to fill the missing regions. Nonetheless, the high cost of performing attention limits its large-scale deployment in the network, so such models are limited in how far they can improve their long-range modeling capability and completion quality. In contrast, the linear attention in our proposed T-former is not only able to model long-range dependencies between features, but also reduces the complexity compared to vanilla attention. This enables us to deploy more attention operators in the proposed T-former and achieve state-of-the-art results in image inpainting.
3 APPROACH
The goal of image inpainting is to fill the target area of the input image with appropriate pixels so that the image looks intact. To achieve this goal, we design a U-net [46] style network based on our proposed linear attention module. In this section, we present our approach bottom-up: we first describe the proposed linear attention module, and then introduce the architecture of our inpainting network.
3.1 Linear Attention
Vanilla Attention. We first explain why the attention operator of the vanilla transformer [51] is not applicable to images with higher resolution. Consider a feature map $X \in \mathbb{R}^{H \times W \times C}$ and let $N = H \times W$. The attention operator first feeds the feature $X$ through three different transformations and reshapes the results into two-dimensional matrices to obtain the corresponding three embeddings: the query $Q \in \mathbb{R}^{N \times C}$, the key $K \in \mathbb{R}^{N \times C}$, and the value $V \in \mathbb{R}^{N \times C}$. Then the corresponding attention result $Y \in \mathbb{R}^{N \times C}$ can be obtained by:

$$Y = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{C}}\right) V, \tag{1}$$
where the attention matrix $\mathrm{softmax}(QK^{\top}/\sqrt{C}) \in \mathbb{R}^{N \times N}$ has quadratic space and time complexity with respect to $N$. Each row $y_i$ of $Y$ can be obtained as:

$$y_i = \sum_{j=1}^{N} \frac{\exp\!\left(q_i k_j^{\top}/\sqrt{C}\right)}{\sum_{j'=1}^{N} \exp\!\left(q_i k_{j'}^{\top}/\sqrt{C}\right)} \, v_j, \tag{2}$$

where $q_i$, $k_j$, and $v_j$ denote rows of $Q$, $K$, and $V$, respectively.
The above equation is dot-product attention with softmax normalization. The complexity of computing each row $y_i$ in Eq. (2) is $O(NC)$. Therefore, the computational complexity of $Y$ is $O(N^2 C)$, which is quadratic with respect to the image resolution $N = H \times W$.
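For concreteness, here is a minimal PyTorch sketch of Eqs. (1)-(2) (hypothetical code, not the paper's implementation) that makes the quadratic cost visible: the $N \times N$ attention matrix must be materialized.

```python
import torch

def vanilla_attention(q, k, v):
    """q, k, v: (N, C) matrices with N = H*W rows."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.t() * scale, dim=-1)  # (N, N): O(N^2 C) time, O(N^2) memory
    return attn @ v                                   # (N, C)

# At 256x256 resolution, N = 65536, so the attention matrix alone holds
# N^2 ~ 4.3e9 entries -- the bottleneck this section describes.
```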
Linearization of Attention. The computational burden of Eq. (2) mainly comes from the softmax term, so most linearizations of attention focus on modifications to the softmax. Revisiting Eq. (2), previous methods [27, 43, 44] compute the attention by using a kernel function $\phi(\cdot)$ instead of $\exp(\cdot)$:

$$y_i = \sum_{j=1}^{N} \frac{\phi(q_i)\,\phi(k_j)^{\top}}{\sum_{j'=1}^{N} \phi(q_i)\,\phi(k_{j'})^{\top}} \, v_j = \frac{\phi(q_i) \sum_{j=1}^{N} \phi(k_j)^{\top} v_j}{\phi(q_i) \sum_{j'=1}^{N} \phi(k_{j'})^{\top}}. \tag{3}$$
Note that the kernel property $\mathrm{sim}(q_i, k_j) = \phi(q_i)\,\phi(k_j)^{\top}$ is used here, where $\phi(\cdot)$ is a projection. These methods obtain linear attention by changing the order of the matrix multiplication from $(QK^{\top})V$ to $\phi(Q)\left(\phi(K)^{\top} V\right)$.
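The reordering is exact, since matrix multiplication is associative; only the cost changes. A tiny numeric check (hypothetical code; $\phi$ here is one common choice from the linear-attention literature, not necessarily that of [27, 43, 44]):

```python
import torch
import torch.nn.functional as F

phi = lambda t: F.elu(t) + 1  # a positive kernel feature map often used in linear attention
q, k, v = torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64)

left = (phi(q) @ phi(k).t()) @ v   # (QK^T)V order: O(N^2 C)
right = phi(q) @ (phi(k).t() @ v)  # Q(K^T V) order: O(N C^2)
assert torch.allclose(left, right, atol=1e-4)  # identical result, different cost
```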
Inspired by the above linear attention approaches, in this paper we take another perspective and linearize the attention by approximating the exponential function through a Taylor expansion. Specifically, the Taylor formula of the exponential function constituting the softmax operator is:

$$\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \cdots \approx 1 + x. \tag{4}$$
Putting the first-order approximation of Eq. (4) into Eq. (2), we can get (the channel scaling factor $\sqrt{C}$ is ignored for simplicity):

$$y_i = \sum_{j=1}^{N} \frac{1 + q_i k_j^{\top}}{\sum_{j'=1}^{N} \left(1 + q_i k_{j'}^{\top}\right)} \, v_j = \frac{\sum_{j=1}^{N} v_j + q_i \sum_{j=1}^{N} k_j^{\top} v_j}{N + q_i \sum_{j'=1}^{N} k_{j'}^{\top}}. \tag{5}$$
It is worth noting that the last equality in Eq. (5) is obtained from the properties of vector multiplication.
Analysis. From the above, a linear-complexity version of attention can be obtained from the properties of matrix multiplication:

$$Y = \frac{\mathbf{1}_N \left(\mathbf{1}_N^{\top} V\right) + Q\left(K^{\top} V\right)}{N + Q\left(K^{\top} \mathbf{1}_N\right)}, \tag{6}$$

where $\mathbf{1}_N \in \mathbb{R}^{N \times 1}$ is an all-ones vector and the division is applied row-wise.
In Eq. (6), instead of calculating the attention matrix $QK^{\top} \in \mathbb{R}^{N \times N}$ first, $K^{\top}V \in \mathbb{R}^{C \times C}$ is computed first and then multiplied by $Q$. With the help of this trick, the computational complexity of the attention operation becomes $O(NC^2)$. Note that in the task of image inpainting the feature (channel) dimension $C$ is always much smaller than the spatial resolution $N$, so we reduce the computational complexity of the model by a large amount. As in the vanilla transformer [51], we also use a multi-headed [51] version of attention to enhance the performance of our proposed linear attention operator. Furthermore, the term $\mathbf{1}_N^{\top} V$ (i.e., $\sum_j v_j$) can be seen as a residual term with respect to $Q(K^{\top}V)$, and the ablation experiments (see Table 3) show that it improves the performance of our inpainting model.
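A minimal sketch of Eq. (6) follows, under the reconstruction above; the function and variable names are ours, and the small epsilon is an assumption added for numerical safety, since the first-order approximation does not guarantee a positive denominator.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (N, C) matrices; never forms the N x N attention matrix."""
    n = q.shape[0]
    kv = k.t() @ v                        # (C, C), computed first: O(N C^2)
    numer = v.sum(dim=0) + q @ kv         # (N, C): residual term + Q(K^T V)
    denom = n + q @ k.sum(dim=0)          # (N,)
    return numer / (denom.unsqueeze(-1) + eps)
```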
3.2 Gated Mechanism for Linear Attention
The gating mechanism from recurrent neural networks (GRU [8], LSTM [22]) first proved its effectiveness on language models [10], and it is also widely used in the feed-forward networks (FFN) of state-of-the-art transformer networks [23, 47, 63]. A gating mechanism (or gated linear unit) can be thought of as a neural network layer whose output $y$ is the element-wise product of two linear transformations of the input $x$:

$$y = \sigma_1\!\left(W_1 x\right) \odot \sigma_2\!\left(W_2 x\right), \tag{7}$$
where $W_1$ and $W_2$ are learnable parameters, $\sigma_1$ and $\sigma_2$ are the corresponding activation functions (which can be absent), and $\odot$ denotes the Hadamard product. This simple and effective gating mechanism significantly enhances the performance of the network, which motivates us to generalize it to the proposed linear attention operator. Specifically, for an input feature $X$ and the linear attention operator $\mathrm{LA}(\cdot)$, the output $\hat{X}$ of the attention with a gating mechanism can be written as:

$$\hat{X} = \sigma\!\left(W_g X\right) \odot \mathrm{LA}(X), \tag{8}$$

where $W_g$ is a learnable parameter and $\sigma$ is an activation function.
A gating mechanism applied to convolutions [61] plays an important role in image inpainting and can be seen as distinguishing the invalid features caused by broken pixels in the input image. Since our proposed linear attention is an "inaccurate" attention, we complement it with a gating mechanism that allows subsequent layers in the network to focus on features that contribute to the inpainting.
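Putting Eqs. (6) and (8) together, a hedged PyTorch sketch of the gated linear attention (LAG) might look as follows. The 1x1 convolutions for Q/K/V, the gate, and the output projection are assumptions for illustration, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Single-head sketch for clarity; the paper uses a multi-head variant."""
    def __init__(self, channels):
        super().__init__()
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.GELU())
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        flat = lambda t: t.flatten(2).transpose(1, 2)    # (B, N, C), N = H*W
        q, k, v = flat(q), flat(k), flat(v)
        kv = k.transpose(1, 2) @ v                       # (B, C, C), computed first
        numer = v.sum(dim=1, keepdim=True) + q @ kv      # residual term + Q(K^T V)
        denom = h * w + q @ k.sum(dim=1).unsqueeze(-1)   # (B, N, 1)
        out = (numer / (denom + 1e-6)).transpose(1, 2).reshape(b, c, h, w)
        return self.proj(self.gate(x) * out)             # Hadamard gate, Eq. (8)
```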
3.3 Network Architecture
Our T-former is a U-net [46] style network built from the proposed transformer block and containing encoder and decoder parts, as shown in Figure 2. The design of this transformer block follows the encoder block of the vanilla transformer [51] and contains two sub-layers. The first is the proposed linear attention with gating mechanism (LAG), and the second is a simple feed-forward network (FFN). In addition, we adopt a residual connection [19] around each of the sub-layers. Besides these transformer modules, we also use some convolutional layers to cope with scale changes (such as upsampling) of the features.
Encoder Pipeline. Given a masked image $X_{in} \in \mathbb{R}^{H \times W \times 3}$, our encoder first feeds it into a convolution layer to get the corresponding feature map $F_0 \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the spatial resolution and $C$ denotes the channel dimension. These features are then fed into four encoder stages (levels 1-4). Each stage contains a stack of the designed transformer blocks, and we use a convolution with stride 2 to downsample the features between every two stages. Given the feature map $F_{i-1}$, the level-$i$ encoder stage of transformer blocks produces the feature map $F_i$; the output of the final (level-4) stage of the encoder is $F_4$.
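A hypothetical skeleton of this pipeline is sketched below; the channel-doubling schedule and the 3x3 downsampling kernel are assumptions, and `stage_blocks` stands in for the stacks of LAG+FFN blocks.

```python
import torch.nn as nn

def make_encoder(base_channels, stage_blocks):
    """stage_blocks: list of 4 nn.Modules, each a stack of transformer blocks."""
    layers, c = [], base_channels
    for i, blocks in enumerate(stage_blocks):
        layers.append(blocks)
        if i < 3:  # stride-2 convolution between every two stages
            layers.append(nn.Conv2d(c, c * 2, kernel_size=3, stride=2, padding=1))
            c *= 2
    return nn.Sequential(*layers)
```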
Decoder Pipeline. The decoder takes the final feature map $F_4$ of the encoder as input and progressively restores the high-resolution representations. The decoder consists of three stages (from level 3 down to level 1), each stacked from several transformer blocks. In each stage of the decoder, the features are first passed through an upsampling layer consisting of nearest-neighbor interpolation and a convolution; the upsampling layer of the level-$i$ decoder stage produces the feature $F_i^{up}$. In addition, to help the decoder, the encoder feature $F_i$ is concatenated to the decoder feature $F_i^{up}$ via a skip connection, and the concatenation is followed by a convolution layer that halves the channels. These fused features are then fed into the corresponding transformer blocks to obtain the dimensionally invariant features $F_i'$.
Figure 2: Overview of our proposed T-former. Our model accepts masked images as input and outputs completed images. T-former is a U-net style network composed of the transformer blocks we designed. Each transformer block contains two sub-layers: (1) linear attention with a gating mechanism (LAG), which performs our proposed linear attention for full-space feature interaction, supplemented with a gating mechanism; (2) a feed-forward network (FFN), which transforms the features learned by the attention operator into useful representations for subsequent layers.
Finally, after the last (level-1) stage of the decoder, we add a convolution layer to convert the feature map into the completed image $X_{out}$.
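A hedged sketch of one decoder stage's feature flow follows; the layer choices (nearest-neighbor upsampling plus a channel-halving 1x1 convolution, and a 1x1 fusion convolution after the concatenation) are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    def __init__(self, channels, blocks):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),      # nearest-neighbor interpolation
            nn.Conv2d(channels, channels // 2, kernel_size=1),
        )
        self.fuse = nn.Conv2d(channels, channels // 2, kernel_size=1)  # halve channels after concat
        self.blocks = blocks  # stack of transformer blocks (LAG + FFN)

    def forward(self, x, skip):
        x = self.up(x)                           # restore spatial resolution
        x = self.fuse(torch.cat([x, skip], 1))   # skip connection from the encoder
        return self.blocks(x)
```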
Transformer Block. As shown in Figure 2, the transformer block used in T-former contains two sub-layers. The first is the proposed linear attention with the gating mechanism (LAG), and the second is a simple feed-forward network (FFN). In the LAG layer, we obtain the gating value by feeding the input $X$ into a convolution layer with a GELU [20] activation function, i.e., $W_g$ and $\sigma$ of Eq. (8).
For the design of the FFN we follow recent transformers [23, 63], which use a gated linear layer [10] with residual connections [19] instead of a residual block [19] composed of two convolutions in series. Specifically, to reduce the complexity, we replace the standard convolutions ($W_1$ and $W_2$ in Eq. (7)) with a combination of a point-wise convolution and a depth-wise convolution.
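A sketch of such an FFN sub-layer is given below, in the spirit of the gated designs of [23, 63]; the expansion factor and the 3x3 depth-wise kernel size are assumptions, not the paper's stated values.

```python
import torch.nn as nn

class GatedFFN(nn.Module):
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.proj_in = nn.Conv2d(channels, hidden * 2, kernel_size=1)   # point-wise
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)            # depth-wise
        self.act = nn.GELU()
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        a, b = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        return x + self.proj_out(self.act(a) * b)  # gate as in Eq. (7), plus residual
```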
3.4 Loss Function
The loss function $\mathcal{L}$ used to train our T-former can be written as a weighted combination of several terms, where $\mathcal{L}_{rec}$ represents the reconstruction loss and $\mathcal{L}_{per}$ denotes the perceptual loss [25],
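For reference, a hedged sketch of the two terms named above: the L1 form of the reconstruction loss and the feature-space form of the perceptual loss are assumptions common in inpainting work, and the remaining terms of the objective are truncated in the text above.

```python
import torch

def reconstruction_loss(pred, target):
    # Pixel-wise L1 distance between the completed and ground-truth images (assumed form).
    return (pred - target).abs().mean()

def perceptual_loss(pred_feats, target_feats):
    # pred_feats / target_feats: lists of feature maps from a fixed, pretrained
    # feature extractor (e.g. VGG, as in [25]) applied to pred and target.
    return sum((p - t).abs().mean() for p, t in zip(pred_feats, target_feats))
```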