T-former: An Efficient Transformer for Image Inpainting
Ye Deng, Xi'an Jiaotong University, Xi'an, Shaanxi, China, dengye@stu.xjtu.edu.cn
Siqi Hui, Xi'an Jiaotong University, Xi'an, Shaanxi, China, huisiqi@stu.xjtu.edu.cn
Sanping Zhou*, Xi'an Jiaotong University, Xi'an, Shaanxi, China, spzhou@xjtu.edu.cn
Deyu Meng, Xi'an Jiaotong University, Xi'an, Shaanxi, China, dymeng@mail.xjtu.edu.cn
Jinjun Wang, Xi'an Jiaotong University, Xi'an, Shaanxi, China, jinjun@mail.xjtu.edu.cn
Figure 1: Image inpainting outputs by our proposed T-former. In each group, the input image is shown on the left, with gray pixels representing the missing areas. (Best viewed in color and zoomed in.)
Abstract
Benefiting from powerful convolutional neural networks (CNNs), learning-based image inpainting methods have made significant breakthroughs over the years. However, some inherent properties of CNNs (e.g., the local prior and spatially shared parameters) limit their performance on broken images with diverse and complex forms. Recently, a class of attention-based network architectures, called transformers, has shown significant performance on natural language processing and high-level vision tasks. Compared with CNNs, attention operators are better at long-range modeling and have dynamic weights, but their computational complexity is quadratic in the spatial resolution, which makes them less suitable for applications involving higher-resolution images, such as image inpainting. In this paper, we design a novel attention whose complexity is linear in the resolution, derived from a Taylor expansion. Based on this attention, a network called T-former is designed for image inpainting. Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and low computational complexity. The code can be found at github.com/dengyecode/Tformer_image_inpainting
Ye Deng, Siqi Hui, Sanping Zhou, Deyu Meng, and Jinjun Wang. 2022. T-former: An Efficient Transformer for Image Inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (MM '22), October 10-14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3503161.3548446
1 INTRODUCTION
Image inpainting (or completion) [4] is the process of filling in corrupted or missing parts of an image, as shown in Figure 1. It is an important task in the fields of computer vision and image processing, and it benefits users in a wide range of applications, such as removing unwanted objects in image editing. The key challenge in image inpainting is to make the filled pixels blend in with the non-missing parts.
Before deep learning, non-learning inpainting algorithms could be roughly divided into two categories: diffusion-based approaches [1] and exemplar-based approaches [2, 5, 28, 58]. Diffusion-based approaches smoothly propagate information from observed boundaries to the interior of damaged areas. However, since they do not consider the global image structure, they are only effective for filling small holes and less effective for large-scale breakage. To address these drawbacks, exemplar-based approaches search for valid information in the known regions of the entire image and copy or relocate this information to the missing locations. Although exemplar-based algorithms perform well on simple patterns broken over larger areas, they do not perform well when filling images with complex patterns, because they do not understand the semantic information of the image.
With the development of convolutional neural networks (CNNs), learning-based approaches have reached the state of the art in image inpainting. These inpainting models [24, 30, 42, 66] formulate inpainting as a conditional image generation problem and customize a CNN-based encoder-decoder as the corresponding conditional image generator. By training on sufficiently large-scale datasets, CNNs show their strength in learning rich patterns and image semantics, and fill the target regions with such learned knowledge. In addition, the sparse connectivity and spatially shared parameters of CNNs make them computationally efficient. However, some basic characteristics of CNNs may limit them in handling the inpainting task. (a) Locality. CNNs are good at acquiring local relationships but not at capturing long-range dependencies. Although the importance of locality for images has long been demonstrated in various vision tasks, for image inpainting, attending to non-local features (the whole image) is more likely to find the appropriate information for broken regions. (b) Spatially shared, static parameters. The same convolution kernel operates on features across all spatial locations, and the parameters of the kernel are static at inference time. This is somewhat inflexible for inpainting, where images mix broken and unbroken pixels and the damaged regions are variable.
Recently, (self-)attention [51], popular in the field of natural language processing, has been introduced to vision tasks. Compared to CNNs, attention operators, whose weights dynamically adjust with the input, are better able to capture long-range dependencies through explicit interaction with global features. As a well-explored architecture in language tasks, the transformer model, built on attention, is emerging in high-level vision tasks. Although the attention operator has advantages over CNNs in some respects, its computational complexity grows quadratically with spatial resolution, and it is therefore not particularly suitable for high-resolution images, a situation that occurs frequently in low-level vision tasks including image inpainting. Recently, some designs that reduce the computational complexity of attention operators have been transferred to inpainting [12] or other low-level vision tasks [31, 56]. These methods either apply attention to a sequence of patches unfolded from the image [12], or divide the image into non-overlapping parts and compute attention for each part independently. However, limiting the spatial extent of attention somewhat defeats the original purpose of capturing long-range dependence between pixels.
Specifically, to address the computational load caused by the dot product and softmax in the attention operator, we use Taylor's formula to approximate the exponential function, and then reduce the computational complexity by swapping the order of the matrix multiplications. In addition, to mitigate the performance loss caused by the error of the Taylor approximation, we introduce a gating mechanism [10] for the proposed attention operator. Previous work [61] showed that a gating mechanism on the convolutions in inpainting can be seen as controlling which features should flow forward. The gating mechanism we impose on the attention is equivalent to an adjustment of the "inaccurate" attention, allowing the subsequent layers of the network to focus on the information that helps the inpainting, thus producing high-quality completion results.
In this paper, based on our designed linear attention, we propose a U-net [46] style network, called T-former, for image inpainting. Compared with a convolution-based encoder-decoder, in T-former we replace the convolutions with transformer modules built on the proposed linear attention. Our proposed T-former combines the texture-pattern learning capability of CNNs with the ability of attention to capture long-range dependencies, and the complexity of this attention is linear rather than quadratic in the resolution. T-former achieves performance comparable to other advanced models while maintaining a small complexity compared to those models.
2 RELATED WORK
2.1 Vision Transformer
The transformer model [51] is a neural network centered on (self-)attention that plays an important role in natural language processing, and Carion et al. [6] were the first to introduce it into the vision field, for object detection. Dosovitskiy et al. [13] then designed a transformer structure more suitable for the visual domain based on the characteristics of images. Touvron et al. [49] reduced the data requirements of visual transformers with the help of knowledge distillation. Wang et al. [55] then introduced the feature pyramid idea, commonly used to build CNNs, into transformer construction, which improved the performance of transformer networks. Next, Vaswani et al. [50] reduced the computational demands of the model by limiting the range of attention so that self-attention acts only on a local window. Subsequently, Liu et al. [37] extended the use and performance of the transformer model through a more subtle design of window attention. These works demonstrated the potential of the transformer for high-level vision tasks; and because its core self-attention excels at long-range modeling, it also meets the needs
of low-level tasks such as inpainting. However, the computational complexity of the attention in the transformer grows quadratically with spatial resolution, making it inappropriate for direct use in low-level vision tasks that require higher-resolution outputs. Therefore, one class of models chooses to process only low-resolution features of the image with the transformer. In VQGAN [15], an autoregressive transformer is utilized to learn an effective codebook. ImageBART [14] improves the quality of image synthesis by replacing the autoregressive model in VQGAN with a diffusion process model. MaskGIT [7], in contrast to VQGAN, abandons the autoregressive generation paradigm and introduces a mask, determining the inferred tokens by the probability values of the mask instead of synthesizing them sequentially as in autoregressive models. ICT [53] is a two-stage inpainting method whose first stage obtains a coarse result with a transformer and then feeds this result into a CNN to refine the details. BAT [62] improves on the first stage of ICT by introducing a bidirectional autoregressive transformer to improve the capability of the model. TFill [67] introduces a restricted CNN head on the transformer of ICT to mitigate the proximity influence. These approaches allow the models to obtain more compact image encodings, but still do not remove the limitation that they cannot be applied to high-resolution images. Subsequently, different strategies to reduce the complexity were adopted. For example, Zamir et al. [63] propose replacing spatial attention with inter-channel attention, and [12, 64] replace inter-pixel attention with inter-patch attention. Window attention is also used to reduce computational complexity by directly limiting the spatial range over which attention acts, in a similar way to [37]. Our T-former, which does not avoid attention between full-space pixels, learns long-range dependencies without imposing excessive complexity.
2.2 Image Inpainting
Before deep learning, non-learning methods could only fill pixels based on the content surrounding the missing regions or on all observed regions, because they could not understand the semantics of the image. These methods tend to be more effective for small missing holes or simple background filling, and have limited effect on images with complex patterns. To enable models to output semantically plausible results, Pathak et al. [42] introduced the generative adversarial network (GAN) [17] framework to train a conditional image generation model with the help of convolutional neural networks (CNNs). Then, in response to the shared, static parameters of the convolution, some researchers modified the convolution so that it can manually [34] or automatically [57, 61] adjust the features according to the image breakage. Next, since it is not easy for a model to recover complex patterns directly, some researchers chose to guide the model to complete the image with the help of extra image information (e.g., edges [40], structure [18, 29, 35, 45], semantics [32, 33]). To improve on this, researchers designed a class of attention operators called contextual attention [36, 54, 59, 60, 65]. Specifically, with the help of the attention module, they explicitly search the entire image for appropriate content to fill the missing regions. Nonetheless, the high cost of performing attention limits its large-scale deployment in the network, so such models are limited in how far they can improve their long-range modeling capability and completion quality. In contrast, the linear attention in our proposed T-former is not only able to model long-range dependencies between features, but also reduces the complexity compared to vanilla attention. This enables us to deploy more attention operators in the proposed T-former and achieve state-of-the-art results in image inpainting.
3 APPROACH
The goal of image inpainting is to fill the target area of the input image with appropriate pixels so that the image looks intact. To achieve this goal, we design a U-net [46] style network based on our proposed linear attention module. In this section, we present our approach bottom-up: we first describe the proposed linear attention module, and then introduce the architecture of our inpainting network.
3.1 Linear Attention
Vanilla Attention. We first explain why the attention operator of the vanilla transformer [51] is not applicable to images with higher resolution. Consider a feature map $X \in \mathbb{R}^{H \times W \times C}$ and let $N = H \times W$. The attention operator first feeds the feature $X$ through three different transformations and reshapes the results into two-dimensional matrices to obtain the corresponding three embeddings: the query $Q \in \mathbb{R}^{N \times C}$, the key $K \in \mathbb{R}^{N \times C}$, and the value $V \in \mathbb{R}^{N \times C}$. Then the corresponding attention result $Y \in \mathbb{R}^{N \times C}$ can be obtained by:

$$Y = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{C}}\right) V, \tag{1}$$
where the attention matrix $\mathrm{softmax}(QK^{\top}/\sqrt{C}) \in \mathbb{R}^{N \times N}$ has quadratic space and time complexity with respect to $N$. Each row $y_i$ of $Y$ can be obtained as:

$$y_i = \sum_{j=1}^{N} \frac{\exp\!\left(q_i k_j^{\top}/\sqrt{C}\right)}{\sum_{j'=1}^{N} \exp\!\left(q_i k_{j'}^{\top}/\sqrt{C}\right)} \, v_j, \tag{2}$$

where $q_i$, $k_j$, and $v_j$ denote rows of $Q$, $K$, and $V$, respectively.
The above equation is dot-product attention with softmax normalization. The complexity of computing each row $y_i$ in Eq. (2) is $O(NC)$. Therefore, the computational complexity of $Y$ is $O(N^2 C)$, which is quadratic with respect to the image resolution $N = H \times W$.
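For concreteness, here is a minimal PyTorch sketch of Eqs. (1)-(2) (hypothetical code, not the paper's implementation) that makes the quadratic cost visible: the $N \times N$ attention matrix must be materialized.

```python
import torch

def vanilla_attention(q, k, v):
    """q, k, v: (N, C) matrices with N = H*W rows."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.t() * scale, dim=-1)  # (N, N): O(N^2 C) time, O(N^2) memory
    return attn @ v                                   # (N, C)

# At 256x256 resolution, N = 65536, so the attention matrix alone holds
# N^2 ~ 4.3e9 entries -- the bottleneck this section describes.
```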
Linearization of Attention. The computational burden of Eq. (2) mainly comes from the softmax term, so most linearizations of attention focus on modifications to the softmax. Revisiting Eq. (2), previous methods [27, 43, 44] compute the attention by using a kernel function $\phi(\cdot)$ instead of $\exp(\cdot)$:

$$y_i = \sum_{j=1}^{N} \frac{\phi(q_i)\,\phi(k_j)^{\top}}{\sum_{j'=1}^{N} \phi(q_i)\,\phi(k_{j'})^{\top}} \, v_j = \frac{\phi(q_i) \sum_{j=1}^{N} \phi(k_j)^{\top} v_j}{\phi(q_i) \sum_{j'=1}^{N} \phi(k_{j'})^{\top}}. \tag{3}$$
Note that the kernel property $\mathrm{sim}(q_i, k_j) = \phi(q_i)\,\phi(k_j)^{\top}$ is used here, where $\phi(\cdot)$ is a projection. These methods obtain linear attention by changing the order of the matrix multiplication from $(QK^{\top})V$ to $\phi(Q)\left(\phi(K)^{\top} V\right)$.
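The reordering is exact, since matrix multiplication is associative; only the cost changes. A tiny numeric check (hypothetical code; $\phi$ here is one common choice from the linear-attention literature, not necessarily that of [27, 43, 44]):

```python
import torch
import torch.nn.functional as F

phi = lambda t: F.elu(t) + 1  # a positive kernel feature map often used in linear attention
q, k, v = torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64)

left = (phi(q) @ phi(k).t()) @ v   # (QK^T)V order: O(N^2 C)
right = phi(q) @ (phi(k).t() @ v)  # Q(K^T V) order: O(N C^2)
assert torch.allclose(left, right, atol=1e-4)  # identical result, different cost
```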
Inspired by the above linear attention approaches, in this paper we take another perspective and linearize the attention by approximating the exponential function through a Taylor expansion. Specifically, the Taylor formula of the exponential function constituting the softmax operator is:

$$\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \cdots \approx 1 + x. \tag{4}$$
Putting the first-order approximation of Eq. (4) into Eq. (2), we can get (the channel scaling factor $\sqrt{C}$ is ignored for simplicity):

$$y_i = \sum_{j=1}^{N} \frac{1 + q_i k_j^{\top}}{\sum_{j'=1}^{N} \left(1 + q_i k_{j'}^{\top}\right)} \, v_j = \frac{\sum_{j=1}^{N} v_j + q_i \sum_{j=1}^{N} k_j^{\top} v_j}{N + q_i \sum_{j'=1}^{N} k_{j'}^{\top}}. \tag{5}$$
It is worth noting that the last equality in Eq. (5) is obtained from the properties of vector multiplication.
Analysis. From the above, a linear-complexity version of attention can be obtained from the properties of matrix multiplication:

$$Y = \frac{\mathbf{1}_N \left(\mathbf{1}_N^{\top} V\right) + Q\left(K^{\top} V\right)}{N + Q\left(K^{\top} \mathbf{1}_N\right)}, \tag{6}$$

where $\mathbf{1}_N \in \mathbb{R}^{N \times 1}$ is an all-ones vector and the division is applied row-wise.
In Eq. (6), instead of calculating the attention matrix $QK^{\top} \in \mathbb{R}^{N \times N}$ first, $K^{\top}V \in \mathbb{R}^{C \times C}$ is computed first and then multiplied by $Q$. With the help of this trick, the computational complexity of the attention operation becomes $O(NC^2)$. Note that in the task of image inpainting the feature (channel) dimension $C$ is always much smaller than the spatial resolution $N$, so we reduce the computational complexity of the model by a large amount. As in the vanilla transformer [51], we also use a multi-headed [51] version of attention to enhance the performance of our proposed linear attention operator. Furthermore, the term $\mathbf{1}_N^{\top} V$ (i.e., $\sum_j v_j$) can be seen as a residual term with respect to $Q(K^{\top}V)$, and the ablation experiments (see Table 3) show that it improves the performance of our inpainting model.
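A minimal sketch of Eq. (6) follows, under the reconstruction above; the function and variable names are ours, and the small epsilon is an assumption added for numerical safety, since the first-order approximation does not guarantee a positive denominator.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (N, C) matrices; never forms the N x N attention matrix."""
    n = q.shape[0]
    kv = k.t() @ v                        # (C, C), computed first: O(N C^2)
    numer = v.sum(dim=0) + q @ kv         # (N, C): residual term + Q(K^T V)
    denom = n + q @ k.sum(dim=0)          # (N,)
    return numer / (denom.unsqueeze(-1) + eps)
```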
3.2 Gated Mechanism for Linear Attention
The gating mechanism from recurrent neural networks (GRU [8], LSTM [22]) first proved its effectiveness on language models [10], and it is also widely used in the feed-forward networks (FFN) of state-of-the-art transformer networks [23, 47, 63]. A gating mechanism (or gated linear unit) can be thought of as a neural network layer whose output $y$ is the element-wise product of two linear transformations of the input $x$:

$$y = \sigma_1\!\left(W_1 x\right) \odot \sigma_2\!\left(W_2 x\right), \tag{7}$$
where $W_1$ and $W_2$ are learnable parameters, $\sigma_1$ and $\sigma_2$ are the corresponding activation functions (which can be absent), and $\odot$ denotes the Hadamard product. This simple and effective gating mechanism significantly enhances the performance of the network, which motivates us to generalize it to the proposed linear attention operator. Specifically, for an input feature $X$ and the linear attention operator $\mathrm{LA}(\cdot)$, the output $\hat{X}$ of the attention with a gating mechanism can be written as:

$$\hat{X} = \sigma\!\left(W_g X\right) \odot \mathrm{LA}(X), \tag{8}$$

where $W_g$ is a learnable parameter and $\sigma$ is an activation function.
A gating mechanism applied to convolutions [61] plays an important role in image inpainting and can be seen as distinguishing the invalid features caused by broken pixels in the input image. Since our proposed linear attention is an "inaccurate" attention, we complement it with a gating mechanism that allows subsequent layers in the network to focus on features that contribute to the inpainting.
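Putting Eqs. (6) and (8) together, a hedged PyTorch sketch of the gated linear attention (LAG) might look as follows. The 1x1 convolutions for Q/K/V, the gate, and the output projection are assumptions for illustration, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Single-head sketch for clarity; the paper uses a multi-head variant."""
    def __init__(self, channels):
        super().__init__()
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.GELU())
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        flat = lambda t: t.flatten(2).transpose(1, 2)    # (B, N, C), N = H*W
        q, k, v = flat(q), flat(k), flat(v)
        kv = k.transpose(1, 2) @ v                       # (B, C, C), computed first
        numer = v.sum(dim=1, keepdim=True) + q @ kv      # residual term + Q(K^T V)
        denom = h * w + q @ k.sum(dim=1).unsqueeze(-1)   # (B, N, 1)
        out = (numer / (denom + 1e-6)).transpose(1, 2).reshape(b, c, h, w)
        return self.proj(self.gate(x) * out)             # Hadamard gate, Eq. (8)
```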
3.3 Network Architecture
Our T-former is a U-net [46] style network built from the proposed transformer block and containing encoder and decoder parts, as shown in Figure 2. The design of this transformer block follows the encoder block of the vanilla transformer [51] and contains two sub-layers. The first is the proposed linear attention with gating mechanism (LAG), and the second is a simple feed-forward network (FFN). In addition, we adopt a residual connection [19] around each of the sub-layers. Besides these transformer modules, we also use some convolutional layers to cope with scale changes (such as upsampling) of the features.
Encoder Pipeline. Given a masked image $X_{in} \in \mathbb{R}^{H \times W \times 3}$, our encoder first feeds it into a convolution layer to get the corresponding feature map $F_0 \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the spatial resolution and $C$ denotes the channel dimension. These features are then fed into four encoder stages (levels 1-4). Each stage contains a stack of the designed transformer blocks, and we use a convolution with stride 2 to downsample the features between every two stages. Given the feature map $F_{i-1}$, the level-$i$ encoder stage of transformer blocks produces the feature map $F_i$; the output of the final (level-4) stage of the encoder is $F_4$.
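A hypothetical skeleton of this pipeline is sketched below; the channel-doubling schedule and the 3x3 downsampling kernel are assumptions, and `stage_blocks` stands in for the stacks of LAG+FFN blocks.

```python
import torch.nn as nn

def make_encoder(base_channels, stage_blocks):
    """stage_blocks: list of 4 nn.Modules, each a stack of transformer blocks."""
    layers, c = [], base_channels
    for i, blocks in enumerate(stage_blocks):
        layers.append(blocks)
        if i < 3:  # stride-2 convolution between every two stages
            layers.append(nn.Conv2d(c, c * 2, kernel_size=3, stride=2, padding=1))
            c *= 2
    return nn.Sequential(*layers)
```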
Decoder Pipeline. The decoder takes the final feature map $F_4$ of the encoder as input and progressively restores the high-resolution representations. The decoder consists of three stages (from level 3 down to level 1), each stacked from several transformer blocks. In each stage of the decoder, the features are first passed through an upsampling layer consisting of nearest-neighbor interpolation and a convolution; the upsampling layer of the level-$i$ decoder stage produces the feature $F_i^{up}$. In addition, to help the decoder, the encoder feature $F_i$ is concatenated to the decoder feature $F_i^{up}$ via a skip connection, and the concatenation is followed by a convolution layer that halves the channels. These fused features are then fed into the corresponding transformer blocks to obtain the dimensionally invariant features $F_i'$.
Figure 2: Overview of our proposed T-former. Our model accepts masked images as input and outputs completed images. T-former is a U-net style network composed of the transformer blocks we designed. Each transformer block contains two sub-layers: (1) linear attention with a gating mechanism (LAG), which performs our proposed linear attention for full-space feature interaction, supplemented with a gating mechanism; (2) a feed-forward network (FFN), which transforms the features learned by the attention operator into useful representations for subsequent layers.
Finally, after the last (level-1) stage of the decoder, we add a convolution layer to convert the feature map into the completed image $X_{out}$.
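A hedged sketch of one decoder stage's feature flow follows; the layer choices (nearest-neighbor upsampling plus a channel-halving 1x1 convolution, and a 1x1 fusion convolution after the concatenation) are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    def __init__(self, channels, blocks):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),      # nearest-neighbor interpolation
            nn.Conv2d(channels, channels // 2, kernel_size=1),
        )
        self.fuse = nn.Conv2d(channels, channels // 2, kernel_size=1)  # halve channels after concat
        self.blocks = blocks  # stack of transformer blocks (LAG + FFN)

    def forward(self, x, skip):
        x = self.up(x)                           # restore spatial resolution
        x = self.fuse(torch.cat([x, skip], 1))   # skip connection from the encoder
        return self.blocks(x)
```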
Transformer Block. As shown in Figure 2, the transformer block used in T-former contains two sub-layers. The first is the proposed linear attention with the gating mechanism (LAG), and the second is a simple feed-forward network (FFN). In the LAG layer, we obtain the gating value by feeding the input $X$ into a convolution layer with a GELU [20] activation function, i.e., $W_g$ and $\sigma$ of Eq. (8).
For the design of the FFN we follow recent transformers [23, 63], which use a gated linear layer [10] with residual connections [19] instead of a residual block [19] composed of two convolutions in series. Specifically, to reduce the complexity, we replace the standard convolutions ($W_1$ and $W_2$ in Eq. (7)) with a combination of a point-wise convolution and a depth-wise convolution.
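A sketch of such an FFN sub-layer is given below, in the spirit of the gated designs of [23, 63]; the expansion factor and the 3x3 depth-wise kernel size are assumptions, not the paper's stated values.

```python
import torch.nn as nn

class GatedFFN(nn.Module):
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.proj_in = nn.Conv2d(channels, hidden * 2, kernel_size=1)   # point-wise
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)            # depth-wise
        self.act = nn.GELU()
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        a, b = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        return x + self.proj_out(self.act(a) * b)  # gate as in Eq. (7), plus residual
```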
3.4 Loss Function
The loss function $\mathcal{L}$ used to train our T-former can be written as a weighted combination of several terms, where $\mathcal{L}_{rec}$ represents the reconstruction loss and $\mathcal{L}_{per}$ denotes the perceptual loss [25],
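For reference, a hedged sketch of the two terms named above: the L1 form of the reconstruction loss and the feature-space form of the perceptual loss are assumptions common in inpainting work, and the remaining terms of the objective are truncated in the text above.

```python
import torch

def reconstruction_loss(pred, target):
    # Pixel-wise L1 distance between the completed and ground-truth images (assumed form).
    return (pred - target).abs().mean()

def perceptual_loss(pred_feats, target_feats):
    # pred_feats / target_feats: lists of feature maps from a fixed, pretrained
    # feature extractor (e.g. VGG, as in [25]) applied to pred and target.
    return sum((p - t).abs().mean() for p, t in zip(pred_feats, target_feats))
```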