

Resolution-robust Large Mask Inpainting with Fourier Convolutions


Roman Suvorov1 Elizaveta Logacheva1 Anton Mashikhin1 Anastasia Remizova3∗ Arsenii Ashukha1


Aleksei Silvestrov1 Naejin Kong2 Harshith Goka2 Kiwoong Park2 Victor Lempitsky1,4



1Samsung AI Center Moscow  2Samsung Research,


3School of Computer and Communication Sciences, EPFL


4Skolkovo Institute of Science and Technology, Moscow, Russia
Abstract
footnotetext:
Correspondence to Roman Suvorov: windj007@gmail.com
footnotetext:
This work was done at Samsung AI Center Moscow


Modern image inpainting systems, despite the significant progress, often struggle with large missing areas, complex geometric structures, and high-resolution images. We find that one of the main reasons for that is the lack of an effective receptive field in both the inpainting network and the loss function. To alleviate this issue, we propose a new method called large mask inpainting (LaMa). LaMa is based on i) a new inpainting network architecture that uses fast Fourier convolutions (FFCs), which have an image-wide receptive field; ii) a high receptive field perceptual loss; and iii) large training masks, which unlock the potential of the first two components. Our inpainting network improves the state of the art across a range of datasets and achieves excellent performance even in challenging scenarios, e.g. the completion of periodic structures. Our model generalizes surprisingly well to resolutions that are higher than those seen at train time, and achieves this at lower parameter and time costs than the competitive baselines. The code is available at https://github.com/saic-mdal/lama.

[Uncaptioned image]

Figure 1: The proposed method can successfully inpaint large regions and works well with a wide range of images, including those with complex repetitive structures. The method generalizes to high-resolution images, while being trained only at low 256×256 resolution.


1 Introduction


The solution to the image inpainting problem, i.e. the realistic filling of missing parts, requires both “understanding” the large-scale structure of natural images and performing image synthesis. The subject was studied in the pre-deep-learning era [1, 5, 13], and in recent years the progress has accelerated through the use of deep and wide neural networks [26, 30, 25] and adversarial learning [34, 18, 56, 44, 57, 32, 54, 61].


The usual practice is to train inpainting systems on a large, automatically generated dataset, created by randomly masking real images. Complex two-stage models with intermediate predictions, such as smoothed images [27, 54, 61], edges [32, 48], and segmentation maps [44], are often used. In this work, we achieve state-of-the-art results with a simple single-stage network.


A large effective receptive field [29] is essential for understanding the global structure of an image, and hence for solving the inpainting problem. Moreover, in the case of large masks, even a large yet finite receptive field may not be sufficient to access the information necessary to generate a quality inpainting. We note that popular convolutional architectures may lack a sufficiently large effective receptive field. We carefully intervene in each component of the system to alleviate the problem and unlock the potential of a single-stage solution. Specifically:


i) We propose an inpainting network based on the recently developed fast Fourier convolutions (FFCs) [4]. FFCs allow the receptive field to cover an entire image even in the early layers of the network. We show that this property of FFCs improves both the perceptual quality and the parameter efficiency of the network. Interestingly, the inductive bias of FFCs allows the network to generalize to high resolutions that were never seen during training (Figure 5, Figure 6). This finding brings significant practical benefits, as less training data and computation are required.


ii) We propose the use of a perceptual loss [20] based on a semantic segmentation network with a high receptive field. This relies on the observation that an insufficient receptive field hurts not only the inpainting network, but also the perceptual loss. Our loss promotes the consistency of global structures and shapes.


iii) We introduce an aggressive strategy of training mask generation, to unlock the potential of the high receptive field of the first two components. The procedure produces wide and large masks, which force the network to fully exploit the high receptive field of the model and of the loss function.

Refer to caption

Figure 2: The scheme of the proposed large mask inpainting approach (LaMa). LaMa is based on a feed-forward ResNet-like inpainting network that uses: the recently proposed fast Fourier convolutions (FFC) [4], a multi-component loss that combines an adversarial loss and a high receptive field perceptual loss, and a train-time large mask generation procedure.


This leads us to large mask inpainting (LaMa), a novel single-stage image inpainting system. The key components of LaMa are (i) a high receptive field architecture, with (ii) a high receptive field loss function, and (iii) an aggressive algorithm of training mask generation. We meticulously compare LaMa with state-of-the-art baselines and analyze the influence of each proposed component. Through the evaluation, we find that LaMa generalizes to high-resolution images after training only on low-resolution data. LaMa can capture and generate complex periodic structures, and is robust to large masks. Furthermore, this is achieved with significantly fewer trainable parameters and inference time costs than the competitive baselines.


2 Method


Our goal is to inpaint a color image $x$ masked by a binary mask of unknown pixels $m$; the masked image is denoted $x\odot m$. The mask $m$ is stacked with the masked image $x\odot m$, resulting in a four-channel input tensor $x^{\prime}=\texttt{stack}(x\odot m,m)$. We use a feed-forward inpainting network $f_{\theta}(\cdot)$, which we also refer to as the generator. Taking $x^{\prime}$ as input, the inpainting network processes it in a fully convolutional manner and produces an inpainted three-channel color image $\hat{x}=f_{\theta}(x^{\prime})$. Training is performed on a dataset of (image, mask) pairs obtained from real images and synthetically generated masks.
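As a quick illustration of this input construction, here is a minimal PyTorch sketch; the generator is any fully convolutional module, not the actual LaMa network, and the helper name is ours.

```python
import torch

def make_inpainting_input(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Build the four-channel input x' = stack(x ⊙ m, m).

    image: (B, 3, H, W) color image; mask: (B, 1, H, W), 1 marks unknown pixels.
    Following the usual convention, the unknown pixels are zeroed out before stacking.
    """
    masked_image = image * (1.0 - mask)
    return torch.cat([masked_image, mask], dim=1)

# usage with any fully convolutional generator f_theta: (B, 4, H, W) -> (B, 3, H, W)
# x_hat = f_theta(make_inpainting_input(image, mask))
```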


2.1 Global context within early layers


In challenging cases, e.g. the filling of large masks, generating a proper inpainting requires taking the global context into account. We therefore argue that a good architecture should have units with as wide a receptive field as possible, as early in the pipeline as possible. Conventional fully convolutional models, e.g. ResNet [14], suffer from a slow growth of the effective receptive field [29]. Since convolutional kernels are typically small (e.g. $3\times 3$), the receptive field may be insufficient, especially in the early layers of a network. Consequently, many layers in the network will lack global context and will waste computations and parameters to create one. For wide masks, the entire receptive field of the generator at a specific position may lie inside the mask, thus observing only missing pixels. The issue becomes especially pronounced for high-resolution images.
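To make this slow growth concrete (a back-of-the-envelope estimate of ours, not a figure from the paper): a stack of $L$ convolutions with $3\times 3$ kernels and stride 1 has a theoretical receptive field of only

$r = 2L + 1$ pixels,

so covering a 512-pixel-wide context already requires on the order of 250 such layers, and the effective receptive field [29] is considerably smaller than this theoretical bound.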


Fast Fourier convolution (FFC) [4] is a recently proposed operator that allows the use of global context in early layers. FFC is based on a channel-wise fast Fourier transform (FFT) [2] and has a receptive field that covers the entire image. FFC splits channels into two parallel branches: i) a local branch that uses conventional convolutions, and ii) a global branch that uses a real FFT to account for global context. A real FFT can be applied only to real-valued signals, and an inverse real FFT ensures that the output is real-valued. Compared to the FFT, the real FFT uses only half of the spectrum. Specifically, FFC takes the following steps:

  • a)

    applies Real FFT2d to an input tensor

    \textit{Real FFT2d}:\mathbb{R}^{H\times W\times C}\to\mathbb{C}^{H\times\frac{W}{2}\times C},

    and concatenates real and imaginary parts

    \textit{ComplexToReal}:\mathbb{C}^{H\times\frac{W}{2}\times C}\to\mathbb{R}^{H\times\frac{W}{2}\times 2C};
  • b)

    applies a convolution block in the frequency domain

    \textit{ReLU}\circ\textit{BN}\circ\textit{Conv1}\!\!\times\!\!\textit{1}:\mathbb{R}^{H\times\frac{W}{2}\times 2C}\to\mathbb{R}^{H\times\frac{W}{2}\times 2C};
  • c)

    applies an inverse transform to recover the spatial structure

    \textit{RealToComplex}:\mathbb{R}^{H\times\frac{W}{2}\times 2C}\to\mathbb{C}^{H\times\frac{W}{2}\times C},
    \textit{Inverse Real FFT2d}:\mathbb{C}^{H\times\frac{W}{2}\times C}\to\mathbb{R}^{H\times W\times C}.


Finally, the outputs of the local (i) and the global (ii) branches are fused together. An illustration of FFC is presented in Figure 2.
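The spectral branch of steps a)-c) can be sketched in PyTorch as follows; this is a simplified illustration, not the released LaMa implementation, and the channel split between the local and global branches as well as the fusion step are omitted.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Sketch of the global (spectral) branch of an FFC block: steps a)-c) above.

    Simplified relative to the released LaMa code; layer names are illustrative.
    """
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution + BN + ReLU applied in the frequency domain on 2C channels
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # a) real FFT over the spatial dims: (B, C, H, W) -> complex (B, C, H, W//2 + 1)
        spec = torch.fft.rfft2(x, dim=(-2, -1), norm="ortho")
        # ...and concatenate real and imaginary parts along channels: (B, 2C, H, W//2 + 1)
        spec = torch.cat([spec.real, spec.imag], dim=1)
        # b) convolution block in the frequency domain (receptive field = whole image)
        spec = self.freq_conv(spec)
        # c) back to complex and inverse real FFT to restore the spatial structure
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), dim=(-2, -1), norm="ortho")
```

In the full FFC block, the output of this global branch is then fused with the output of the local convolutional branch.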

The power of FFCs  FFCs are fully differentiable and an easy-to-use drop-in replacement for conventional convolutions. Due to the image-wide receptive field, FFCs allow the generator to account for the global context starting from the early layers, which is crucial for high-resolution image inpainting. This also leads to better efficiency: trainable parameters can be used for reasoning and generation instead of “waiting” for the propagation of information.

We show that FFCs are well suited to capture periodic structures, which are common in human-made environments, e.g. bricks, ladders, windows, etc. (Figure 4). Interestingly, sharing the same convolutions across all frequencies shifts the model towards scale equivariance [4] (Figures 5, 6).

2.2 Loss functions

The inpainting problem is inherently ambiguous. There could be many plausible fillings for the same missing areas, especially when the “holes” become wider. We discuss the components of the proposed loss, which together allow handling the complex nature of the problem.

2.2.1 High receptive field perceptual loss

Naive supervised losses require the generator to reconstruct the ground truth precisely. However, the visible parts of the image often do not contain enough information for the exact reconstruction of the masked part. Therefore, using naive supervision leads to blurry results due to the averaging of multiple plausible modes of the inpainted content.

In contrast, perceptual loss [20] evaluates a distance between features extracted from the predicted and the target images by a base pre-trained network $\phi(\cdot)$. It does not require an exact reconstruction, allowing for variations in the reconstructed image. The focus of large-mask inpainting is shifted towards understanding of global structure. Therefore, we argue that it is important to use a base network with a fast growth of the receptive field. We introduce the high receptive field perceptual loss (HRF PL), that uses a high receptive field base model $\phi_{\textit{HRF}}(\cdot)$:

\mathcal{L}_{\text{\it HRFPL}}(x,\hat{x})=\mathcal{M}([\phi_{\text{\it HRF}}(x)-\phi_{\text{\it HRF}}(\hat{x})]^{2}), (1)

where $[\cdot-\cdot]^{2}$ is an element-wise operation, and $\mathcal{M}$ is the sequential two-stage mean operation (interlayer mean of intra-layer means). $\phi_{\textit{HRF}}(x)$ can be implemented using Fourier or dilated convolutions. The HRF perceptual loss appears to be crucial for our large-mask inpainting system, as demonstrated in the ablation study (Table 3).
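A minimal sketch of Eq. (1), assuming a feature extractor `phi_hrf` that returns a list of feature maps from a high receptive field backbone (e.g. a dilated segmentation ResNet50); the function name and interface are ours, not the released API.

```python
import torch
from typing import Callable, List

def hrf_perceptual_loss(x: torch.Tensor,
                        x_hat: torch.Tensor,
                        phi_hrf: Callable[[torch.Tensor], List[torch.Tensor]]) -> torch.Tensor:
    """Eq. (1): inter-layer mean of intra-layer means of squared feature differences."""
    feats_x = phi_hrf(x)
    feats_x_hat = phi_hrf(x_hat)
    # intra-layer means of the element-wise squared differences ...
    per_layer = [((fx - fy) ** 2).mean() for fx, fy in zip(feats_x, feats_x_hat)]
    # ... followed by the inter-layer mean
    return torch.stack(per_layer).mean()
```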

Pretext problem  A pretext problem on which the base network for a perceptual loss was trained is important. For example, using a segmentation model as a backbone for perceptual loss may help to focus on high-level information, e.g. objects and their parts. On the contrary, classification models are known to focus more on textures [10], which can introduce biases harmful for high-level information.

2.2.2 Adversarial loss

We use an adversarial loss to ensure that the inpainting model $f_{\theta}(x^{\prime})$ generates naturally looking local details. We define a discriminator $D_{\xi}(\cdot)$ that works on a local patch-level [19], discriminating between “real” and “fake” patches. Only patches that intersect with the masked area get the “fake” label. Due to the supervised HRF perceptual loss, the generator quickly learns to copy the known parts of the input image, thus we label the known parts of generated images as “real”. Finally, we use the non-saturating adversarial loss:

\displaystyle\begin{split}\mathcal{L}_{\text{{D}}}\!=\!-\!\mathbb{E}_{x}\Big{[}\log{\!D_{\xi}(x)}\Big{]}\!-\!\mathbb{E}_{x,m}\Big{[}\log{D_{\xi}(\hat{x})}\odot m\Big{]}\\ \!-\!\mathbb{E}_{x,m}\Big{[}\log{(1\!-\!D_{\xi}(\hat{x}))}\odot(1-m)\Big{]}\\ \end{split} (2)
\displaystyle\mathcal{L}_{\text{{G}}}=-\mathbb{E}_{x,m}\Big{[}\log{D_{\xi}(\hat{x})}\Big{]} (3)
\displaystyle L_{\textit{Adv}}=\texttt{sg}_{\theta}(\mathcal{L}_{\text{{D}}})+\texttt{sg}_{\xi}(\mathcal{L}_{\text{{G}}})\to\min_{\theta,\xi} (4)

where $x$ is a sample from a dataset, $m$ is a synthetically generated mask, $\hat{x}=f_{\theta}(x^{\prime})$ is the inpainting result for $x^{\prime}=\texttt{stack}(x\odot m,m)$, $\texttt{sg}_{\textit{var}}$ stops gradients w.r.t. $\textit{var}$, and $L_{\textit{Adv}}$ is the joint loss to optimise.
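A hedged sketch of Eqs. (2)-(4) with logits-based non-saturating losses. Here `mask == 1` marks the hole, the discriminator is assumed to produce per-patch logits, the mask is resized to the logits' resolution, and the mask/label convention follows the textual description above (hole patches pushed toward "fake", known-area patches toward "real"); details of the released training code may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, x, x_hat, mask):
    """Illustrative non-saturating patch adversarial losses with mask-aware labels."""
    logits_real = discriminator(x)
    logits_fake = discriminator(x_hat.detach())     # no generator gradients for the D loss
    m = F.interpolate(mask, size=logits_fake.shape[-2:], mode="nearest")

    # Eq. (2): real images are "real"; generated patches inside the hole are "fake",
    # generated patches outside the hole are treated as "real".
    d_loss = (F.softplus(-logits_real).mean()
              + (F.softplus(logits_fake) * m).mean()
              + (F.softplus(-logits_fake) * (1.0 - m)).mean())

    # Eq. (3): non-saturating generator loss
    g_loss = F.softplus(-discriminator(x_hat)).mean()

    # Eq. (4): the stop-gradients sg(.) are realized in practice by the .detach() above
    # and by updating the generator and the discriminator with separate optimizer steps.
    return d_loss, g_loss
```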

| Method | # Params ×10⁶ | Places (512×512), Narrow masks: FID↓ / LPIPS↓ | Wide masks: FID↓ / LPIPS↓ | Segm. masks: FID↓ / LPIPS↓ | CelebA-HQ (256×256), Narrow masks: FID↓ / LPIPS↓ | Wide masks: FID↓ / LPIPS↓ |
| LaMa-Fourier (ours) | 27 | 0.63 / 0.090 | 2.21 / 0.135 | 5.35 / 0.058 | 7.26 / 0.085 | 6.96 / 0.098 |
| CoModGAN [64] | 109 | 0.82 ▲30% / 0.111 ▲23% | 1.82 ▼18% / 0.147 ▲9% | 6.40 ▲20% / 0.066 ▲14% | 16.8 ▲131% / 0.079 ▼7% | 24.4 ▲250% / 0.102 ▲4% |
| MADF [67] | 85 | 0.57 ▼10% / 0.085 ▼5% | 3.76 ▲70% / 0.139 ▲3% | 6.51 ▲22% / 0.061 ▲5% | – | – |
| AOT GAN [60] | 15 | 0.79 ▲25% / 0.091 ▲1% | 5.94 ▲169% / 0.149 ▲11% | 7.34 ▲37% / 0.063 ▲10% | 6.67 ▼8% / 0.081 ▼4% | 10.3 ▲48% / 0.118 ▲20% |
| GCPR [17] | 30 | 2.93 ▲363% / 0.143 ▲59% | 6.54 ▲196% / 0.161 ▲19% | 9.20 ▲72% / 0.073 ▲27% | – | – |
| HiFill [54] | 3 | 9.24 ▲1361% / 0.218 ▲142% | 12.8 ▲479% / 0.180 ▲34% | 12.7 ▲137% / 0.085 ▲49% | – | – |
| RegionWise [30] | 47 | 0.90 ▲42% / 0.102 ▲14% | 4.75 ▲115% / 0.149 ▲11% | 7.58 ▲42% / 0.066 ▲14% | 11.1 ▲53% / 0.124 ▲46% | 8.54 ▲23% / 0.121 ▲23% |
| DeepFill v2 [57] | 4 | 1.06 ▲68% / 0.104 ▲16% | 5.20 ▲135% / 0.155 ▲15% | 9.17 ▲71% / 0.068 ▲18% | 12.5 ▲73% / 0.130 ▲53% | 11.2 ▲61% / 0.126 ▲28% |
| EdgeConnect [32] | 22 | 1.33 ▲110% / 0.111 ▲23% | 8.37 ▲279% / 0.160 ▲19% | 9.44 ▲76% / 0.073 ▲27% | 9.61 ▲32% / 0.099 ▲17% | 9.02 ▲30% / 0.120 ▲22% |
| RegionNorm [58] | 12 | 2.13 ▲236% / 0.120 ▲33% | 15.7 ▲613% / 0.176 ▲31% | 13.7 ▲156% / 0.082 ▲42% | – | – |

Table 1: Quantitative evaluation of inpainting on Places and CelebA-HQ datasets. We report Learned perceptual image patch similarity (LPIPS) and Fréchet inception distance (FID) metrics. The ▲ denotes deterioration, and ▼ denotes improvement of a score compared to our LaMa-Fourier model (presented in the first row); a dash means that no result is listed for that setting. The metrics are reported for different policies of test mask generation, i.e. narrow, wide, and segmentation masks. LaMa-Fourier consistently outperforms a wide range of the baselines. CoModGAN [64] and MADF [67] are the only two baselines that come close. However, both models are much heavier than LaMa-Fourier and perform worse on average, showing that our method utilizes trainable parameters more efficiently.

2.2.3 The final loss function

In the final loss we also use the $R_{1}=E_{x}||\nabla D_{\xi}(x)||^{2}$ gradient penalty [31, 38, 7], and a discriminator-based perceptual loss, the so-called feature matching loss, i.e. a perceptual loss on the features of the discriminator network, $\mathcal{L}_{\textit{DiscPL}}$ [45]. $\mathcal{L}_{\textit{DiscPL}}$ is known to stabilize training, and in some cases slightly improves the performance.

The final loss function for our inpainting system

\displaystyle\mathcal{L_{\textit{final}}}=\kappa L_{\textit{Adv}}+\alpha\mathcal{L}_{\text{\it HRFPL}}+\beta\mathcal{L}_{\text{\it DiscPL}}+\gamma R_{1} (5)

is the weighted sum of the discussed losses, where $L_{\textit{Adv}}$ and $\mathcal{L}_{\textit{DiscPL}}$ are responsible for the generation of naturally looking local details, while $\mathcal{L}_{\textit{HRFPL}}$ is responsible for the supervised signal and the consistency of the global structure.
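For completeness, a small sketch of the R1 penalty and of the weighted sum in Eq. (5); the weight values are the ones reported in Sec. 3, and the individual loss terms are assumed to be computed elsewhere.

```python
import torch

def r1_penalty(discriminator, x_real: torch.Tensor) -> torch.Tensor:
    """R1 = E_x ||grad D(x)||^2, computed on real images (a standard formulation)."""
    x_real = x_real.detach().requires_grad_(True)
    logits = discriminator(x_real)
    grad, = torch.autograd.grad(outputs=logits.sum(), inputs=x_real, create_graph=True)
    return grad.pow(2).flatten(1).sum(dim=1).mean()

def final_loss(l_adv, l_hrf_pl, l_disc_pl, r1,
               kappa=10.0, alpha=30.0, beta=100.0, gamma=0.001):
    """Eq. (5): weighted sum of the adversarial, HRF perceptual, feature matching, and R1 terms."""
    return kappa * l_adv + alpha * l_hrf_pl + beta * l_disc_pl + gamma * r1
```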

2.3 Generation of masks during training

Refer to caption
Figure 3: Samples from different training mask generation policies. We argue that the way masks are generated greatly influences the final performance of the system. Unlike the conventional practice, e.g. DeepFillv2, we use a more aggressive large mask generation strategy, where masks come uniformly either from the wide mask or the box mask strategy. The masks from the large mask strategy have a large area and, more importantly, are wider (see supplementary material for histograms). Training with our strategy helps a model to perform better on both wide and narrow masks (Table 4). During preparation of the test datasets, we avoid masks that cover more than 50% of an image.

The last component of our system is a mask generation policy. Each training example $x^{\prime}$ is a real photograph from a training dataset superimposed with a synthetically generated mask. Similar to discriminative models, where data augmentation has a high influence on the final performance, we find that the policy of mask generation noticeably influences the performance of the inpainting system.

We thus opted for an aggressive large mask generation strategy. This strategy uniformly uses samples from polygonal chains dilated by a high random width (wide masks) and rectangles of arbitrary aspect ratios (box masks). The examples of our masks are demonstrated in Figure 3.

We tested large mask training against narrow mask training for several methods, and found that training with the large mask strategy generally improves performance on both narrow and wide masks (Table 4). This suggests that increasing the diversity of the masks might be beneficial for various inpainting systems. The sampling algorithm is provided in the supplementary material.
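An illustrative sampler in the spirit of this policy is sketched below; the exact ranges and probabilities are placeholders, not the paper's settings, which are given in the supplementary material.

```python
import numpy as np
import cv2

def sample_large_mask(h: int, w: int, rng=np.random) -> np.ndarray:
    """With equal probability draw either a thick polygonal chain ("wide" mask)
    or a random box. All ranges below are illustrative placeholders."""
    mask = np.zeros((h, w), np.float32)
    if rng.rand() < 0.5:
        # wide mask: a polygonal chain dilated by a large random width
        x, y = int(rng.randint(w)), int(rng.randint(h))
        for _ in range(rng.randint(1, 6)):
            angle = rng.uniform(0, 2 * np.pi)
            length = rng.randint(20, max(21, w // 2))
            nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (nx, ny), 1.0, thickness=int(rng.randint(10, 50)))
            x, y = nx, ny
    else:
        # box mask: a rectangle of arbitrary aspect ratio
        bw, bh = rng.randint(w // 8, w // 2), rng.randint(h // 8, h // 2)
        x0, y0 = rng.randint(w - bw), rng.randint(h - bh)
        mask[y0:y0 + bh, x0:x0 + bw] = 1.0
    return mask  # 1 marks pixels to be inpainted
```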

3 Experiments

In this section we demonstrate that the proposed technique outperforms a range of strong baselines at standard low resolutions, and the difference is even more pronounced when inpainting wider holes. We then conduct an ablation study, showing the importance of FFC, the high receptive field perceptual loss, and large masks. The model, surprisingly, can generalise to high, never-seen resolutions, while having significantly fewer parameters than the most competitive baselines.

Implementation details For the LaMa inpainting network we use a ResNet-like [14] architecture with 3 downsampling blocks, 6-18 residual blocks, and 3 upsampling blocks. In our model, the residual blocks use FFC. Further details on the discriminator architecture are provided in the supplementary material. We use the Adam [23] optimizer, with the fixed learning rates 0.001 and 0.0001 for the inpainting and discriminator networks, respectively. All models are trained for 1M iterations with a batch size of 30 unless otherwise stated. In all experiments, we select hyperparameters using the coordinate-wise beam-search strategy. That scheme led to the weight values $\kappa=10$, $\alpha=30$, $\beta=100$, $\gamma=0.001$. We use these hyperparameters for the training of all models, except those described in the loss ablation study (shown in Sec. 3.2). In all cases, the hyperparameter search is performed on a separate validation subset. More information about dataset splits is provided in supplementary material.
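A rough skeleton of the described generator layout (3 downsampling blocks, a stack of residual blocks, 3 upsampling blocks); plain convolutional residual blocks are used here for brevity, whereas LaMa uses FFC-based ones, and the widths and normalization choices are our assumptions.

```python
import torch.nn as nn

def build_lama_like_generator(n_res_blocks: int = 9, base_width: int = 64) -> nn.Sequential:
    """Rough skeleton only; not the released LaMa architecture."""

    def down(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                             nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def up(cin, cout):
        return nn.Sequential(nn.ConvTranspose2d(cin, cout, 3, stride=2,
                                                padding=1, output_padding=1),
                             nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    class ResBlock(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))

        def forward(self, x):
            return x + self.body(x)

    w = base_width
    layers = [nn.Conv2d(4, w, 7, padding=3)]                    # 4-channel input: masked image + mask
    for i in range(3):                                          # 3 downsampling blocks
        layers.append(down(w * 2 ** i, w * 2 ** (i + 1)))
    layers += [ResBlock(w * 8) for _ in range(n_res_blocks)]    # residual blocks (FFC-based in LaMa)
    for i in range(3, 0, -1):                                   # 3 upsampling blocks
        layers.append(up(w * 2 ** i, w * 2 ** (i - 1)))
    layers.append(nn.Conv2d(w, 3, 7, padding=3))                # RGB output
    return nn.Sequential(*layers)

# Optimizers as described above (fixed learning rates):
#   gen_opt  = torch.optim.Adam(generator.parameters(), lr=1e-3)
#   disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
```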

Data and metrics  We use Places [66] and CelebA-HQ [21] datasets. We follow the established practice in recent image2image literature and use Learned Perceptual Image Patch Similarity (LPIPS) [63] and Fréchet inception distance (FID) [15] metrics. Compared to pixel-level L1 and L2 distances, LPIPS and FID are more suitable for measuring performance of large masks inpainting when multiple natural completions are plausible. The experimentation pipeline is implemented using PyTorch [33], PyTorch-Lightning [9], and Hydra [49]. The code and the models are publicly available at github.com/saic-mdal/lama.
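The two metrics can be computed with common third-party packages, e.g. as in the sketch below; the `lpips` and `torchmetrics` packages are not part of the LaMa code base, and the paper's exact evaluation protocol may differ.

```python
import lpips                                                   # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance    # pip install torchmetrics[image]

lpips_fn = lpips.LPIPS(net="alex")            # expects images scaled to [-1, 1]
fid = FrechetInceptionDistance(feature=2048)  # expects uint8 images in [0, 255]

def update_metrics(real_uint8, fake_uint8, real_norm, fake_norm):
    """Accumulate FID statistics and return the LPIPS for the current batch."""
    fid.update(real_uint8, real=True)
    fid.update(fake_uint8, real=False)
    return lpips_fn(real_norm, fake_norm).mean()

# after looping over the whole test set:
# fid_score = fid.compute()
```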

3.1 Comparisons to the baselines

We compare the proposed approach with a number of strong baselines that are presented in Table 1. Only publicly available pretrained models are used to calculate these metrics. For each dataset, we validate the performance across narrow, wide, and segmentation-based masks. LaMa-Fourier consistently outperforms most of the baselines, while having fewer parameters than the strongest competitors. The only two competitive baselines, CoModGAN [64] and MADF [67], use $\approx 4\times$ and $\approx 3\times$ more parameters. The difference is especially noticeable for wide masks.

User study  To alleviate a possible bias of the selected metrics, we have conducted a crowdsourced user study. The results of the user study correlate well with the quantitative evaluation and demonstrate that the inpainting produced by our method is more preferable and less detectable compared to other methods. The protocol and the results of the user study are provided in the supplementary material.

Refer to caption
Figure 4: The side-by-side comparison of various inpainting systems on 512×512 images. Repetitive structures, such as windows and chain-link fences, are known to be hard to inpaint. FFCs allow generating these types of structures significantly better. Interestingly, LaMa-Fourier performs the best even with fewer parameters across the comparison, while offering feasible inference time, i.e. LaMa-Fourier on average is only ∼20% slower than LaMa-Regular.

3.2 Ablation Study

The goal of the study is to carefully examine the influence of different components of the method. In this section, we present results on Places dataset; the additional results for CelebA dataset are available in supplementary material.

| Model | Convs | # Params ×10⁶ | # Blocks | Narrow masks: FID↓ / LPIPS↓ | Wide masks: FID↓ / LPIPS↓ |
| Base | Fourier | 27 | 9 | 0.63 / 0.090 | 2.21 / 0.135 |
| Base | Dilated | 46 | 9 | 0.66 ▲4% / 0.089 ▼1% | 2.30 ▲4% / 0.136 ▲1% |
| Base | Regular | 46 | 9 | 0.60 ▼5% / 0.089 ▼1% | 3.51 ▲59% / 0.139 ▲3% |
| Shallow | Fourier | 19 | 6 | 0.72 ▲13% / 0.094 ▲4% | 2.31 ▲5% / 0.138 ▲2% |
| Deep | Regular | 74 | 15 | 0.63 / 0.090 | 2.62 ▲18% / 0.137 ▲2% |

Table 2: The table demonstrates performance of different LaMa architectures while leaving the other components the same. The ▲ denotes deterioration, and ▼ denotes improvement compared to the Base-Fourier model (presented in the first row). The FFC-based models may sacrifice a little performance on narrow masks, but significantly outperform bigger models with regular convolutions on wide masks. Visually, the FFC-based models recover complex visual structures significantly better, as shown in Figure 4.

Receptive field of $f_{\theta}(\cdot)$  FFCs increase the effective receptive field of our system. Adding FFCs substantially improves FID scores of inpainting in wide masks (Table 2).

The importance of the receptive field is most noticeable when a model is applied at a higher resolution than it was trained on. As demonstrated in Figure 5, the model with regular convolutions produces visible artifacts as the resolution increases beyond those used at train time. The same effect is validated quantitatively (Figure 6). FFCs also significantly improve the generation of repetitive structures such as windows (Figure 4). Interestingly, LaMa-Fourier is only 20% slower, while being 40% smaller than LaMa-Regular.

Dilated convolutions [55, 3] are an alternative option that allows the fast growth of a receptive field. Similar to FFCs, dilated convolutions boost the performance of our inpainting system. This further supports our hypothesis on the importance of the fast growth of the effective receptive field for image inpainting. However, dilated convolutions have a more restrictive receptive field and rely heavily on scale, leading to inferior generalization to higher resolutions (Figure 6). Dilated convolutions are widely implemented in most frameworks and may serve as a practical replacement for the Fourier ones when resources are limited, e.g. on mobile devices. We provide more details on the LaMa-Dilated architecture in the supplementary material.
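For reference, a dilated residual block of the kind such an alternative could be built from; the dilation rate and layer layout are illustrative, and the actual LaMa-Dilated architecture is described in the supplementary material.

```python
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with dilated 3x3 convolutions (illustrative dilation rate)."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)
```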

Loss  We verify that the high receptive field of the perceptual loss—implemented with Dilated convolutions—indeed improves the quality of inpainting (Table 3). The pretext problem and the design choice beyond using dilation layers also prove to be important. For each loss variant, we performed a weight coefficient search to ensure a fair evaluation.

| Loss | Pretext problem | Dilation | Segm. masks: FID↓ / LPIPS↓ |
| $\mathcal{L}_{\text{HRFPL}}$ | RN50 Segm. | + | 5.69 / 0.059 |
| $\mathcal{L}_{\text{HRFPL}}$ | RN50 Clf. | + | 5.87 ▲3% / 0.059 |
| $\mathcal{L}_{\text{ClfPL}}$ | RN50 Clf. | – | 6.00 ▲5% / 0.061 ▲3% |
| $\mathcal{L}_{\text{ClfPL}}$ | VGG19 Clf. | – | 6.29 ▲11% / 0.063 ▲6% |
| no perceptual loss | – | – | 6.46 ▲13% / 0.065 ▲9% |

Table 3: Comparison of LaMa-Regular trained with different perceptual losses. The ▲ denotes deterioration, and ▼ denotes improvement of a score compared to the model trained with the HRF perceptual loss based on a segmentation ResNet50 with dilated convolutions (presented in the first row). Both dilated convolutions and the pretext problem improved the scores.

Masks generation  Wider training masks improve inpainting of both wide and narrow holes for LaMa (ours) and RegionWise [30] (Table 4). However, wider masks may make results worse, which is the case for DeepFill v2 [57] and EdgeConnect [32] on narrow masks. We hypothesize that this difference is caused by specific design choices (e.g. high receptive field of a generator or loss functions) that make a method more or less suitable for inpainting of both narrow and wide masks at the same time.

Refer to caption
Figure 5: Transfer of inpainting models to a higher resolution. All LaMa models were trained using 256×256 crops from 512×512 images, while MADF [67] was trained on 512×512 directly. As the resolution increases, the models with regular convolutions swiftly start to produce critical artifacts, while FFC-based models continue to generate semantically consistent images with fine details. More negative and positive examples of our 51M model can be found at bit.ly/3k0gaIK.
| Training masks | Method | Narrow masks: FID↓ / LPIPS↓ | Wide masks: FID↓ / LPIPS↓ |
| Narrow | LaMa-Regular | 0.68 / 0.091 | 5.41 / 0.144 |
| Narrow | DeepFill v2 | 1.06 / 0.104 | 5.20 / 0.155 |
| Narrow | EdgeConnect | 1.33 / 0.111 | 8.37 / 0.160 |
| Narrow | RegionWise | 0.90 / 0.102 | 4.75 / 0.149 |
| Wide | LaMa-Regular | 0.60 ▼12% / 0.089 ▼2% | 3.51 ▼54% / 0.139 ▼4% |
| Wide | DeepFill v2 | 1.35 ▲21% / 0.107 ▲3% | 4.34 ▼20% / 0.148 ▼4% |
| Wide | EdgeConnect | 2.78 ▲52% / 0.141 ▲27% | 7.94 ▼5% / 0.160 |
| Wide | RegionWise | 0.74 ▼21% / 0.095 ▼7% | 3.56 ▼33% / 0.144 ▼3% |

Table 4: The table shows performance metrics for the training of different inpainting methods with either narrow or wide masks. The ▲ denotes deterioration, and ▼ denotes improvement of a score induced by wide-mask training for the corresponding method. LaMa and RegionWise inpainting clearly benefit from training with wide masks. This is empirical evidence that aggressive mask generation may be beneficial for other inpainting systems.

3.3 Generalization to higher resolution

Training directly at high resolution is slow and computationally expensive. Still, most real-world image editing scenarios require inpainting to work at high resolution. So, we evaluate our models, which were trained using 256×256 crops from 512×512 images, on much larger images. We apply the models in a fully convolutional fashion, i.e. an image is processed in a single pass, not patch-wise.

FFC-based models transfer to higher resolutions significantly better (Figure 6). We hypothesize that FFCs are more robust across different scales due to i) the image-wide receptive field, ii) the preservation of the low frequencies of the spectrum after a scale change, and iii) the inherent scale equivariance of $1\times 1$ convolutions in the frequency domain. While all models generalize reasonably well to the 512×512 resolution, the FFC-enabled models preserve much more quality and consistency at the 1536×1536 resolution, compared to all other models (Figure 5). It is worth noting that they achieve this quality at a significantly lower parameter cost than the competitive baselines.
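A sketch of how such single-pass, fully convolutional inference can be run at an arbitrary resolution: pad the input to a multiple of an assumed total downsampling factor of 8 (three stride-2 blocks), run the model once, then crop and composite the result back into the known pixels. These details are ours, not necessarily the released inference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inpaint_full_image(generator, image, mask, pad_multiple: int = 8):
    """Single-pass inference at an arbitrary resolution (no patch-wise processing)."""
    _, _, h, w = image.shape
    ph = (pad_multiple - h % pad_multiple) % pad_multiple
    pw = (pad_multiple - w % pad_multiple) % pad_multiple
    image_p = F.pad(image, (0, pw, 0, ph), mode="reflect")
    mask_p = F.pad(mask, (0, pw, 0, ph), mode="replicate")
    x_in = torch.cat([image_p * (1.0 - mask_p), mask_p], dim=1)
    out = generator(x_in)[..., :h, :w]
    # composite: keep the known pixels, fill only the hole
    return image * (1.0 - mask) + out * mask
```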

Refer to caption
Refer to caption
Figure 6: The FFC-based inpainting models can transfer to higher resolutions, which are never seen in training, with significantly smaller quality degradation. All LaMa models are trained at 256×256 resolution. Big LaMa-Fourier, our best model, is provided for reference, as it was trained in different conditions (Sec. 3.4).

3.4 Teaser model: Big LaMa

To verify the scalability and applicability of our approach to real high-resolution images, we trained a large inpainting Big LaMa model with more resources.

Big LaMa-Fourier differs from LaMa-Fourier in three aspects: the depth of the generator, the training dataset, and the size of the batch. It has 18 residual blocks, all based on FFC, resulting in 51M parameters. The model was trained on a subset of 4.5M images from the Places-Challenge dataset [66]. Just as our standard base model, Big LaMa was trained only on low-resolution 256×256 crops of approximately 512×512 images. Big LaMa uses a larger batch size of 120 (instead of 30 for our other models). Although we consider this model relatively large, it is still smaller than some of the baselines. It was trained on eight NVidia V100 GPUs for approximately 240 hours. Inpainting examples of the Big LaMa model are presented in Figures 1 and 5.

4 Related Work

Early data-driven approaches to image inpainting relied on patch-based [5] and nearest neighbor-based [13] generation. One of the first inpainting works in deep learning era [34] used a convnet with an encoder-decoder architecture trained in an adversarial way [11]. This approach remains commonly used for deep inpainting to date. Another popular group of choices for the completion network is architectures based on U-Net [37], such as [26, 50, 59, 27].

One common concern is the ability of the network to grasp the local and global context. Towards this end, [18] proposed to incorporate dilated convolutions [55] to expand receptive field; besides, two discriminators were supposed to encourage global and local consistency separately. In [46], the use of branches in the completion network with varying receptive fields was suggested. To borrow information from spatially distant patches, [56] proposed the contextual attention layer. Alternative attention mechanisms were suggested in [28, 47, 65]. Our study confirms the importance of the efficient propagation of information between distant locations. One variant of our approach relies heavily on dilated convolutional blocks, inspired by [41]. As an even better alternative, we propose a mechanism based on transformations in the frequency domain (FFC) [4]. This also aligns with a recent trend on using Transformers in computer vision [6, 8] and treating Fourier transform as a lightweight replacement to the self-attention [24, 35].

At a more global level, [56] introduced a coarse-to-fine framework that involves two networks. In their approach, the first network completes coarse global structure in the holes, while the second network then uses it as a guidance to refine local details. Such two-stage approaches that follow a relatively old idea of structure-texture decomposition [1] became prevalent in the subsequent works. Some studies [40, 42] modify the framework so that coarse and fine result components are obtained simultaneously rather than sequentially. Several works suggest two-stage methods that use completion of other structure types as an intermediate step: salient edges in [32], semantic segmentation maps in [44], foreground object contours in [48], gradient maps in [52], and edge-preserved smooth images in [36]. Another trend is progressive approaches [62, 12, 25, 61]. In contrast to all these works, we demonstrate that a meticulously designed single-stage approach can achieve very strong results.

To deal with irregular masks, several works modified convolutional layers, introducing partial [26], gated [57], light-weight gated [54] and region-wise [30] convolutions. Various shapes of training masks were explored, including random [18], free-form [57] and object-shaped masks [54, 61]. We found that as long as contours of training masks are diverse enough, the exact way of mask generation is not as important as the width of the masks.

Many losses were proposed to train inpainting networks. Typically, pixel-wise (e.g. $\ell 1$, $\ell 2$) and adversarial losses are used. Some approaches apply spatially discounted weighting strategies for a pixel-wise loss [34, 53, 56]. Simple convolutional discriminators [34, 52] or PatchGAN discriminators [18, 59, 36, 28] were used to implement adversarial losses. Other popular choices are Wasserstein adversarial losses with gradient-penalized discriminators [56, 54] and spectral-normalized discriminators [32, 57, 27, 61]. Following previous works [31, 22], we use an R1-gradient-penalized patch discriminator in our system. A perceptual loss is also commonly applied, usually with VGG-16 [26, 47, 25, 27] or VGG-19 [51, 43, 32, 52] backbones pretrained on ImageNet classification [39]. In contrast to those works, we have found that such perceptual losses are suboptimal for image inpainting and proposed a better alternative. Inpainting frameworks often incorporate style [26, 30, 47, 32, 25] and feature matching [51, 44, 32, 16] losses. The latter is also employed in our system.

5 Discussion

In this study, we have investigated the use of a simple, single-stage approach for large-mask inpainting. We have shown that such an approach is very competitive and can push the state of the art in image inpainting, given the appropriate choices of the architecture, the loss function, and the mask generation strategy. The proposed method is arguably good at generating repetitive visual structures (Figures 1, 4), which appears to be an issue for many inpainting methods. However, LaMa usually struggles when a strong perspective distortion gets involved (see supplementary material). We would like to note that this is usually the case for complex images from the Internet that do not belong to a dataset. It remains a question whether FFCs can account for these deformations of periodic signals. Interestingly, FFCs allow the method to generalize to never-seen high resolutions, and to be more parameter-efficient compared to state-of-the-art baselines. Fourier or dilated convolutions are not the only options to obtain a high receptive field. For instance, a high receptive field can be obtained with a vision transformer [6], which is also an exciting topic for future research. We believe that models with a large receptive field will open new opportunities for the development of efficient high-resolution computer vision models.

Acknowledgements  We want to thank Nikita Dvornik, Gleb Sterkin, Aibek Alanov, Anna Vorontsova, Alexander Grishin, and Julia Churkina for their valuable feedback.

Supplementary material  For more details and visual samples, please refer to the project page https://saic-mdal.github.io/lama-project/ or supplementary material https://bit.ly/3zhv2rD.

References

  • [1] Marcelo Bertalmío, Luminita A. Vese, Guillermo Sapiro, and Stanley J. Osher. Simultaneous structure and texture image inpainting. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), 16-22 June 2003, Madison, WI, USA, pages 707–712. IEEE Computer Society, 2003.
  • [2] E Oran Brigham and RE Morrow. The fast fourier transform. IEEE spectrum, 4(12):63–70, 1967.
  • [3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Yoshua Bengio and Yann LeCun, editors, Proc. ICLR, 2015.
  • [4] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4479–4488. Curran Associates, Inc., 2020.
  • [5] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Object removal by exemplar-based inpainting. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), 16-22 June 2003, Madison, WI, USA, pages 721–728. IEEE Computer Society, 2003.
  • [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [7] H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.
  • [8] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
  • [9] W. A. Falcon et al. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning, 2019.
  • [10] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
  • [11] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
  • [12] Zongyu Guo, Zhibo Chen, Tao Yu, Jiale Chen, and Sen Liu. Progressive image inpainting with full-resolution residual network. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2496–2504, 2019.
  • [13] James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Trans. Graph., 26(3):4, 2007.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
  • [16] Zheng Hui, Jie Li, Xiumei Wang, and Xinbo Gao. Image fine-grained inpainting. arXiv preprint arXiv:2002.02609, 2020.
  • [17] Håkon Hukkelås, Frank Lindseth, and Rudolf Mester. Image inpainting with learnable feature imputation. In Pattern Recognition: 42nd DAGM German Conference, DAGM GCPR 2020, Tübingen, Germany, September 28–October 1, 2020, Proceedings 42, pages 388–403. Springer, 2021.
  • [18] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14, 2017.
  • [19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
  • [20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
  • [25] Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, and Dacheng Tao. Recurrent feature reasoning for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7760–7768, 2020.
  • [26] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 85–100, 2018.
  • [27] Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, and Chao Yang. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. arXiv preprint arXiv:2007.06929, 2020.
  • [28] Hongyu Liu, Bin Jiang, Yi Xiao, and Chao Yang. Coherent semantic attention for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4170–4179, 2019.
  • [29] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [30] Yuqing Ma, Xianglong Liu, Shihao Bai, Lei Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial image inpainting for large missing areas. arXiv preprint arXiv:1909.12507, 2019.
  • [31] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), 2018.
  • [32] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
  • [33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [34] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [35] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. arXiv preprint arXiv:2107.00645, 2021.
  • [36] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H Li, Shan Liu, and Ge Li. Structureflow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 181–190, 2019.
  • [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [38] Andrew Slavin Ros and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1660–1669, 2018.
  • [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [40] Min-cheol Sagong, Yong-goo Shin, Seung-wook Kim, Seung Park, and Sung-jea Ko. Pepsi: Fast image inpainting with parallel decoding network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11360–11368, 2019.
  • [41] René Schuster, Oliver Wasenmuller, Christian Unger, and Didier Stricker. Sdc-stacked dilated convolution: A unified descriptor network for dense matching tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2556–2565, 2019.
  • [42] Yong-Goo Shin, Min-Cheol Sagong, Yoon-Jae Yeo, Seung-Wook Kim, and Sung-Jea Ko. Pepsi++: Fast and lightweight network for image inpainting. IEEE transactions on neural networks and learning systems, 32(1):252–265, 2020.
  • [43] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C-C Jay Kuo. Contextual-based image inpainting: Infer, match, and translate. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [44] Yuhang Song, Chao Yang, Yeji Shen, Peng Wang, Qin Huang, and C-C Jay Kuo. Spg-net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356, 2018.
  • [45] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [46] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks. arXiv preprint arXiv:1810.08771, 2018.
  • [47] Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, Wangmeng Zuo, Xiao Liu, Shilei Wen, and Errui Ding. Image inpainting with learnable bidirectional attention maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8858–8867, 2019.
  • [48] Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, and Jiebo Luo. Foreground-aware image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5840–5848, 2019.
  • [49] Omry Yadan. Hydra - a framework for elegantly configuring complex applications. Github, 2019.
  • [50] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European conference on computer vision (ECCV), pages 1–17, 2018.
  • [51] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6721–6729, 2017.
  • [52] Jie Yang, Zhiquan Qi, and Yong Shi. Learning to incorporate structure knowledge for image inpainting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12605–12612, 2020.
  • [53] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5485–5493, 2017.
  • [54] Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, and Zhan Xu. Contextual residual aggregation for ultra high-resolution image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7508–7517, 2020.
  • [55] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [56] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.
  • [57] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4471–4480, 2019.
  • [58] Tao Yu, Zongyu Guo, Xin Jin, Shilin Wu, Zhibo Chen, Weiping Li, Zhizheng Zhang, and Sen Liu. Region normalization for image inpainting. In AAAI, pages 12733–12740, 2020.
  • [59] Yanhong Zeng, Jianlong Fu, Hongyang Chao, and Baining Guo. Learning pyramid-context encoder network for high-quality image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1486–1494, 2019.
  • [60] Yanhong Zeng, Jianlong Fu, Hongyang Chao, and Baining Guo. Aggregated contextual transformations for high-resolution image inpainting. arXiv preprint, 2020.
  • [61] Yu Zeng, Zhe Lin, Jimei Yang, Jianming Zhang, Eli Shechtman, and Huchuan Lu. High-resolution image inpainting with iterative confidence feedback and guided upsampling. In European Conference on Computer Vision, pages 1–17. Springer, 2020.
  • [62] Haoran Zhang, Zhenzhen Hu, Changzhi Luo, Wangmeng Zuo, and Meng Wang. Semantic image inpainting with progressive generative networks. In Proceedings of the 26th ACM international conference on Multimedia, pages 1939–1947, 2018.
  • [63] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [64] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations (ICLR), 2021.
  • [65] Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. Pluralistic image completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1438–1447, 2019.
  • [66] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
  • [67] Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021.