
License: arXiv.org perpetual non-exclusive license
arXiv:2402.17525v2 [cs.CV] 16 Mar 2024

Diffusion Model-Based Image Editing: A Survey

Yi Huang*, Jiancheng Huang*, Yifan Liu*, Mingfu Yan*, Jiaxi Lv*, Jianzhuang Liu*,
Wei Xiong, He Zhang, Shifeng Chen, and Liangliang Cao
Y. Huang, J. Huang, M. Yan, and J. Lv are with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, and also with University of Chinese Academy of Sciences, Beijing, China. (E-mail: yi.huang@siat.ac.cn) Y. Liu is with Southern University of Science and Technology, Shenzhen, China, and also with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. J. Liu and S. Chen are with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. (E-mail: jz.liu@siat.ac.cn, shifeng.chen@siat.ac.cn) W. Xiong and H. Zhang are with Adobe Inc, San Jose, USA. L. Cao is with Apple Inc, Cupertino, USA. * denotes equal contributions. S. Chen is the corresponding author.
Abstract

Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve into a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, and explore both earlier traditional context-driven and current multimodal conditional methods, offering a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we address current limitations and envision some potential directions for future research. The accompanying repository is released at https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.

Index Terms:
Diffusion Model, Image Editing, AIGC
Figure 1: Statistical overview of research publications in diffusion model-based image editing. Top: learning strategies. Middle: input conditions. Bottom: editing tasks.

1 Introduction

In the realm of AI-generated content (AIGC), where artificial intelligence is leveraged to create and modify digital content, image editing is recognized as a significant area of innovation and practical applications. Different from image generation that creates new images from minimal inputs, image editing involves altering the appearance, structure, or content of an image, encompassing a range of changes from subtle adjustments to major transformations. This research is fundamental in various areas, including digital media, advertising, and scientific research, where altering visual content is essential. The evolution of image editing has reflected the advancements in digital technology, progressing from manual, labor-intensive processes to advanced digital techniques driven by learning-based algorithms. A pivotal advancement in this evolution is the introduction of Generative Adversarial Networks (GANs) [1, 2, 3, 4, 5, 6], which significantly enhance the possibilities for creative image manipulation. More recently, diffusion models have emerged in AIGC [7, 8, 9, 10, 11, 12, 13, 14, 1, 15], leading to notable breakthroughs in visual generation tasks. Diffusion models, inspired by principles from non-equilibrium thermodynamics [15], work by gradually adding noise to data and then learning to reverse this process from random noise until generating desired data matching the source data distribution. They can be roughly classified into denoising diffusion based [15, 16, 17, 18] and score-matching based [19, 20, 21, 22, 23]. Their adaptability and effectiveness have led to widespread applications across various tasks such as image generation [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], video generation [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56], image restoration [57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], and image editing.

The application of diffusion models in image editing has experienced a surge in interest, evidenced by the significant number of research publications in this field over recent years. This growing attention highlights the potential and versatility of diffusion models in improving image editing performance compared to previous works. Given this significant advancement, it is essential to systematically review and summarize these contributions. However, existing survey literature on diffusion models concentrates on other specific visual tasks [72, 73, 74, 75], such as video applications [73] or image restoration and enhancement [74, 75]. Some surveys that mention image editing tend to provide only a cursory overview [76, 77, 78, 79, 80, 81, 82, 83], missing a detailed and focused exploration of the methods. To bridge this gap, we conduct a survey to offer an in-depth and comprehensive analysis that is distinctly focused on image editing. We delve deeply into the methodologies, input conditions, and a wide range of editing tasks achieved by diffusion models in this field. The survey critically reviews over 100 research papers, organizing them into three primary classes in terms of learning strategies: training-based approaches, testing-time finetuning approaches, and training and finetuning free approaches. Each class is further divided based on its core techniques, with detailed discussions presented in Sections 45, and 6, respectively. We also explore 10 distinct types of input conditions used in these methods, including text, mask, reference (Ref.) image, class, layout, pose, sketch, segmentation (Seg.) map, audio, and dragging points to show the adaptability of diffusion models in diverse image editing scenarios. Furthermore, our survey presents a new classification of image editing tasks into three broad categories, semantic editing, stylistic editing, and structural editing, covering 12 specific types. Fig. 1 visually represents the statistical distributions of research across learning strategies, input conditions, and editing task categories. In addition, we pay special attention to inpainting and outpainting, which together stand out as a unique type of editing. We explore both earlier traditional and current multimodal conditional methods, offering a comprehensive analysis of their methodologies in Section 7. We also introduce EditEval, a benchmark designed to evaluate text-guided image editing algorithms, as detailed in Section 8. In particular, an effective evaluation metric, LMM Score, is proposed by leveraging the advanced visual-language understanding capabilities of large multimodal models (LMMs). Finally, we present some current challenges and potential future trends as an outlook in Section 9.

In summary, this survey aims to systematically categorize and critically assess the extensive body of research in diffusion model-based image editing. Our goal is to provide a comprehensive resource that not only synthesizes current findings but also guides future research directions in this rapidly advancing field.

2 Background

2.1 Diffusion Models

Diffusion models have exerted a profound influence on the field of generative AI, giving rise to a plethora of approaches falling under their umbrella. Essentially, these models are grounded in a pivotal principle known as diffusion, which gradually adds noise to data samples of some distribution, transforming them into a predefined, typically simple, distribution such as a Gaussian; this process is then reversed iteratively to generate data matching the original distribution. What distinguishes diffusion models from earlier generative models is their dynamic execution across iterative time steps, covering both forward and backward movements in time.

For each time step $t$, the noisy latent $\mathbf{z}_t$ delineates the current state. The time step $t$ increments progressively during the forward diffusion process and decreases towards 0 during the reverse diffusion process. Notably, the literature lacks a distinct differentiation between $\mathbf{z}_t$ in forward diffusion and $\mathbf{z}_t$ in reverse diffusion. In the context of forward diffusion, let $\mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{z}_{t-1})$, and in reverse diffusion, let $\mathbf{z}_{t-1} \sim p(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$. Herein, we denote $T$ with $0 < t \leq T$ as the maximal time step for finite cases. The initial data distribution at $t=0$ is represented by $\mathbf{z}_0 \sim q(\mathbf{z}_0)$, slowly contaminated by additive noise to $\mathbf{z}_0$. Diffusion models gradually eliminate noise via a parameterized model $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ in the reverse time direction. This model approximates the ideal, albeit unattainable, denoised distribution $p(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$. Denoising Diffusion Probabilistic Models (DDPMs), as introduced in Ho et al. [16], effectively utilize Markov chains to facilitate both the forward and backward processes over a finite series of time steps.

Forward Diffusion Process. This process serves to transform the data distribution into a predefined distribution, such as the Gaussian distribution. The transformation is represented as:

$$q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\left(\mathbf{z}_t \mid \sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\ \beta_t\mathbf{I}\right), \qquad (1)$$

where the set of hyper-parameters $0 < \beta_{1:T} < 1$ is indicative of the noise variance introduced at each sequential step. This diffusion process can be briefly expressed via a single-step equation:

$$q(\mathbf{z}_t \mid \mathbf{z}_0) = \mathcal{N}\left(\mathbf{z}_t \mid \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \qquad (2)$$

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, as elaborated in Sohl-Dickstein et al. [15]. As a result, bypassing the need to consider intermediate time steps, $\mathbf{z}_t$ can be directly sampled by:

$$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \qquad (3)$$

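To make the forward process concrete, below is a minimal PyTorch sketch of Eq. 3: a closed-form jump from $\mathbf{z}_0$ to $\mathbf{z}_t$. It is an illustration rather than code from any surveyed method; the linear $\beta$ schedule and all hyper-parameter values are assumptions.

```python
import torch

# Assumed linear noise schedule; T and the beta range are illustrative choices.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_1, ..., beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # bar(alpha)_t = prod_{i<=t} alpha_i

def forward_diffuse(z0: torch.Tensor, t: torch.Tensor):
    """Sample z_t ~ q(z_t | z_0) in one shot via Eq. 3 (t is 0-indexed here)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    eps = torch.randn_like(z0)               # eps ~ N(0, I)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return z_t, eps

# Usage: corrupt a toy batch of 4 "images" at random time steps.
z0 = torch.randn(4, 3, 64, 64)
t = torch.randint(0, T, (4,))
z_t, eps = forward_diffuse(z0, t)
```
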
Reverse Diffusion Process. The primary objective here is to learn the reverse of the forward diffusion process, with the aim of generating a distribution that closely aligns with the original unaltered data samples $\mathbf{z}_0$. In the context of image editing, $\mathbf{z}_0$ represents the edited image. Practically, this is achieved using a UNet architecture to learn a parameterized version of $p$. Given that the forward diffusion process is approximated by $q(\mathbf{z}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$, the formulation of the learnable transition is expressed as:

$$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \mathcal{N}\left(\mathbf{z}_{t-1} \mid \mu_\theta(\mathbf{z}_t, \bar{\alpha}_t),\ \Sigma_\theta(\mathbf{z}_t, \bar{\alpha}_t)\right), \qquad (4)$$

where the functions $\mu_\theta$ and $\Sigma_\theta$ carry the learnable parameters $\theta$. In addition, for the conditional formulation $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{c})$, which is conditioned on an external variable $\mathbf{c}$ (in image editing, $\mathbf{c}$ can be the source image), the model becomes $\mu_\theta(\mathbf{z}_t, \mathbf{c}, \bar{\alpha}_t)$ and $\Sigma_\theta(\mathbf{z}_t, \mathbf{c}, \bar{\alpha}_t)$.

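As a concrete illustration of the reverse transition in Eq. 4, the sketch below performs one denoising step using the common $\epsilon$-parameterization from Ho et al. [16], in which the mean $\mu_\theta$ is computed from the predicted noise and the variance is fixed to $\beta_t\mathbf{I}$. The network `eps_model` is an assumed trained noise predictor, not a specific model from the literature.

```python
import torch

@torch.no_grad()
def reverse_step(eps_model, z_t, t, betas, alphas, alpha_bars):
    """One stochastic step z_t -> z_{t-1} of p_theta(z_{t-1} | z_t), eps-parameterized."""
    t_batch = torch.full((z_t.shape[0],), t, device=z_t.device)
    eps = eps_model(z_t, t_batch)                      # predicted noise eps_theta(z_t, t)
    # Posterior mean derived from the noise prediction (Ho et al. [16]).
    mean = (z_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                    # no noise is added at the final step
    sigma = betas[t].sqrt()                            # fixed choice Sigma_theta = beta_t * I
    return mean + sigma * torch.randn_like(z_t)
```
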
Optimization. The optimization strategy for guiding the reverse diffusion in learning the forward process involves minimizing the Kullback-Leibler (KL) divergence between the joint distributions of the forward and reverse sequences. These are mathematically defined as:

$$p_\theta(\mathbf{z}_0, \ldots, \mathbf{z}_T) = p(\mathbf{z}_T) \prod_{t=1}^{T} p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t), \qquad (5)$$

$$q(\mathbf{z}_0, \ldots, \mathbf{z}_T) = q(\mathbf{z}_0) \prod_{t=1}^{T} q(\mathbf{z}_t \mid \mathbf{z}_{t-1}), \qquad (6)$$

leading to the minimization of:

$$\mathrm{KL}\left(q(\mathbf{z}_0, \ldots, \mathbf{z}_T) \,\|\, p_\theta(\mathbf{z}_0, \ldots, \mathbf{z}_T)\right) \geq \mathbb{E}_{q(\mathbf{z}_0)}\left[-\log p_\theta(\mathbf{z}_0)\right] + c, \qquad (7)$$

which is detailed in Ho et al. [16]; the constant $c$ is irrelevant for optimizing $\theta$. The KL divergence in Eq. 7 represents the variational bound of the log-likelihood of the data ($\log p_\theta(\mathbf{z}_0)$). This KL divergence serves as the loss and is minimized in DDPMs. Practically, Ho et al. [16] adopt a reweighted version of this loss as a simpler denoising loss:

$$\mathbb{E}_{t \sim \mathcal{U}(1,T),\ \mathbf{z}_0 \sim q(\mathbf{z}_0),\ \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})}\left[\lambda(t)\,\|\epsilon - \epsilon_\theta(\mathbf{z}_t, t)\|^2\right], \qquad (8)$$

where $\lambda(t) > 0$ denotes a weighting function, $\mathbf{z}_t$ is obtained using Eq. 3, and $\epsilon_\theta$ represents a network designed to predict the noise $\epsilon$ based on $\mathbf{z}_t$ and $t$.

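For reference, the simplified objective of Eq. 8 with $\lambda(t) = 1$ reduces to the short training-loss sketch below, reusing the schedule and notation from the earlier snippets; `eps_model` again stands for an assumed noise-prediction network such as a UNet.

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_model, z0, alpha_bars, T):
    """Simplified denoising objective (Eq. 8 with lambda(t) = 1)."""
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)  # t ~ U(1, T), 0-indexed here
    eps = torch.randn_like(z0)                                  # eps ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps        # forward jump of Eq. 3
    return F.mse_loss(eps_model(z_t, t), eps)                   # ||eps - eps_theta(z_t, t)||^2
```
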
DDIM Sampling and Inversion. When working with a real image $\mathbf{z}_0$, prevalent editing methods [84, 85] initially invert this $\mathbf{z}_0$ into a corresponding $\mathbf{z}_T$ utilizing a specific inversion scheme. Subsequently, the sampling begins from this $\mathbf{z}_T$, employing some editing strategy to produce an edited outcome $\tilde{\mathbf{z}}_0$. In an ideal scenario, direct sampling from $\mathbf{z}_T$, without any edits, should yield a $\tilde{\mathbf{z}}_0$ that closely resembles $\mathbf{z}_0$. A significant deviation of $\tilde{\mathbf{z}}_0$ from $\mathbf{z}_0$, termed reconstruction failure, indicates the inability of the edited image to maintain the integrity of unaltered regions in $\mathbf{z}_0$. Therefore, it is crucial to use an inversion method that ensures $\tilde{\mathbf{z}}_0 \approx \mathbf{z}_0$.

The DDIM sampling equation [18] is:

$$\mathbf{z}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_t, t), \qquad (9)$$

which is alternatively expressed as:

$$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\frac{\mathbf{z}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_t, t)}{\sqrt{\bar{\alpha}_{t-1}}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, t). \qquad (10)$$

Although Eq. 10 appears to provide an ideal inversion from $\mathbf{z}_{t-1}$ to $\mathbf{z}_t$, the problem arises from the unknown nature of $\mathbf{z}_t$, which is also used as an input to the network $\epsilon_\theta(\mathbf{z}_t, t)$. To address this, DDIM Inversion [18] operates under the assumption that $\mathbf{z}_{t-1} \approx \mathbf{z}_t$ and replaces $\mathbf{z}_t$ on the right-hand side of Eq. 10 with $\mathbf{z}_{t-1}$, leading to the following approximation:

$$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\frac{\mathbf{z}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_{t-1}, t)}{\sqrt{\bar{\alpha}_{t-1}}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_{t-1}, t). \qquad (11)$$

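In code, the deterministic DDIM step of Eq. 9 and the approximate inversion step of Eq. 11 differ only in which latent is fed to the noise predictor. The sketch below is a hedged illustration under the same assumptions as the earlier snippets; the editing methods cited above build additional machinery on top of these two steps.

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, z_t, t, t_prev, alpha_bars):
    """Deterministic DDIM sampling step z_t -> z_{t_prev} (Eq. 9)."""
    eps = eps_model(z_t, t)
    z0_pred = (z_t - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
    return alpha_bars[t_prev].sqrt() * z0_pred + (1 - alpha_bars[t_prev]).sqrt() * eps

@torch.no_grad()
def ddim_inversion_step(eps_model, z_prev, t_prev, t, alpha_bars):
    """Approximate inversion z_{t-1} -> z_t (Eq. 11), assuming z_{t-1} is close to z_t."""
    eps = eps_model(z_prev, t)                 # z_{t-1} stands in for the unknown z_t
    z0_pred = (z_prev - (1 - alpha_bars[t_prev]).sqrt() * eps) / alpha_bars[t_prev].sqrt()
    return alpha_bars[t].sqrt() * z0_pred + (1 - alpha_bars[t]).sqrt() * eps
```

Chaining `ddim_inversion_step` from small to large $t$ produces the $\mathbf{z}_T$ from which editing methods start sampling; chaining `ddim_step` in the opposite direction reconstructs (or edits) the image.
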
Text Condition and Classifier-Free Guidance. Text-conditional diffusion models are designed to synthesize outcomes from random noise $\mathbf{z}_T$ guided by a text prompt $P$. During inference in the sampling process, the noise estimation network $\epsilon_\theta(\mathbf{z}_t, t, C)$ is utilized to predict the noise $\epsilon$, where $C = \psi(P)$ represents the text embedding. This process methodically removes noise from $\mathbf{z}_t$ across $T$ steps until the final result $\mathbf{z}_0$ is obtained.

In the realm of text-conditional image generation, it is vital to ensure substantial textual influence and control over the generated output. To this end, Ho et al. [86] introduce the concept of classifier-free guidance, a technique that amalgamates conditional and unconditional predictions. More specifically, let $\varnothing = \psi(``")$ denote the null text embedding, i.e., the embedding of the empty prompt (the placeholder $\varnothing$ is typically used for negative prompts to prevent certain attributes from manifesting in the generated image). When combined with a guidance scale $w$, the classifier-free guidance prediction is formalized as:

$$\epsilon_\theta(\mathbf{z}_t, t, C, \varnothing) = w\,\epsilon_\theta(\mathbf{z}_t, t, C) + (1-w)\,\epsilon_\theta(\mathbf{z}_t, t, \varnothing). \qquad (12)$$

In this formulation, $\epsilon_\theta(\mathbf{z}_t, t, C, \varnothing)$ replaces $\epsilon_\theta(\mathbf{z}_t, t)$ in the sampling Eq. 9. The value of $w$, typically ranging in $[1, 7.5]$ as suggested in [26, 27], dictates the extent of textual control. A higher $w$ value correlates with a stronger text-driven influence in the generation process.

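Concretely, Eq. 12 amounts to running the noise predictor twice per step and blending the two outputs with the guidance scale $w$. The sketch below assumes a text-conditional predictor `eps_model(z_t, t, emb)` together with precomputed prompt and null-prompt embeddings; it is not tied to any particular model.

```python
import torch

@torch.no_grad()
def cfg_noise(eps_model, z_t, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance (Eq. 12): blend conditional and unconditional predictions."""
    eps_cond = eps_model(z_t, t, text_emb)    # eps_theta(z_t, t, C)
    eps_uncond = eps_model(z_t, t, null_emb)  # eps_theta(z_t, t, null embedding)
    return w * eps_cond + (1.0 - w) * eps_uncond
```

Since $w > 1$ weights the unconditional term negatively, the combination extrapolates away from the unconditional prediction, which is what strengthens the influence of the text prompt.
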
2.2 Related Tasks

2.2.1 Conditional Image Generation

While our primary focus is on diffusion models in image editing, it is important to acknowledge related areas like conditional image generation. Unlike image editing, which involves altering certain parts of an existing image, conditional image generation is about creating new images from scratch, guided by specified conditions. Early efforts [32, 31, 87, 88, 89, 90] typically involve class-conditioned image generation, which usually incorporates class-induced gradients via an additional pretrained classifier during sampling. However, Ho et al. [86] introduce classifier-free guidance, which does not rely on an external classifier and allows for more versatile conditions, e.g., text, as guidance.

Text-to-Image (T2I) Generation. GLIDE [34] is the first work that uses text to guide image generation directly at the high-dimensional pixel level, replacing the label in class-conditioned diffusion models. Similarly, Imagen [27] uses a cascaded framework to generate high-resolution images more efficiently in pixel space. A different line of research first projects the image into a lower-dimensional space and then applies diffusion models in this latent space. Representative works include Stable Diffusion (SD) [26], VQ-diffusion [91], and DALL-E 2 [25]. Following these pioneering studies, a large number of works [37, 92, 93, 94, 95, 96, 97] have been proposed, advancing this field over the past two years.

Additional Conditions. Beyond text, more specific conditions are also used to achieve higher fidelity and more precise control in image synthesis. GLIGEN [98] inserts a gated self-attention layer between the original self-attention and cross-attention layers in each block of a pretrained T2I diffusion model for generating images conditioned on grounding boxes. Make-A-Scene [99] and SpaText [100] use segmentation masks to guide the generation process. In addition to segmentation maps, ControlNet [101] can also incorporate other types of input, such as depth maps, normal maps, canny edges, pose, and sketches as conditions. Other methods like UniControlNet [102], UniControl [103], Composer [104], and T2I-Adapter [105] integrate diverse conditional inputs and include additional layers, enhancing the generative process controlled by these conditions.

Customized Image Generation. Closely related to image editing within conditional image generation is the task of creating personalized images. This task focuses on generating images that maintain a certain identity, typically guided by a few reference images of the same subject. Two early approaches addressing this customized generation through few-shot images are Textual Inversion [106] and DreamBooth [107]. Specifically, Textual Inversion learns a unique identifier word to represent a new subject and incorporates this word into the text encoder's dictionary. DreamBooth, on the other hand, binds a new, rare word with a specific subject by finetuning the entire Imagen [27] model with a few reference images. To combine multiple new concepts effectively, CustomDiffusion [108] only optimizes the cross-attention parameters in Stable Diffusion [26], representing new concepts and conducting joint training for the multiple-concept combination. Following these foundational works, subsequent methods [109, 110, 111, 112, 113, 114, 115, 116, 117] provide more refined control over the generated images, enhancing the precision and accuracy of the output.

2.2.2 Image Restoration and Enhancement

Image restoration (IR) stands as a pivotal task in low-level vision, aiming to enhance the quality of images contaminated by diverse degradations. Recent progress in diffusion models prompts researchers to investigate their potential for image restoration. Pioneering efforts integrate diffusion models into this task, surpassing previous GAN-based methods.

Input Image as a Condition. Generative models have significantly contributed to diverse image restoration tasks, such as super-resolution (SR) and deblurring [118, 29, 12, 13, 119]. Super-Resolution via Repeated Refinement (SR3) [57] utilizes DDPM for conditional image generation through a stochastic, iterative denoising process. Cascaded Diffusion Model [31] adopts multiple diffusion models in sequence, each generating images of increasing resolution. SRDiff [118] follows a close realization of SR3. The main distinction between SRDiff and SR3 is that SR3 predicts the target image directly, whereas SRDiff predicts the difference between the input and output images.

Restoration in Non-Spatial Spaces. Some diffusion model-based IR methods operate in other spaces. For example, Refusion [120, 63] employs a mean-reverting Image Restoration (IR)-SDE to transform target images into their degraded counterparts. They leverage an autoencoder to compress the input image into its latent representation, with skip connections for accessing multi-scale details. Chen et al. [121] employ a similar approach by proposing a two-stage strategy called Hierarchical Integration Diffusion Model. The conversion from the spatial to the wavelet domain is lossless and offers significant advantages. For instance, WaveDM [67] modifies the low-frequency band, whereas WSGM [122] or ResDiff [60] conditions the high-frequency bands relative to the low-resolution image. BDCE [123] designs a bootstrap diffusion model in the depth curve space for high-resolution low-light image enhancement.

T2I Prior Usage. The incorporation of T2I information proves advantageous as it allows the usage of pretrained T2I models. These models can be finetuned by adding specific layers or encoders tailored to the IR task. Wang et al. put this concept into practice with StableSR [124]. Central to StableSR is a time-aware encoder trained in tandem with a frozen Stable Diffusion model [26]. This setup seamlessly integrates trainable spatial feature transform layers, enabling conditioning based on the input image. DiffBIR [125] uses pretrained T2I diffusion models for blind image restoration, with a two-stage pipeline and a controllable module. CoSeR [126] introduces Cognitive Super-Resolution, merging image appearance and language understanding. SUPIR [127] leverages generative priors, model scaling, and a dataset of 20 million images for advanced restoration guided by textual prompts, featuring negative-quality prompts and a restoration-guided sampling method.

Projection-Based Methods. These methods aim to extract inherent structures or textures from input images to complement the generated images at each step and to ensure data consistency. ILVR [65], which projects the low-frequency information from the input image to the output image, ensures data consistency and establishes an improved condition. To address this and enhance data consistency, some recent works [71, 70, 128] take a different approach by aiming to estimate the posterior distribution using Bayes' theorem.

Decomposition-Based Methods. These methods view IR tasks as linear inverse problems. Denoising Diffusion Restoration Models (DDRM) [66] utilizes a pretrained denoising diffusion generative model to address linear inverse problems, showcasing versatility across super-resolution, deblurring, inpainting, and colorization under varying levels of measurement noise. Denoising Diffusion Null-space Model (DDNM) [68] represents another decomposition-based zero-shot approach applicable to a broad range of linear IR problems beyond image SR, such as colorization, inpainting, and deblurring. It leverages the range-null space decomposition methodology [129, 130] to tackle diverse IR challenges effectively.

TABLE I: Comprehensive categorization of diffusion model-based image editing methods from multiple perspectives. The methods are color-rendered according to training, testing-time finetuning, and training & finetuning free. Input conditions include text, class, reference (Ref.) image, segmentation (Seg.) map, pose, mask, layout, sketch, dragging points, and audio. Task capabilities—semantic, stylistic, and structural—are marked with checkmarks (✓) according to the experimental results provided in the source papers.
Method | Condition(s)
DiffusionCLIP [131] | Text, Class
Asyrp [132] | Text, Class
EffDiff [133] | Class
DiffStyler [134] | Text, Class
StyleDiffusion [135] | Ref. Image, Class
UNIT-DDPM [136] | Class
CycleNet [137] | Class
Diffusion Autoencoders [138] | Class
HDAE [139] | Class
EGSDE [140] | Class
Pixel-Guided Diffusion [141] | Seg. Map, Class
PbE [142] | Ref. Image
RIC [143] | Ref. Image, Sketch
ObjectStitch [144] | Ref. Image
PhD [145] | Ref. Image, Layout
DreamInpainter [146] | Ref. Image, Mask, Text
Anydoor [147] | Ref. Image, Mask
FADING [148] | Text
PAIR Diffusion [149] | Ref. Image, Text
SmartBrush [150] | Text, Mask
IIR-Net [151] | Text
PowerPaint [152] | Text, Mask
Imagen Editor [153] | Text, Mask
SmartMask [154] | Text
Uni-paint [155] | Text, Mask, Ref. Image
InstructPix2Pix [156] | Text
MoEController [157] | Text
FoI [158] | Text
LOFIE [159] | Text
InstructDiffusion [160] | Text
Emu Edit [161] | Text
DialogPaint [162] | Text
Inst-Inpaint [163] | Text
HIVE [164] | Text
ImageBrush [165] | Ref. Image, Seg. Map, Pose
InstructAny2Pix [166] | Ref. Image, Audio, Text
MGIE [167] | Text
SmartEdit [168] | Text
iEdit [169] | Text
TDIELR [170] | Text
ChatFace [171] | Text
UniTune [172] | Text
Custom-Edit [173] | Text, Ref. Image
KV-Inversion [174] | Text
Null-Text Inversion [175] | Text
DPL [176] | Text
DiffusionDisentanglement [177] | Text
Prompt Tuning Inversion [178] | Text
StyleDiffusion [179] | Text
InST [180] | Text
DragonDiffusion [181] | Dragging Points
DragDiffusion [182] | Dragging Points
DDS [183] | Text
DiffuseIT [184] | Text
CDS [185] | Text
MagicRemover [186] | Text
Imagic [187] | Text
LayerDiffusion [188] | Mask, Text
Forgedit [189] | Text
SINE [190] | Text
PRedItOR [191] | Text
ReDiffuser [192] | Text
Captioning and Injection [193] | Text
InstructEdit [194] | Text
Direct Inversion [195] | Text
DDPM Inversion [196] | Text
SDE-Drag [197] | Layout
LEDITS++ [198] | Text
FEC [199] | Text
EMILIE [200] | Text
Negative Inversion [201] | Text
ProxEdit [202] | Text
Null-Text Guidance [203] | Ref. Image, Text
EDICT [204] | Text
AIDI [205] | Text
CycleDiffusion [206] | Text
InjectFusion [207] | Ref. Image
Fixed-point inversion [208] | Text
TIC [209] | Text, Mask
Diffusion Brush [210] | Text, Mask
Self-guidance [211] | Layout
P2P [212] | Text
Pix2Pix-Zero [213] | Text
MasaCtrl [85] | Text, Pose, Sketch
PnP [214] | Text
TF-ICON [215] | Text, Ref. Image, Mask
Object-Shape Variations [216] | Text
Conditional Score Guidance [217] | Text
EBMs [218] | Text
Shape-Guided Diffusion [219] | Text, Mask
HD-Painter [220] | Text, Mask
FISEdit [221] | Text
Blended Latent Diffusion [222] | Text, Mask
PFB-Diff [223] | Text, Mask
DiffEdit [224] | Text
RDM [225] | Text
MFL [226] | Text
Differential Diffusion [227] | Text, Mask
Watch Your Steps [228] | Text
Blended Diffusion [229] | Text, Mask
ZONE [230] | Text, Mask
Inpaint Anything [231] | Text, Mask
The Stable Artist [232] | Text
SEGA [233] | Text
LEDITS [234] | Text
OIR-Diffusion [235] | Text

3 Categorization of Image Editing

Training-Based Approaches
  • Domain-Specific Editing with Weak Supervision: DiffusionCLIP [131], Asyrp [132], EffDiff [133], DiffStyler [134], StyleDiffusion [135], UNIT-DDPM [136], CycleNet [137], Diffusion Autoencoders [138], HDAE [139], EGSDE [140], Pixel-Guided Diffusion [141]
  • Reference and Attribute Guidance via Self-Supervision: PbE [142], RIC [143], ObjectStitch [144], PhD [145], DreamInpainter [146], Anydoor [147], FADING [148], PAIR Diffusion [149], SmartBrush [150], IIR-Net [151], PowerPaint [152], Imagen Editor [153], SmartMask [154], Uni-paint [155]
  • Instructional Editing via Full Supervision: InstructPix2Pix [156], MoEController [157], FoI [158], LOFIE [159], InstructDiffusion [160], Emu Edit [161], DialogPaint [162], Inst-Inpaint [163], HIVE [164], ImageBrush [165], InstructAny2Pix [166], MGIE [167], SmartEdit [168]
  • Pseudo-Target Retrieval with Weak Supervision: iEdit [169], TDIELR [170], ChatFace [171]

Figure 2: Taxonomy of training-based approaches for image editing.

In addition to the significant advancements that diffusion models have achieved in image generation, restoration, and enhancement, they have also made notable breakthroughs in image editing by offering enhanced controllability compared with the previously dominant GANs. Unlike image generation, which focuses on creating new images from scratch, and image restoration and enhancement, which aim at repairing and improving the quality of degraded images, image editing involves modifying existing images in terms of appearance, structure, or content, including tasks like adding objects, replacing backgrounds, and altering textures.

In this survey, we organize image editing papers into three principal groups based on their learning strategies: training-based approaches, testing-time finetuning approaches, and training and finetuning free approaches, which are elaborated in Sections 4, 5, and 6, respectively. Additionally, we explore 10 types of input conditions employed by these methods to control the editing process, including text, mask, reference (Ref.) image, class, layout, pose, sketch, segmentation (Seg.) map, audio, and dragging points. Furthermore, we investigate the 12 most common editing types that can be accomplished by these methods, which are organized into three broad categories defined as follows.

  • Semantic Editing: This category encompasses alterations to the content and narrative of an image, affecting the depicted scene’s story, context, or thematic elements. Tasks within this category include object addition (Obj. Add.), object removal (Obj. Remo.), object replacement (Obj. Repl.), background change (Bg. Chg.), and emotional expression modification (Emo. Expr. Mod.).


  • Stylistic Editing: This category focuses on enhancing or transforming the visual style and aesthetic elements of an image without altering its narrative content. Tasks within this category include color change (Color Chg.), texture change (Text. Chg.), and overall style change (Style Chg.) encompassing both artistic and realistic styles.


  • Structural Editing: This category pertains to changes in the spatial arrangement, positioning, viewpoints, and characteristics of elements within an image, emphasizing the organization and presentation of objects within the scene. Tasks within this category include object movement (Obj. Move.), object size and shape change (Obj. Size. Chg.), object action and pose change (Obj. Act. Chg.), and perspective/viewpoint change (Persp./View. Chg.).



Table I comprehensively summarizes this multi-perspective categorization of the surveyed papers, providing a quick reference.

4 Training-Based Approaches

In the field of diffusion model-based image editing, training-based approaches have gained significant prominence. These approaches are not only notable for their stable training of diffusion models and their effective modeling of data distribution but also for their reliable performance in a variety of editing tasks. To thoroughly examine these methods, we categorize them into four main groups based on their application scopes, the conditions required for training, and the types of supervision, as shown in Fig. 2. Further, within each of these primary groups, we classify the approaches into distinct types based on their core editing methodologies. This classification illustrates the range of these methods, from targeted domain-specific applications to broader open-world uses.

4.1 Domain-Specific Editing with Weak Supervision

Over the past several years, Generative Adversarial Networks (GANs) have been widely explored in image editing for their ability to generate high-quality images. However, diffusion models, with their advanced image generation capabilities, have emerged as a new focus in this field. One challenge with diffusion models is their need for extensive computational resources when trained on large datasets. To address this, earlier studies train these models via weak supervision on smaller, specialized datasets. These datasets concentrate on specific domains, such as CelebA [236] and FFHQ [2] for human face manipulation, AFHQ [237] for animal face editing and translation, LSUN [238] for object modification, and WikiArt [239] for style transfer. To thoroughly understand these approaches, we organize them according to their types of weak supervision.

(a) Simplified training pipeline of DiffusionCLIP [131]

(b) Simplified training pipeline of Asyrp [132]

Figure 3: Comparison of the training pipelines between two representative CLIP guided methods: DiffusionCLIP [131] and Asyrp [132]. The sample images are from DiffusionCLIP [131] and Asyrp [132] on the AFHQ [237] dataset.

CLIP Guidance. Drawing inspiration from GAN-based methods [240, 241] that use CLIP [242] for guiding image editing with text, several studies incorporate CLIP into diffusion models. A key example is DiffusionCLIP [131], which allows for image manipulation in both trained and new domains using CLIP. Specifically, it first converts a real image into latent noise using DDIM inversion, and then finetunes the pretrained diffusion model during the reverse diffusion process to adjust the image’s attributes constrained by a CLIP loss between the source and target text prompts. Instead of finetuning on the whole diffusion model, Asyrp [132] focuses on a semantic latent space internally, termed h-space, where it defines an additional implicit function parameterized by a small neural network. Then it trains the network with the guidance of the CLIP loss while keeping the diffusion model frozen. A visual comparison of the simplified pipelines of DiffusionCLIP and Asyrp is shown in Fig. 3. To address the time-consuming problem of multi-step optimization in DiffusionCLIP, EffDiff [133] introduces a faster method with single-step training and efficient processing.
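As a rough illustration of this family of CLIP-guided objectives, the sketch below computes a directional CLIP loss that encourages the change from the source image to the edited image to align with the change from the source prompt to the target prompt. The clip_encode_image and clip_encode_text callables are placeholder encoders, not the exact interfaces of [131] or [132].

import torch
import torch.nn.functional as F

def clip_directional_loss(img_src, img_edit, txt_src, txt_tgt,
                          clip_encode_image, clip_encode_text):
    # Embed images and prompts with a (frozen) CLIP model.
    e_img_src = F.normalize(clip_encode_image(img_src), dim=-1)
    e_img_edit = F.normalize(clip_encode_image(img_edit), dim=-1)
    e_txt_src = F.normalize(clip_encode_text(txt_src), dim=-1)
    e_txt_tgt = F.normalize(clip_encode_text(txt_tgt), dim=-1)

    # Direction of the edit in image space and in text space.
    d_img = F.normalize(e_img_edit - e_img_src, dim=-1)
    d_txt = F.normalize(e_txt_tgt - e_txt_src, dim=-1)

    # Encourage the image-space change to align with the text-space change.
    return 1.0 - (d_img * d_txt).sum(dim=-1).mean()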

Beyond the face editing that the above methods primarily focus on, DiffStyler [134] and StyleDiffusion [135] target artistic style transfer. DiffStyler uses a CLIP instruction loss to align the target text descriptions and generated images, along with a CLIP aesthetic loss for visual quality enhancement. StyleDiffusion, on the other hand, introduces a CLIP-based style disentanglement loss to improve style-content harmonization.

Cycling Regularization. Since diffusion models are capable of domain translation, the cycling framework commonly used in methods like CycleGAN [5] has also been explored within them. For instance, UNIT-DDPM [136] defines a dual-domain Markov chain within diffusion models, using cycle consistency to regularize training for unpaired image-to-image translation. Similarly, CycleNet [137] adopts ControlNet [101] with pretrained Stable Diffusion [26] as the backbone for text conditioning. It also employs consistency regularization throughout the image translation cycle, which encompasses both a forward translation from the source domain to the target domain and a backward translation in the reverse direction.

Projection and Interpolation. Another technique frequently used in GANs [243, 244] involves projecting two real images into the GAN latent space and then interpolating between them for smooth image manipulation, which is also adopted in some diffusion models for image editing. For instance, Diffusion Autoencoders [138] introduces a semantic encoder to map an input image to a semantically meaningful embedding, which then serves as a condition of a diffusion model for reconstruction. After training the semantic encoder and the conditional diffusion model, any image can be projected into this semantic space for interpolation. However, HDAE [139] points out that this approach tends to miss rich low-level and mid-level features. It addresses this by enhancing the framework to hierarchically exploit the coarse-to-fine features of both the semantic encoder and the diffusion-based decoder, aiming for more comprehensive representations.
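To make the interpolation step concrete, the following sketch linearly interpolates two semantic embeddings produced by such an encoder and hands the result to a conditional sampler. The semantic_encoder and conditional_sample callables are placeholders, not the actual interfaces of [138] or [139].

import torch

@torch.no_grad()
def interpolate_and_decode(img_a, img_b, semantic_encoder, conditional_sample,
                           alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Project both real images into the learned semantic space.
    z_a = semantic_encoder(img_a)
    z_b = semantic_encoder(img_b)

    results = []
    for alpha in alphas:
        # Linear interpolation between the two semantic codes.
        z = (1.0 - alpha) * z_a + alpha * z_b
        # The diffusion-based decoder reconstructs an image conditioned on z.
        results.append(conditional_sample(z))
    return results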

Classifier Guidance. Some studies enhance image editing performance by introducing an additional pretrained classifier for guidance. EGSDE [140], for example, uses an energy function to guide the sampling for realistic unpaired image-to-image translation. This function consists of two log potential functions specified by a time-dependent domain-specific classifier and a low-pass filter, respectively. For fine-grained image editing, Pixel-Guided Diffusion [141] trains a pixel-level classifier to both estimate segmentation maps and guide the sampling with its gradient.
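For reference, classifier guidance in its standard form perturbs the predicted noise with the gradient of a separately trained, noise-aware classifier \(p_{\phi}(y \mid \mathbf{x}_t)\); the methods above build domain-specific or pixel-level variants of this idea. With guidance scale \(s\), the guided prediction is

\[
\hat{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t) - s\,\sqrt{1-\bar{\alpha}_t}\;\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid\mathbf{x}_t).
\]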

4.2 Reference and Attribute Guidance via Self-Supervision

This category of works extracts attributes or other information from single images to serve as conditions for training diffusion-based image editing models in a self-supervised manner. They can be classified into two types: reference-based image composition and attribute-controlled image editing.

Reference-Based Image Composition. To learn how to composite images, PbE [142] is trained in a self-supervised manner using the content inside the object’s bounding box of an image as the reference image, and the content outside this bounding box as the source image. To prevent the trivial copy-and-paste solution, it applies strong augmentations to the reference image, creates an arbitrarily shaped mask based on the bounding box, and employs the CLIP image encoder to compress information of the reference image as the condition of the diffusion model.
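A minimal sketch of this self-supervised pair construction is given below: crop the object region as the reference, mask it out of the source, and strongly augment the reference so the model cannot simply copy-and-paste. The tensor layout, the augment routine, and the rectangular mask (PbE derives an arbitrarily shaped mask from the box) are illustrative simplifications.

import torch

def build_training_triplet(image, bbox, augment):
    """image: (C, H, W) tensor; bbox: (x0, y0, x1, y1) of one object."""
    x0, y0, x1, y1 = bbox

    # The content inside the bounding box serves as the reference image.
    reference = image[:, y0:y1, x0:x1].clone()
    # Strong augmentation discourages a trivial copy-and-paste solution.
    reference = augment(reference)

    # A binary mask marks the region to be regenerated (rectangular here for brevity).
    mask = torch.zeros(1, image.shape[1], image.shape[2])
    mask[:, y0:y1, x0:x1] = 1.0

    # The source image has the object region removed; the full image is the target.
    source = image * (1.0 - mask)
    return source, reference, mask, image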

Building upon this, RIC [143] incorporates sketches of the masked areas as control conditions for training, allowing users to finely tune the effects of reference image synthesis through sketches. ObjectStitch [144] designs a Content Adaptor to better preserve the key identity information of the reference image. Meanwhile, PhD [145] trains an Inpainting and Harmonizing module on a frozen pretrained diffusion model for efficiently guiding the inpainting of masked areas. To preserve the low-level details of the reference image for inpainting, DreamInpainter [146] utilizes the downsampling network of U-Net to extract its features. During the training process, it adds noise to the entire image, requiring the diffusion model to learn how to restore the clear image under the guidance of a detailed text description. Furthermore, Anydoor [147] uses image pairs from video frames as training samples to enhance image composition quality, and introduces modules for capturing identity features, preserving textures, and learning appearance changes.

Attribute-Controlled Image Editing. This type of papers typically involves augmenting pretrained diffusion models with specific image features as control conditions to learn the generation of corresponding images. This approach allows for image editing by altering these specific control conditions. After training on age-text-face pairs, FADING [148] edits facial images through null-text inversion and attention control for age manipulation. PAIR Diffusion [149] perceives images as a collection of objects, learning to modulate each object’s properties, specifically structure and appearance. SmartBrush [150] employs masks of varying granularities as control conditions, enabling the diffusion model to inpaint masked regions according to the text and the shape of the mask. To better preserve image information that is irrelevant to the editing text, IIR-Net [151] performs color and texture erasure in the required areas. The resulting image, post-erasure, is then used as one of the control conditions for the diffusion model.

4.3 Instructional Editing via Full Supervision

Using instructions (e.g., “Remove the hat”) to drive the image editing process, as opposed to using a description of the edited image (e.g., “A puppy with a hat and a smile”), is more natural and intuitive and matches the user’s needs more accurately. InstructPix2Pix [156] is the first study that learns to edit images following human instructions. Subsequent works have improved upon it in terms of model architecture, dataset quality, multi-modality, and so on. Therefore, we first describe InstructPix2Pix, and then categorize and present the subsequent works based on their most prominent contributions. Correspondingly, a common framework of these instruction-based methods is depicted in Fig. 4.

InstructPix2Pix Framework. One of the major challenges in enabling diffusion models to edit images according to instructions is the construction of the instruction-image paired dataset. InstructPix2Pix generates these image pairs in two steps. First, given an image caption (e.g., “photograph of a girl riding a horse”), it uses a finetuned GPT-3 [245] to generate both an instruction (e.g., “have her ride a dragon.”) and an edited image caption (e.g., “photograph of a girl riding a dragon”). Second, it employs Stable Diffusion and the Prompt-to-Prompt algorithm [212] to generate edited images, collecting over 450,000 training image pairs. Then it trains an instructional image editing diffusion model in a fully supervised manner, taking into account the conditions of the input image and the instruction.
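The supervised objective used by this family of methods can be sketched as a standard noise-prediction loss in which the denoiser additionally receives the encoded input image and instruction as conditions. The unet call signature (image_cond/text_cond keyword arguments) and the encoder helpers below are assumptions for illustration, not the exact interface of any specific implementation.

import torch
import torch.nn.functional as F

def instruction_editing_loss(unet, vae_encode, encode_text,
                             src_img, edited_img, instruction,
                             alphas_cumprod):
    # Latents of the edited (target) image are the denoising target.
    z0 = vae_encode(edited_img)
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)

    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    # Conditions: latents of the source image and the encoded instruction.
    src_latents = vae_encode(src_img)
    text_emb = encode_text(instruction)

    pred = unet(zt, t, image_cond=src_latents, text_cond=text_emb)
    return F.mse_loss(pred, noise)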

Model Architecture Enhancement. MoEController [157] introduces a Mixture-of-Expert (MOE) architecture, which includes three specialized experts for fine-grained local translation, global style transfer, and complex local editing tasks. On the other hand, FoI [158] harnesses the implicit grounding capabilities of InstructPix2Pix to identify and focus on specific editing regions. It also employs cross-condition attention modulation to ensure each instruction targets its corresponding area, reducing interference between multiple instructions.

Data Quality Enhancement. LOFIE [159] enhances the quality of training datasets by leveraging recent advances in segmentation [246], Chain-of-Thought prompting [247], and visual question answering (VQA) [248]. MagicBrush [249] hires crowd workers from Amazon Mechanical Turk (AMT) to manually perform continuous edits using DALL-E 2 [25]. It comprises 5,313 edit sessions and 10,388 edit turns, thereby establishing a comprehensive benchmark for instructional image editing.

Figure 4: Common framework of instructional image editing methods. The sample images are from InstructPix2Pix [156], InstructAny2Pix [166] and MagicBrush [249].

InstructDiffusion [160] is a unified framework that treats a diverse range of vision tasks as human-intuitive image-manipulating processes, i.e., keypoint detection, segmentation, image enhancement, and image editing. For image editing, it not only utilizes existing datasets but also augments them with additional data generated using tools for object removal and replacement, and by collecting image editing pairs from real-world Photoshop requests on the internet. Emu Edit [161] is also trained on both image editing and recognition data, leveraging a dataset comprising 16 distinct tasks with 10 million examples. The dataset is created using Llama 2 [250] and in-context learning to generate diverse and creative editing instructions. During training, the model learns task embeddings in conjunction with its weights, enabling efficient adaptation to new tasks with only a few examples.

DialogPaint [162] aims to extract user’s editing intentions during a multi-turn dialogue process and edit images accordingly. It employs the self-instruct technique [251] on GPT-3 to create a multi-turn dialogue dataset and, in conjunction with four image editing models, generates an instructional image editing dataset. Additionally, the authors finetune the Blender dialogue model [252] to generate corresponding editing instructions based on the dialogue data, which then drives the trained instruction editing model to edit images.

Inst-Inpaint [163] allows users to specify objects to be removed from an image simply through text commands, without the need for binary masks. It constructs a dataset named GQA-Inpaint, based on the image and scene graph dataset GQA [253]. It first selects objects and their corresponding relationships from the scene graph, and then uses Detectron2 [254] and Detic [255] to extract the segmentation masks of these objects. Afterward, it employs CRFill [256] to generate inpainted target images. The editing instructions are generated through fixed templates. After constructing the dataset, Inst-Inpaint is trained to perform instructional image inpainting.

Human Feedback-Enhanced Learning. To improve the alignment between edited images and human instructions, HIVE [164] introduces Reinforcement Learning from Human Feedback (RLHF) in instructional image editing. After obtaining a base model following [156], a reward model is then trained on a human ranking dataset. The reward model’s estimation is integrated into the training process to finetune the diffusion model, aligning it with human feedback.

Visual Instruction. ImageBrush [165] is proposed to learn visual instruction from a pair of transformation images that illustrate the desired manipulation, and to apply this instruction to edit a new image. The method concatenates example images, the source image, and a blank image into a grid, using a diffusion model to iteratively denoise the blank image based on the contextual information provided by the example images. Additionally, a visual prompting encoder is proposed to extract features from the visual instructions to enhance the diffusion process.

Leveraging Multimodal Large-Scale Models. InstructAny2Pix [166] enables users to edit images through instructions that integrate audio, image, and text. It employs ImageBind [257], a multimodal encoder, to convert diverse inputs into a unified latent space representation. Vicuna-7b [258], a large language model (LLM), encodes the multimodal input sequence to predict two special tokens as the conditions to align the multimodal inputs with the editing results of the diffusion model.

MGIE [167] inputs the image and instruction, along with multiple [IMG] tokens, into the Multimodal Large Language Model (MLLM), LLaVA [259]. It then projects the hidden state of [IMG] in the penultimate layer into the cross attention layer of the UNet in Stable Diffusion. During the training process, the weights of LLaVA and Stable Diffusion are jointly optimized.

Similarly, SmartEdit [168] employs LLaVA and additionally introduces a Bidirectional Interaction Module (BIM) to perform image editing in complex scenarios. It first aligns the MLLM’s hidden states with the CLIP text encoder using QFormer [248], and then it facilitates the fusion between the image feature and the QFormer output through BIM. Besides, it leverages perception-related data, such as segmentation data, to strengthen the model’s understanding of spatial and conceptual attributes. Additionally, it incorporates synthetic editing data in complex understanding and reasoning scenarios to activate the MLLM’s reasoning ability.

4.4 Pseudo-Target Retrieval with Weak Supervision

Since obtaining edited images that accurately represent the ground truth is challenging, the methods in this category aim to retrieve pseudo-target images or directly use CLIP scores [242] as the objective to optimize model parameters. iEdit [169] trains the diffusion model with weak supervision, utilizing CLIP to retrieve and edit the dataset images most similar to the editing text, which then serve as pseudo-target images. Additionally, it incorporates masks obtained with CLIPSeg [260] into the image editing process to enable localized preservation. To effectively tackle region-based image editing, TDIELR [170] initially processes the input image using DINO [261] to generate attention maps and features for anchor initialization. It learns a region generation network (RGN) to select the most fitting region proposals. The chosen regions and text descriptions are then fed into a pretrained text-to-image model for editing. TDIELR employs CLIP to calculate scores assessing the similarity between the text descriptions and editing results, providing a training signal for RGN. Furthermore, ChatFace [171] also utilizes CLIP scores as a metric to learn how to edit real facial images.

5 Testing-Time Finetuning Approaches

In image generation and editing, testing-time finetuning represents a significant step towards precision and control. This section explores various finetuning strategies (Fig. 5) that enhance image editing capabilities. These approaches, as shown in Fig.  6, range from finetuning the entire denoising model to focusing on specific layers or embeddings. We investigate methods that finetune the entire model, target specific parameters, and optimize text-based embeddings. Additionally, we discuss the integration of hypernetworks and direct image representation optimization. These methods collectively demonstrate the evolving complexity and effectiveness of finetuning techniques in image editing, catering to a wide array of editing requirements and user intentions.

5.1 Denoising Model Finetuning

The denoising model is the central component in image generation and editing. Directly finetuning it is a simple and effective approach. Consequently, many editing methods are based on this finetuning process. Some of them involve finetuning the entire denoising model, while others focus on finetuning specific layers within the model.

Figure 5: Testing-time finetuning framework with different finetuning components. The sample images are from Custom-Edit [173].

Finetuning Entire Denoising Models. Finetuning the entire denoising model allows the model to better learn specific features of images and more accurately interpret textual prompts, resulting in edits that more closely align with user intent. UniTune [172] finetunes the diffusion model on a single base image during the tuning phase, encouraging the model to produce images similar to the base image. During the sampling phase, a modified sampling process is used to balance fidelity to the base image and alignment to the editing prompt. This includes starting sampling from a noisy version of the base image and applying classifier-free guidance during the sampling process. Custom-Edit [173] uses a small set of reference images to customize the diffusion model, enhancing the similarity of the edited image to the reference images while maintaining the similarity to the source image.
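The sampling-phase idea described above (start from a partially noised version of the base image and denoise with classifier-free guidance toward the editing prompt) can be sketched as follows. The denoise_step routine, the unet call signature, and the noising schedule are placeholders standing in for whatever sampler the underlying model uses.

import torch

@torch.no_grad()
def edit_from_noised_base(z_base, text_emb, null_emb, unet, denoise_step,
                          alphas_cumprod, t_start, guidance_scale=7.5):
    # Noise the base-image latents up to an intermediate timestep t_start,
    # so the low-frequency structure of the base image is retained.
    a_bar = alphas_cumprod[t_start]
    z = a_bar.sqrt() * z_base + (1.0 - a_bar).sqrt() * torch.randn_like(z_base)

    for t in range(t_start, -1, -1):
        # Classifier-free guidance: mix conditional and unconditional predictions.
        eps_cond = unet(z, t, text_emb)
        eps_uncond = unet(z, t, null_emb)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        z = denoise_step(z, eps, t)
    return z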

Testing-Time Finetuning Approaches
  • Denoising Model Finetuning: UniTune [172], Custom-Edit [173], KV-Inversion [174]
  • Embedding Finetuning: Null-Text Inversion [175], DPL [176], DiffusionDisentanglement [177], Prompt Tuning Inversion [178]
  • Guidance with a Hypernetwork: StyleDiffusion [179], InST [180]
  • Latent Variable Optimization: DragonDiffusion [181], DragDiffusion [182], DDS [183], DiffuseIT [184], CDS [185], MagicRemover [186]
  • Hybrid Finetuning: Imagic [187], LayerDiffusion [188], Forgedit [189], SINE [190]

Figure 6: Taxonomy of testing-time finetuning approaches for image editing.

Partial Parameter Finetuning in Denoising Models. Some methods focus on finetuning specific parts of the denoising model, such as the self-attention layers, cross-attention layers, encoder, or decoder. This type of finetuning is more precise, largely preserving the capabilities of the pretrained models and building upon them. KV Inversion [174], by learning keys (K) and values (V), designs an enhanced version of self-attention, termed Content-Preserving Self-Attention (CP-attn). This effectively addresses the issue of action editing on real images while maintaining the content and structure of the original images. It offers an efficient and flexible solution for image editing without the need for model finetuning or training on large-scale datasets.

5.2 Embedding Finetuning

Many finetuning approaches opt to target either text or null-text embeddings for refinement, allowing for better integration of embeddings with the generative process to achieve enhanced editing outcomes.

Null-Text Embedding Finetuning. The goal of null-text embedding finetuning is to solve the problem of reconstruction failures in DDIM Inversion [18] and thus improve the consistency with the original image. In Null-Text Inversion [175], DDIM Inversion is first applied to the original image to obtain the inversion trajectory. Then, during the sampling process, the null-text embedding is finetuned to reduce the distance between the sampling trajectory and the inversion trajectory so that the sampling process can reconstruct the original image. The advantage of this approach is that neither the U-Net weights nor the text embedding are changed, so it is possible to improve reconstruction performance without changing the target prompt set by the user. Similarly, DPL [176] dynamically updates nouns in the text prompt with a leakage fixation loss to address cross-attention leakage. The null-text embedding is finetuned to ensure high quality cross-attention maps and accurate image reconstruction.
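A minimal sketch of this per-timestep optimization is shown below: at every denoising step, the null-text (unconditional) embedding is updated so that the guided sampling trajectory stays close to the previously stored DDIM inversion trajectory. The guided_step function, the per-step initialization, and the loop budget are illustrative assumptions rather than the exact recipe of [175].

import torch

def optimize_null_embeddings(inversion_latents, text_emb, null_emb_init,
                             guided_step, num_inner_steps=10, lr=1e-2):
    """inversion_latents: list of latents [z_T, ..., z_0] stored during DDIM inversion."""
    z = inversion_latents[0]  # start from the inverted noise z_T
    null_embs = []

    for t, z_target in enumerate(inversion_latents[1:]):
        null_emb = null_emb_init.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)

        for _ in range(num_inner_steps):
            # One classifier-free-guided denoising step with the current null embedding.
            z_pred = guided_step(z, t, text_emb, null_emb)
            loss = torch.nn.functional.mse_loss(z_pred, z_target)
            opt.zero_grad()
            loss.backward()
            opt.step()

        null_embs.append(null_emb.detach())
        with torch.no_grad():
            z = guided_step(z, t, text_emb, null_embs[-1])
    return null_embs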

Text Embedding Finetuning. Finetuning embeddings derived from input text can enhance image editing, making edited images more aligned with conditional characteristics. DiffusionDisentanglement [177] introduces a simple lightweight image editing algorithm that achieves style matching and content preservation by optimizing the blending weights of two text embeddings. This process involves optimizing about 50 parameters, with the optimized weights generalizing well across different images. Prompt Tuning Inversion [178] designs an accurate and fast inversion technique for text-based image editing, comprising a reconstruction phase and an editing phase. In the reconstruction phase, it encodes the information of the input image into a learnable text embedding. During the editing phase, a new text embedding is computed through linear interpolation, combining the target embedding with the optimized one to achieve effective editing while maintaining high fidelity.

5.3 Guidance with a Hypernetwork

Beyond conventional generative frameworks, some methods incorporate a custom network to better align with specific editing intentions. StyleDiffusion [179] introduces a Mapping Network that maps features of the input image to an embedding space aligned with the embedding space of textual prompts, effectively generating a prompt embedding. Cross-attention layers are used to combine textual prompt embeddings with image feature representations. These layers achieve text-image interaction by computing attention maps of keys, values, and queries. InST [180] integrates a multi-layer cross-attention mechanism in its Textual Inversion segment to process image embeddings. The learned key information is transformed into a text embedding, which can be seen as a “new word” representing the unique style of the artwork, effectively expressing its distinctive style.

5.4 Latent Variable Optimization

The direct optimization of an image’s latent variable is another technique employed during the finetuning process. This approach directly optimizes the noisy latents by introducing loss functions defined on features of intermediate layers, instead of optimizing the parameters of the generator or the embedded conditional parameters. By leveraging a pretrained diffusion model, most of these methods can perform image translation without the need for paired training data.

Human-Guided Optimization of Latent Variables. This approach allows users to participate in the image editing process, guiding the generation of images. Represented by DragGAN [262], this interactive editing process enables users to specify and move specific points within an image to new locations while keeping the rest of the image unchanged. DragGAN achieves this editing by optimizing the latent space of GANs. Subsequently, there are developments based on diffusion models, such as DragonDiffusion [181], which constructs an energy function in the intermediate features of the diffusion model to guide editing. This enables image editing guidance directly through image features, independent of textual descriptions. DragDiffusion [182] concentrates on optimizing the diffusion latent representations at a specific time step, rather than across multiple time steps. This design is based on the observation that the U-Net feature maps at a particular time step offer ample semantic and geometric information to facilitate drag-and-drop editing.
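The core of these drag-style methods can be sketched as gradient descent on the noisy latent itself, with a loss that pulls the appearance found at each handle point toward its target position, measured in intermediate U-Net feature space. The unet_features extractor and the simple feature-matching loss below are simplified assumptions rather than the exact objectives of [181] or [182].

import torch

def drag_latent_optimization(z_t, t, unet_features, handle_pts, target_pts,
                             num_steps=80, lr=1e-2):
    """z_t: noisy latent at a chosen timestep; *_pts: lists of (row, col) indices."""
    with torch.no_grad():
        ref_feat = unet_features(z_t, t)  # features of the original latent

    z = z_t.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(num_steps):
        feat = unet_features(z, t)
        loss = 0.0
        for (hr, hc), (tr, tc) in zip(handle_pts, target_pts):
            # The feature at each target position should match the feature
            # originally observed at the corresponding handle position.
            loss = loss + torch.nn.functional.mse_loss(
                feat[..., tr, tc], ref_feat[..., hr, hc])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()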

Utilizing Network Layers and Input for Optimizing Latent Variables. Some optimization methods utilize embeddings or network features derived from input condition to construct loss functions, thereby enabling direct optimization of latent variables. The DDS [183] method leverages two image-text pairs: one comprising a source image and its descriptive text, and the other a target image with its corresponding descriptive text. DDS calculates the discrepancy between these two image-text pairs, deriving the loss through this comparison. The loss function in DiffuseIT [184] also incorporates the use of the CLIP model’s text and image encoders to compute the similarity between the target text and the source image. CDS [185] integrates a contrastive learning loss function into the DDS framework, utilizing the spatially rich features of the self-attention layers of LDM [26] to guide image editing through contrastive loss calculation.
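As a reference point, the delta denoising score of [183] can be summarized by the gradient below, which subtracts the score-distillation gradient of the source image and source text from that of the edited image and target text, using a shared noise sample and timestep (notation simplified; \(w(t)\) is a timestep weighting and \(\boldsymbol{\theta}\) parameterizes the edited image):

\[
\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathrm{DDS}} \;\propto\; w(t)\,\big(\boldsymbol{\epsilon}_{\phi}(\mathbf{z}_t, y_{\mathrm{tgt}}, t) - \boldsymbol{\epsilon}_{\phi}(\hat{\mathbf{z}}_t, y_{\mathrm{src}}, t)\big)\,\frac{\partial \mathbf{z}}{\partial \boldsymbol{\theta}},
\]

where \(\mathbf{z}_t\) and \(\hat{\mathbf{z}}_t\) are noised versions of the edited and source images, respectively.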

5.5 Hybrid Finetuning

Some works combine the various finetuning approaches mentioned above, which can be sequential, with stages of tuning occurring in series, or conducted simultaneously as part of a single integrated workflow. Such composite finetuning methods can achieve targeted and effective image editing.

Training and Finetuning Free Approaches
  • Input Text Refinement: PRedItOR [191], ReDiffuser [192], Captioning and Injection [193], InstructEdit [194]
  • Inversion/Sampling Modification: Direct Inversion [195], DDPM Inversion [196], SDE-Drag [197], LEDITS++ [198], FEC [199], EMILIE [200], Negative Inversion [201], Proximal Inversion [202], Null-Text Guidance [203], EDICT [204], AIDI [205], CycleDiffusion [206], InjectFusion [207], Fixed-point inversion [208], TIC [209], Diffusion Brush [210], Self-guidance [211]
  • Attention Modification: P2P [212], Pix2Pix-Zero [213], MasaCtrl [85], PnP [214], TF-ICON [215], Object-Shape Variations [216], Conditional Score Guidance [217], EBMs [218], Shape-Guided Diffusion [219], HD-Painter [220]
  • Mask Guidance: FISEdit [221], Blended Latent Diffusion [222], PFB-Diff [223], DiffEdit [224], RDM [225], MFL [226], Differential Diffusion [227], Watch Your Steps [228], Blended Diffusion [229], ZONE [230], Inpaint Anything [231]
  • Multi-Noise Redirection: The Stable Artist [232], SEGA [233], LEDITS [234], OIR-Diffusion [235]

Figure 7: Taxonomy of training and finetuning free approaches for image editing.

Text Embedding & Denoising Model Finetuning. Imagic [187] implements its goal in stages, starting with converting target text into a text embedding, which is then optimized to reconstruct an input image by minimizing the difference between the embedding and the image. Simultaneously, the diffusion model is finetuned for better image reconstruction. A midpoint is found through linear interpolation between the optimized text embedding and the target text, combining features of both. This embedding is then used by the finetuned diffusion model to generate the final edited image. LayerDiffusion [188] optimizes text embeddings to match an input image’s background and employs a layered diffusion strategy to finetune the model, enhancing its ability to maintain subject and background consistency. Forgedit [189] focuses on rapid image reconstruction and finding suitable text embeddings for editing, utilizing the diffusion model’s encoder and decoder for learning image layout and texture details, respectively.

Text Encoder & Denoising Model Finetuning. SINE [190] initially finetunes the text encoder and the denoising model to better understand the content and geometric structure of individual images. It introduces a patch-based finetuning strategy, allowing the model to generate images at any resolution, not just the fixed resolution of the pretrained model. Through this finetuning and patch training, SINE is capable of handling single-image editing tasks, including but not limited to style transfer, content addition, and object manipulation.

6 Training and Finetuning Free Approaches

In the field of image editing, training and finetuning free approaches are attractive because they are fast and low-cost: they require no training (on a dataset) and no finetuning (on the source image) at any point in the editing process. This section divides them into five categories according to what they modify, as shown in Figs. 7 and 8. They ingeniously leverage the inherent principles within the diffusion model to achieve their editing goals.

6.1 Input Text Refinement

Input text refinement in image editing marks a significant step forward in the text-to-image translation mechanism. This line of work enhances text embeddings and streamlines user inputs, aiming to ensure that images are modified accurately and in context according to the given text. It enables conceptual modifications and intuitive user instructions, removing the need for complex model modifications.

Enhancement of Input Text Embeddings. These methods concentrate on the refinement of text embeddings to modify the text-to-image translation. PRedItOR [191] enhances input text embeddings by leveraging a diffusion prior model to perform conceptual edits in the CLIP image embedding space, allowing for more nuanced and context-aware image editing. ReDiffuser [192] generates rich prompts through regeneration learning, which are then fused to guide the editing process, ensuring that the desired features are enhanced while maintaining the source image structure. Additionally, Captioning and Injection [193] introduces a user-friendly image editing method by employing a captioning model such as BLIP [263] and prompt injection techniques [193] to augment input text embeddings. This method allows the user to provide minimal text input, such as a single noun, and then automatically generates or optimizes detailed prompts that effectively guide the editing process, resulting in more accurate and contextually relevant image editing.

Figure 8: Common framework of training and finetuning free methods, where the modifications described in different sections are indicated. The sample images are from LEDITS++ [198].

Instruction-Based Textual Guidance. This category enables fine-grained control over the image editing process via user instructions. It represents a significant step forward in enabling users to steer the editing process with less technical expertise. InstructEdit [194] introduces a framework comprising a language processor, a segmentation model like SAM [246], and an image editing model. The language processor interprets user instructions into segmentation prompts and captions, which are subsequently utilized to create masks and guide the image editing process. This approach shows the capability of instruction-based textual guidance in achieving detailed and precise image editing.

6.2 Inversion/Sampling Modification

Modifying the inversion and sampling formulas is a common technique in training and finetuning free methods [195, 196, 201]. The inversion process inverts a real image into a noisy latent, and the sampling process then generates an edited result given the target prompt. Among them, DDIM Inversion [18] is the most commonly used baseline, although it often fails to reconstruct the source image [204, 196, 201]. Direct Inversion [195] pioneers a paradigm that edits a real image by changing the source prompt to the target prompt, showcasing its ability to handle diverse tasks. However, it still faces the reconstruction failure problem. Thus, several methods [204, 196, 201] modify the inversion and sampling formulas to improve the reconstruction capability.
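For reference, DDIM inversion runs the deterministic DDIM update in the forward direction: given the model’s clean-image estimate at step \(t\), the next (noisier) latent is obtained as (standard formulation; notation follows the common convention rather than a specific paper in this section)

\[
\mathbf{x}_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t+1}}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t),
\]

where the first term rescales the predicted \(\mathbf{x}_0\) and the second re-adds the predicted noise at the higher noise level.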

Reconstruction Information Memory. DDPM Inversion [196], SDE-Drag [197], LEDITS++ [198], FEC [199], and EMILIE [200] all belong to the category of reconstruction information memory; they save the information in the inversion stage and use it in the sampling stage to ensure reconstruction. DDPM Inversion explores the latent noise space of DDPM, by preserving all noisy latents during the inversion stage, thus achieving user-friendly image editing while ensuring reconstruction performance. SDE-Drag introduces stochastic differential equations as a unified and improved framework for editing and shows its superiority in drag-conditioned editing. LEDITS++ addresses inefficiencies in DDIM Inversion, presenting a DPM-Solver [89] inversion approach. FEC focuses on consistency enhancement, providing three techniques for saving the information during inversion to improve the reconstruction consistency. EMILIE pioneers iterative multi-granular editing, offering iterative capabilities and multi-granular control for desired changes. These methods ensure reconstruction performance by memorizing certain inversion information, and on this basis, enhance the diversity of editing operations.

Utilizing Null-Text in Sampling. Negative Inversion [201], ProxEdit [202], and Null-Text Guidance [203] all explore the impact of null-text (also known as negative prompts) information on image editing. Negative Inversion uses the source prompt as the negative prompt during sampling, achieving comparable reconstruction quality to optimization-based approaches. ProxEdit addresses degradation in DDIM Inversion reconstruction with larger classifier-free guidance scales, and proposes an efficient solution through proximal guidance and mutual self-attention control. Meanwhile, Null-Text Guidance [203] leverages a disturbance scheme to transform generated images into cartoons without the need for training. Despite the different uses of null-text by these methods, they all explore the role of null-text or negative prompt settings in image editing from different perspectives.
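
As a rough illustration of where the negative branch enters, the sketch below shows a standard classifier-free guidance step in which the usually-null embedding can be swapped for the source-prompt embedding, in the spirit of Negative Inversion; eps_theta is again an assumed noise predictor, not a specific library call.

```python
def guided_eps(eps_theta, x, t, cond_tgt, cond_neg, scale=7.5):
    """Classifier-free guidance with a configurable negative branch.
    Passing a null-text embedding as cond_neg gives standard guidance;
    passing the source-prompt embedding instead follows the negative-inversion idea."""
    eps_neg = eps_theta(x, t, cond_neg)   # unconditional / negative-prompt prediction
    eps_pos = eps_theta(x, t, cond_tgt)   # target-prompt prediction
    return eps_neg + scale * (eps_pos - eps_neg)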

Single-Step Multiple Noise Prediction. EDICT [204] and AIDI [205] propose to predict multiple noises during single-step sampling to solve the problem of reconstruction failure. EDICT, drawing inspiration from affine coupling layers [204], achieves mathematically exact inversion of real images. It maintains coupled noise vectors, enabling precise image reconstruction without local linearization assumptions, outperforming DDIM Inversion. On the other hand, AIDI introduces accelerated iterative diffusion inversion, and focuses on improving reconstruction accuracy by predicting multiple noises during single-step inversion. By employing a blended guidance technique, AIDI demonstrates effective results in various image editing tasks with a low classifier-free guidance scale ranging from 1 to 3.

6.3 Attention Modification

Attention modification methods enhance the operations in the attention layers, which is the most common and direct way of training-free image editing. In the U-Net of Stable Diffusion, there are many cross-attention and self-attention layers, and the attention maps and feature maps in these layers contain a lot of semantic information. The common characteristic of attention modification entails discerning the intrinsic principles in the attention layers, and then utilizing them by modifying the attention operations. Prompt-to-Prompt (P2P) [212] is the pioneering work on attention modification.

Attention Map Replacement. P2P introduces an intuitive prompt-to-prompt editing framework reliant solely on text inputs by identifying cross-attention layers as pivotal in governing the spatial relationship between image layout and prompt words. As shown in Fig. 8, given the reconstruction and editing paths, P2P replaces the attention maps of the editing path with the corresponding ones of the reconstruction path, resulting in consistency between the edited result and the source image. Unlike P2P, Pix2Pix-Zero [213] eliminates the need for user-defined text prompts in real image editing. It autonomously discovers editing directions in the text embedding space, preserving the original content structure. Also, it introduces cross-attention guidance in the U-Net to retain input image features.
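
A conceptual sketch of this map replacement is given below; how the hook is attached to the U-Net's cross-attention layers and how the two passes are driven depend on the implementation, so layer_name and the surrounding loop are assumptions.

```python
class AttentionMapSwap:
    """Caches cross-attention maps during the reconstruction pass and replays them
    during the editing pass, which is the core mechanism behind P2P-style editing."""

    def __init__(self):
        self.cache = {}        # (layer_name, timestep) -> attention probabilities
        self.mode = "store"    # "store" on the reconstruction path, "swap" on the editing path

    def __call__(self, attn_probs, layer_name, timestep):
        key = (layer_name, timestep)
        if self.mode == "store":
            self.cache[key] = attn_probs.detach()
            return attn_probs
        # Editing path: reuse the reconstruction-path map so the layout stays consistent.
        return self.cache.get(key, attn_probs)
```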

Attention Feature Replacement. Both MasaCtrl [85] and PnP [214] emphasize replacing the attention features for consistency. MasaCtrl converts self-attention into mutual self-attention, which replaces the Key and Value features in the self-attention layer for action editing. A mask-guided mutual self-attention strategy further enhances consistency by addressing confusion of foreground and background. PnP enables fine-grained control over generated structures through spatial feature manipulation and self-attention, directly injecting guidance image features. While both contribute to feature replacement, MasaCtrl focuses on the Key and Value in the self-attention layer, while PnP emphasizes the Query and Key in the self-attention layer.
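
The sketch below illustrates the Key/Value replacement idea within a single self-attention call; the actual MasaCtrl implementation additionally restricts the replacement to selected layers and timesteps and applies mask guidance, which is omitted here.

```python
import torch

def mutual_self_attention(q_edit, k_edit, v_edit, k_src, v_src, use_source=True):
    """Self-attention where queries come from the editing branch while keys/values
    can be taken from the source branch, so the edit keeps the source appearance."""
    k = k_src if use_source else k_edit
    v = v_src if use_source else v_edit
    scale = q_edit.shape[-1] ** -0.5
    attn = torch.softmax(q_edit @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```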

Local Attention Map Modification. The TF-ICON [215] framework and Object-Shape Variation [216] share the utilization of local attention map modification. TF-ICON focuses on cross-domain image-guided composition, seamlessly integrating user-provided objects into visual contexts without additional training. It introduces an exceptional prompt for accurate image inversion together with deformation mapping of the source image's self-attention. Meanwhile, Object-Shape Variation [216] aims to generate collections depicting shape variations of specific objects in text-to-image workflows. It employs prompt-mixing for shape choices and proposes a localization technique using self-attention layers.

Attention Score Guidance. Both Conditional Score Guidance [217] and EBMs [218] employ different forms of attention score guidance. Conditional Score Guidance focuses on image-to-image translation by introducing a conditional score function. This function considers both the source image and text prompt to selectively edit regions while preserving others. The mixup technique enhances the fusion of unedited and edited regions, ensuring high-fidelity translation. EBMs tackle the problem of semantic misalignment with the target prompt, introducing an energy-based framework for adaptive context control that models the posterior of context vectors to enhance semantic alignment.

6.4 Mask Guidance

Mask guidance in diffusion-based image editing represents a class of techniques for enhancing image editing. It includes techniques for enhancing denoising efficiency through selective processing, precise mask auto-generation for targeted image editing, and mask-guided regional focus to ensure that localized modifications align accurately with specific regions of interest. These approaches leverage masks to steer and refine the sampling process, offering improved precision, speed, and flexibility.

Mask-Enhanced Denoising Efficiency. These methods utilize masks to enhance the efficiency of diffusion-based image editing. The methods [221, 222, 223] prioritize speed improvements while ensuring high-quality results. By selectively processing image regions through masks, they effectively reduce computational demands and enhance overall efficiency. FISEdit [221] introduces a cache-enabled sparse diffusion model that leverages semantic mapping between the minor modifications on the input text and the affected regions on the output image. It automatically identifies the affected image regions and utilizes the cached unchanged regions’ feature map to accelerate the inference process. Blended Latent Diffusion [222] leverages the text-to-image Latent Diffusion Model [26], which accelerates the diffusion process by functioning within a lower-dimensional latent space, and effectively obviates the requirement for gradient computations of CLIP [242] at every timestep of the diffusion model. PFB-Diff [223] seamlessly integrates text-conditioned generated content into the target image through multi-level feature blending and introduces an attention masking mechanism in the cross-attention layers to confine the impact of specific words to desired regions, improving the performance of background editing.

Mask Auto-Generation. These methods enable precise image editing by automatically generating masks. DiffEdit [224] simplifies semantic editing by automatically generating masks that isolate relevant regions for modification. This selective editing approach safeguards unedited regions and preserves their semantic integrity. RDM [225] introduces a region-aware diffusion model that seamlessly integrates masks to automatically pinpoint and edit regions of interest based on text-driven guidance. MFL [226] proposes a two-stage mask-free training paradigm tailored for text-guided image editing. In the first stage, a unified mask is obtained according to the source prompt, and then several candidate images are generated with the provided mask and the target prompt based on the diffusion model.
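
A simplified sketch of DiffEdit-style mask estimation is shown below, assuming a noise predictor eps_theta(x, t, cond) and a scalar noise level alpha_bar_t; the published method smooths and binarizes the difference map more carefully, so this is only an illustration of the idea.

```python
import torch

def estimate_edit_mask(eps_theta, x0, alpha_bar_t, t, cond_src, cond_tgt,
                       n_trials=10, threshold=0.5):
    """Contrast noise predictions under the source and target prompts to localize
    the region that needs editing (DiffEdit-style sketch)."""
    diffs = []
    for _ in range(n_trials):
        noise = torch.randn_like(x0)
        x_t = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise  # forward diffusion
        d = (eps_theta(x_t, t, cond_tgt) - eps_theta(x_t, t, cond_src)).abs()
        diffs.append(d.mean(dim=1, keepdim=True))                         # average over channels
    m = torch.stack(diffs).mean(dim=0)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)                        # normalize to [0, 1]
    return (m > threshold).float()
```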

Mask-Guided Regional Focus. This line of research employs masks as navigational tools, steering the sampling process towards specific regions of interest. The methods [227, 228] excel in localizing edits, ensuring that modifications align precisely with designated regions specified in text prompts or editing requirements. Differential Diffusion [227] pioneers a region-aware editing paradigm where masks are employed to customize the intensity of change within each region of the source image. This allows a granular level of control, fostering exquisite details and artistic expression. Watch Your Steps [228] utilizes masks as relevance maps to pinpoint desired edit regions, ensuring that only the most relevant regions are edited while preserving the other regions of the original image.

6.5 Multi-Noise Redirection

Multi-noise redirection is the process of predicting multiple noises in different directions in a single step and then redirecting them into a single noise. The advantage of this redirection is its capacity to enable the single noise to unify multiple distinct editing directions simultaneously, thereby more effectively meeting the user’s editing requirements.

Semantic Noise Steering. These methods semantically guide the noise redirection in the sampling process and enhance fine-grained control over image content. Stable Artist [232] and SEGA [233] showcase the ability to steer the diffusion process using semantic guidance. They facilitate refined edits in image composition through control over various semantic directions within the noise redirection process. LEDITS [234] applies semantic noise steering by combining DDPM inversion [196] with SEGA, and facilitates versatile editing direction, encompassing both object editing and overall style change.

Object-Aware Noise Redirection. OIR-Diffusion [235] introduces the object-aware inversion and reassembly method, which enables fine-grained editing of specific objects within an image by determining optimal inversion steps for each sample and seamlessly integrating the edited regions with the other regions. It significantly maintains the fidelity of the non-edited regions and achieves remarkable results in editing diverse object attributes, particularly in intricate multi-object scenarios.

7 Inpainting and Outpainting

Image inpainting and outpainting, often regarded as sub-tasks of image editing, occupy unique positions with distinct objectives and challenges. We divide them into two major types (see Fig. 9) for better explanation, as detailed in Sections 7.1 and 7.2.

Figure 9: Visual comparison between traditional context-driven inpainting (top) and multimodal conditional inpainting (bottom). The samples in the two rows are from Palette [58] and Imagen Editor [153], respectively.

7.1 Traditional Context-Driven Inpainting

7.1.1 Supervised Training

The supervised training-based inpainting & outpainting methods usually train a diffusion model from scratch with paired corrupted and complete images. Palette [58] develops a unified framework for image-to-image translation based on conditional diffusion models and individually trains this framework on four image-to-image translation tasks, namely colorization, inpainting, outpainting, and JPEG restoration. It utilizes a direct concatenation of low-quality reference images with the denoised result at the $t-1$ step as the condition for noise prediction at the $t$ step. SUD$^2$ [264] builds upon and extends the SUD framework [265], integrating additional strategies like correlation minimization, noise injection, and particularly the use of denoising diffusion models to address the scarcity of paired training data in semi-supervised learning.

7.1.2 Zero-Shot Learning

Context Prior Integration. This approach aims to extract inherent structures and textures from the uncorrupted part of an image as a complement for generating content in the masked corrupted regions at each step, thereby maintaining global content consistency. Representatively, RePaint [266] builds on a pretrained unconditional diffusion model and modifies the standard denoising strategy by sampling the masked regions from the diffusion model and sampling the unmasked regions from the given image in each reverse step. GradPaint [267] also utilizes a pretrained diffusion model to estimate content for masked image regions. It then uses a custom loss function to measure coherence with unmasked parts, updating the generated content based on the loss's gradient to ensure context consistency.
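
A minimal sketch of this known/unknown combination (in the spirit of RePaint, without its resampling loop) is shown below; reverse_step and forward_noise stand for the model's reverse transition and the forward noising of the source image, and are assumed helpers rather than library calls.

```python
def context_blend_step(x_t, x0_known, mask, t, reverse_step, forward_noise):
    """One reverse step that samples the masked region from the diffusion model and
    the known region from the noised source image (mask = 1 on known pixels)."""
    x_known = forward_noise(x0_known, t)   # sample of q(x_{t-1} | x_0) for the known region
    x_unknown = reverse_step(x_t, t)       # sample of p_theta(x_{t-1} | x_t) for the hole
    return mask * x_known + (1 - mask) * x_unknown
```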

Degradation Decomposition. Image inpainting can be regarded as a specialized application of general linear inverse problems, which is represented as:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \quad (13)$$

where $\mathbf{H}$ is the degradation operator (e.g., a mask matrix comprising 0 and 1 that indicates the corrupted regions of the image in inpainting), and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. The objective is to recover the original image $\mathbf{x}$ from the corrupted image $\mathbf{y}$. A few works implement it by performing decomposition on $\mathbf{H}$. DDRM [66], utilizing a pretrained denoising diffusion model and Singular Value Decomposition (SVD) of $\mathbf{H}$, transforms this recovery into an efficient iterative diffusion process within the spectral space of $\mathbf{H}$. Differently, DDNM [68] addresses the image inverse problem by applying a range-null space decomposition, where the image is divided into two distinct components via the pseudo-inverse $\mathbf{H}^{\dagger}$ of the degradation matrix.
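
To make the range-null space idea concrete, note that in the noiseless case $\mathbf{y} = \mathbf{H}\mathbf{x}$, any estimate can be decomposed as

$$\hat{\mathbf{x}} = \mathbf{H}^{\dagger}\mathbf{y} + (\mathbf{I} - \mathbf{H}^{\dagger}\mathbf{H})\,\bar{\mathbf{x}},$$

which satisfies $\mathbf{H}\hat{\mathbf{x}} = \mathbf{H}\mathbf{H}^{\dagger}\mathbf{y} = \mathbf{y}$ by construction: the range-space term fixes data consistency, so the diffusion model only needs to generate a plausible null-space component $\bar{\mathbf{x}}$. This identity is our own illustration of the decomposition rather than a reproduction of DDNM's full algorithm.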

Posterior Estimation. To solve the general noisy linear inverse problem as formulated in Eq. 13, several studies [70, 268, 269, 270, 271, 272] focus on estimating the posterior distribution $p(\mathbf{x}|\mathbf{y})$, leveraging the unconditional diffusion model based on Bayes' theorem. It calculates the conditional posterior $p(\mathbf{x}_t|\mathbf{y})$ at each step of the reverse diffusion process. With Bayes' theorem, it can be derived that $p(\mathbf{x}_t|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x}_t)\,p(\mathbf{x}_t)}{p(\mathbf{y})}$. The corresponding gradient of the log-likelihood is then computed as:

$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t|\mathbf{y}) = \nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t) + \nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t), \quad (14)$$

where the score function $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)$ can be estimated using a pretrained score network $\mathbf{s}_{\theta}(\mathbf{x}_t, t)$. The estimation of $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t|\mathbf{y})$ then reduces to estimating the unknown term $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t)$. Two representative earlier works, MCG [70] and DPS [268], approximate the posterior $p_t(\mathbf{y}|\mathbf{x}_t)$ with $p_t(\mathbf{y}|\hat{\mathbf{x}}_0)$, where $\hat{\mathbf{x}}_0$ is the noise-free expectation of $\mathbf{x}_t$, formalized as $\hat{\mathbf{x}}_0 = E[\mathbf{x}_0|\mathbf{x}_t]$ using Tweedie's formula. Hence, Eq. 14 becomes:

$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t|\mathbf{y}) = \nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t) + \nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) \approx -\rho\,\nabla_{\mathbf{x}_t}\|\mathbf{y} - \mathbf{H}\hat{\mathbf{x}}_0\|_2^2 + \mathbf{s}_{\theta}(\mathbf{x}_t, t), \quad (15)$$

where $\rho = \frac{1}{\sigma^2}$ is the step size. MCG further adds a projection onto the measurement subspace after each update of Eq. 15, serving as a corrective measure for deviations from ideal data consistency. $\Pi$GDM [269] expands the posterior estimation framework in Eq. 14 to address linear, non-linear, and differentiable inverse problems by introducing the Moore-Penrose pseudoinverse $\mathbf{H}^{\dagger}$ of the degradation function $\mathbf{H}$. GDP [270] gives a different approximation formulated as:

$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t) \approx -s\,\nabla_{\mathbf{x}_t}\mathcal{L}(\mathbf{H}\hat{\mathbf{x}}_0 + \mathbf{n}, \mathbf{y}) - \lambda\,\nabla_{\mathbf{x}_t}\mathcal{Q}(\hat{\mathbf{x}}_0), \quad (16)$$

where $\mathcal{L}$, $\mathcal{Q}$, $s$, and $\lambda$ are an image distance metric, a quality loss, a scale factor controlling the magnitude of guidance, and a scale factor for adjusting the quality of images, respectively. With a similar goal, CoPaint [271] uses the one-step estimation of the final generated image to predict $\hat{\mathbf{x}}_0$ deterministically to make the computation of the inpainting errors tractable at each intermediate denoising step.
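
A sketch of how such posterior guidance can be realized in code is given below, following the approximation in Eq. 15; s_theta is an assumed pretrained score network, H is the degradation operator applied as a function, and alpha_bar_t is the cumulative noise level, so this is an illustration rather than the exact update rule of any single method.

```python
import torch

def guided_score(x_t, t, y, H, s_theta, alpha_bar_t, rho=1.0):
    """Approximate grad log p(x_t | y) as in Eq. 15: the pretrained score plus a
    data-consistency gradient computed through the Tweedie estimate of x_0."""
    x_t = x_t.detach().requires_grad_(True)
    score = s_theta(x_t, t)
    # Tweedie's formula: x_hat_0 = (x_t + (1 - alpha_bar_t) * score) / sqrt(alpha_bar_t)
    x_hat0 = (x_t + (1 - alpha_bar_t) * score) / alpha_bar_t ** 0.5
    residual = (y - H(x_hat0)).pow(2).sum()          # ||y - H x_hat_0||_2^2
    grad = torch.autograd.grad(residual, x_t)[0]
    return score - rho * grad
```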

7.2 Multimodal Conditional Inpainting

7.2.1 Random Mask Training

Traditional image inpainting, which focuses on filling missing areas in images driven by surrounding context, often lacks precise control over the content. With the advance of text-to-image diffusion models, this limitation is being overcome by injecting additional user-provided multimodal conditions such as text descriptions, segmentation maps, and reference images as guidance. These models, e.g., Stable Diffusion and Imagen, can also be adapted to the inpainting task where the noisy background is replaced with a noisy version of the original image in the reverse diffusion process. This modification, however, sometimes results in unsatisfactory samples due to limited global context visibility during sampling. Addressing this, models like GLIDE [34] and Stable Inpainting (inpainting specialist v1.5 from Stable Diffusion [26]) further finetune the pretrained text-to-image models specialized for inpainting. They use a randomly generated mask along with the masked image and the full image caption, enabling the model to utilize information from the unmasked region during training. As pioneering works, they provide enlightening insights for subsequent research.

7.2.2 Precise Control Conditioning

An essential problem with random masking is that it may cover regions unrelated to the text prompt, leading the model to ignore the prompt, especially when the masked regions are small or cover only a part of an object. To provide more precise control over the inpainted content, several approaches have been proposed. SmartBrush [150] introduces a precision factor to produce various mask types, from fine to coarse, by applying Gaussian blur to accurate instance masks. It allows users to choose between coarse masks, which include the desired object within the mask, and detailed masks that closely outline the object shape.

Imagen Editor [153] extends Imagen [27] through finetuning, employing a cascaded diffusion model for text-guided inpainting. It integrates image and mask context into each diffusion stage with three convolutional downsampling image encoders. The mask here is generated on-the-fly by an object detector, SSD Mobilenet v2 [273], instead of randomly masking.

PbE [142] and PhD [145] focus on using reference images for users’ customized inpainting. Specifically, PbE discards the text while only extracting subject information from the reference image using CLIP [242]. The reference image is cropped from the inpainting image itself with augmentations to support self-supervised training. In contrast, PhD integrates both text and reference image into a training pipeline similar to PbE’s.

Uni-paint [155] first finetunes Stable Diffusion on the unmasked regions of the input images with null text and then uses various modes of guidance including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, and a combination of them to perform multimodal inpainting during inference via different condition integration techniques. For facial inpainting, PVA [274] preserves identity by encoding a few reference images with the same identity as the masked image through Transformer blocks. The output image embedding, combined with the CLIP text embedding, then serves as extra conditions of the Stable Inpainting model through a parallel visual attention mechanism. SmartMask [154] utilizes semantic amodal segmentation data to create high-quality annotations for mask-free object insertion. It trains a diffusion model to predict instance maps from these annotations including both semantic maps and scene context. During inference, it uses panoptic segmentation to create a semantic map, which guides the prediction of object insertion masks.

7.2.3 Pretrained Diffusion Prior Exploiting

Despite significant advancements in training-based methods, there still exist challenges, particularly in gathering large-scale realistic data and the extensive resources required for model training. Consequently, some research shifts focus to specific inpainting tasks, harnessing the generative capabilities of pretrained text-to-image diffusion models via integrating various techniques.

Blended Diffusion [229] employs CLIP to calculate the difference between the denoised image embedding from a pretrained unconditional diffusion model and the guided text embedding to refine the generation within masked regions. It then spatially blends the noisy masked regions with the corresponding noisy version of the unmasked part of the input image during each sampling step. Blended Latent Diffusion [222] takes this further by applying the technique in the LDM [26] for higher efficiency.

Inpaint Anything [231] combines SAM [246] and existing inpainting models such as LaMa [275] and Stable Inpainting [222] into a whole framework that allows users to remove, replace, or fill any part of an image with simple clicks, offering a high level of user-friendliness and flexibility in image editing. MagicRemover [186] starts with a null-text inversion [175] to project an image into a noisy latent code, and then optimizes subsequent codes during the sampling process. It introduces an attention-based guidance strategy and a classifier optimization algorithm to improve the precision and stability of the inpainting process with fewer steps.

HD-Painter [220] is designed for high-resolution image inpainting and operates in two stages: image completion and inpainting-specialized super-resolution. The first stage is carried out in the latent space of Stable Diffusion, where a prompt-aware introverted attention mechanism replaces the original self-attention, and an additional reweighting attention score guidance is employed. This modification ensures better alignment of the generated content with the input prompt. The second stage uses the Stable Diffusion Super-Resolution variant with the low-resolution inpainted image conditioned and mask blended.

7.3 Outpainting

Image outpainting, similar to but not the same as inpainting, aims to generate new pixels that seamlessly extend an image’s boundaries. Existing T2I models like Stable Diffusion [26] and DALL-E [25] can be generalized to address this task due to their training on extensive datasets including images of various sizes and shapes. It is often considered a special form of inpainting with similar implementations. For instance, Palette [58] trains a diffusion model by directly combining cropped images with their original versions as input in a supervised manner. PowerPaint [152] divides inpainting and related tasks into four specific categories, viewing outpainting as a type of context-aware image inpainting. It enhances Stable Diffusion with learnable task-specific prompts and finetuning strategies to precisely direct the model towards distinct objectives.

8 Benchmark and Evaluation

8.1 Benchmark Construction

In the previous sections, we delve into the methodology aspects of diffusion model-based image editing approaches. Beyond this analysis, it is crucial to assess these methods, examining their capabilities across various editing tasks. However, existing benchmarks for image editing are limited and do not fully meet the needs identified in our survey. For instance, EditBench [153] primarily aims at text and mask-guided inpainting, and ignores broader tasks that involve global editing like style transfer. TedBench [187], while expanding the range of tasks, lacks detailed instructions, which are essential for evaluating methods that rely on textual instructions rather than descriptions. Moreover, the EditVal benchmark [276], while attempting to offer more comprehensive coverage of tasks and methods, is limited by the quality of its images from the MS-COCO dataset [277], which are often low-resolution and blurry.

To address these problems, we introduce EditEval, a benchmark designed to evaluate general diffusion model-based image editing methods. EditEval includes a carefully curated dataset of 50 high-quality images, each accompanied by text prompts. EditEval evaluates performance on 7 common editing tasks selected from Table I. Additionally, we propose LMM Score, a quantitative evaluation metric that utilizes the capabilities of large multimodal models (LMMs) to assess the editing performance on different tasks. In addition to the objective evaluation provided by LMM Score, we also carry out a user study to incorporate subjective evaluation. The details of EditEval’s construction and its application are described as follows.

8.1.1 Task Selection

When selecting evaluation tasks, we take into account the capabilities indicated in Table I. It is observed that most methods are capable of handling semantic and stylistic tasks while encountering challenges with structural editing. The potential reason is that many current T2I diffusion models, which most editing methods depend on, have difficulty in accurate spatial awareness [278]. For example, they often generate inconsistent images with the prompts containing terms like “to the left of”, “underneath”, or “behind”. Considering these factors and the practical applications, we choose seven common tasks for our benchmark: object addition, object replacement, object removal, background replacement, style change, texture change, and action change, which aim to provide a thorough evaluation of the editing method’s performance from simple object edits to complex scene changes.

8.1.2 Dataset Construction

For image selection, we manually choose 50 images from Unsplash’s online repository of professional photographs, ensuring a wide diversity of subjects and scenes. These images are cropped to a square format, aligning with the input ratio for most editing models. We then categorize the chosen images into 7 groups, each associated with one of the specific editing tasks mentioned above.

In generating prompts for each image, we employ an LMM to create a source prompt that describes the image’s content, a target prompt outlining the expected result of the editing, and a corresponding instruction to guide the editing process. This step is facilitated by providing GPT-4V, one of the most widely-used LMMs, with a detailed template that includes an example pair and several instructions. The example pair consists of a task indicator (e.g., “Object Removal”), a source image, a source description (e.g., “A cup of coffee on an unmade bed”), a target description (e.g., “An unmade bed”), and an instruction (e.g., “Remove the cup of coffee”). Along with this example, we feed GPT-4V clear instructions on our expectations for generating source descriptions, target descriptions, and instructions for new images and task indicators we upload. This method ensures that GPT-4V produces prompts and instructions that not only align with the specified editing task but also maintain diversity in scenes and subjects. After GPT-4V generates the initial prompts and instructions, we undertake a meticulous examination to ensure each prompt and the set of instructions is specific, clear, and directly applicable to the corresponding image and task.
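
For illustration, the example pair described above can be represented as a simple record like the following; the field names are our own and purely illustrative, not the exact template wording given to GPT-4V.

```python
example_pair = {
    "task": "Object Removal",
    "source_description": "A cup of coffee on an unmade bed",
    "target_description": "An unmade bed",
    "instruction": "Remove the cup of coffee",
}
```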

The final dataset, including the selected images, their source and target prompts, and the editing instructions, is available in the accompanying GitHub repository.

8.1.3 Metric Design and Selection

The field of image editing has traditionally relied on CLIPScore [279] as a major quantitative evaluation metric. Despite its effectiveness in assessing the alignment between images and corresponding textual prompts, CLIPScore may struggle in complicated scenes with many details and specific spatial relationships [276]. This limitation underscores the need for a more versatile and encompassing metric that can be applied to broader image editing tasks. Hence, we propose LMM Score, a new metric that harnesses the advanced visual-language understanding capabilities of large multimodal models (LMMs).

To develop this metric, we first direct GPT-4 to conceive a quantitative metric and outline a framework that allows for objective evaluation of general image editing tasks through appropriate user prompts. Based on GPT-4’s recommendations, the evaluation framework encompasses the following elements: a source image $I_{src}$ along with its text description $t_{src}$, a collection of edited images $I_{tgt}^{i}$, $i = 1, 2, \ldots, N$, generated by $N$ methods, an editing prompt $t_{edit}$, an editing instruction $t_{inst}$, and a task indicator $o$. The criteria integrate four crucial factors:

• Editing Accuracy: Evaluate how closely the edited image $I_{tgt}^{i}$ adheres to the specified editing prompt $t_{edit}$ and instruction $t_{inst}$, measuring the precision of the editing.

• Contextual Preservation: Assess how well the edited image $I_{tgt}^{i}$ maintains the context of the source image $I_{src}$ that requires no changes.

• Visual Quality: Assess the overall quality of the edited image $I_{tgt}^{i}$, including factors like resolution, absence of artifacts, color accuracy, sharpness, etc.

• Logical Realism: Evaluate how logically realistic the edited image $I_{tgt}^{i}$ is, in terms of adherence to natural laws like lighting consistency, texture continuity, etc.

As for evaluation, each edited image $I_{tgt}^{i}$ is evaluated on these factors, yielding four sub-scores (ranging from 1 to 10) denoted as $S_{acc}$, $S_{pre}$, $S_{qua}$, and $S_{real}$. An overall score, $S_{LMM}$, is computed as a weighted average of these sub-scores to reflect the overall editing quality:

$$S_{LMM} = 0.4 \times S_{acc} + 0.3 \times S_{pre} + 0.2 \times S_{qua} + 0.1 \times S_{real}. \quad (17)$$

This weighted formula, suggested by GPT-4, aims to balance the relative importance of each evaluation factor.
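
In code, the aggregation of Eq. 17 is a one-liner; the sketch below is merely a convenience wrapper around the weights stated above.

```python
def lmm_score(s_acc, s_pre, s_qua, s_real):
    """Overall LMM Score as the weighted average of the four sub-scores (Eq. 17)."""
    return 0.4 * s_acc + 0.3 * s_pre + 0.2 * s_qua + 0.1 * s_real

# Example: sub-scores of 8, 7, 9, and 6 give an overall score of 7.7.
assert abs(lmm_score(8, 7, 9, 6) - 7.7) < 1e-9
```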

Following the proposed framework, we employ an LMM, specifically GPT-4V in this work, to carry out this evaluation. GPT-4V follows user-predefined instructions to calculate the sub-scores across the four evaluation factors for each edited image, obtaining the value of $S_{LMM}$ as in Eq. 17. To enable a user-friendly application of LMM Score, we provide a template that includes predefined instructions and necessary materials, which is accessible in our GitHub repository, facilitating its adoption in image editing evaluation.

In addition to the objective evaluation provided by LMM Score, we conduct a user study to collect subjective feedback. This study engages 50 participants from diverse backgrounds, ensuring a wide range of perspectives. Each participant is presented with the source image, its description, the editing prompt and instruction, and a series of edited images. They are then asked to score each edited image according to the same four evaluation factors utilized by LMM Score. Each factor is scored from 1 to 10. The participants’ scores are then aggregated to calculate an overall User Score ($S_{User}$) for each image, using the same calculation as in Eq. 17. The results from this user study complement the LMM Score evaluations, providing a comprehensive assessment of the performance of different editing methods.

8.2 Evaluation

8.2.1 Method Selection

For the evaluation of each editing task, we carefully select between 4 and 8 methods from Table I, encompassing a range of approaches including training-based, testing-time finetuning, and training & finetuning free methods. To ensure a fair and consistent comparison, our selection criteria are as follows: the method must require only text conditions, be capable of handling the specific task, and have open-source code available for implementation. We exclude domain-specific methods to avoid the constraints imposed by their limited applicability across diverse domains. These selected methods are specified in subsequent subsections.

8.2.2 Comparative Analysis

Figure 10: Quantitative performance of various methods across the 7 selected editing tasks. $\mu$ and $\sigma$ denote the mean value and standard deviation of the scores, respectively.

Performance Comparison. To present a thorough evaluation of the selected methods across the 7 selected editing tasks, we compute the mean and standard deviation of both $S_{LMM}$ and $S_{User}$ over all evaluation samples, as shown in Fig. 10. The color bars illustrate the range of scores from minimum to maximum. Several insights can be drawn from the results. First, no single method outperforms the others across all tasks. For example, while LEDITS++ [198] demonstrates commendable performance in object removal (Fig. 10(c)), it does not perform well in style change (Fig. 10(e)). Second, most methods exhibit widely fluctuating scores within each task, as evidenced by the broad span between the minimum and maximum values and the large standard deviations of both $S_{LMM}$ and $S_{User}$. Such variability points to the limited robustness of these methods, suggesting that a method's performance can be sample-dependent and may not consistently meet the editing requirements in various scenes. Third, some methods do exhibit stable and impressive performance on certain tasks. Null-Text Inversion (Null-Text) [175], for example, consistently shows high score distributions in object replacement, background change, and texture change, indicating its reliable editing quality for some real-world applications. Last, the mean values of $S_{LMM}$ and $S_{User}$ are closely matched for each method within each task, suggesting that LMM Score well reflects users' preferences. Although the methods' performances fluctuate, the consistency between LMM Score and user perception across the tasks indicates that LMM Score is a reliable metric for image editing. Besides the quantitative results, we also present several visual examples for qualitative comparison in Fig. 13.
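The statistics plotted in Fig. 10 can be reproduced from per-sample scores with a few lines of NumPy; the dictionary layout below is only an illustrative assumption about how the per-method scores might be stored:

import numpy as np

# scores[method] -> per-sample S_LMM (or S_User) values for one editing task
scores = {
    "Null-Text": [8.5, 7.9, 9.1, 8.2],
    "LEDITS++":  [6.0, 9.0, 4.5, 8.8],
}

for method, values in scores.items():
    v = np.asarray(values, dtype=float)
    # mean, standard deviation, and the min-max range shown by the color bars
    print(f"{method}: mu={v.mean():.2f}, sigma={v.std():.2f}, "
          f"range=[{v.min():.1f}, {v.max():.1f}]")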

Figure 11: Pearson correlation coefficients between LMM Score and the user study.

Correlation Between LMM Score and User Study. To examine the effectiveness of LMM Score, we investigate its correlation with the user study. Specifically, we first compute the mean of the collected user sub-scores on each of the 4 factors. Then, within each editing task, we compute the Pearson correlation coefficient between the LMM sub-scores and the user-study sub-scores on each factor for each sample. Finally, these coefficients are averaged across all samples to obtain the overall correlation for each factor per task. The results, as illustrated in Fig. 11, reveal a significant alignment across the editing tasks, indicating a strong concordance between the objective LMM Score evaluation and subjective human judgment.
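This procedure can be sketched as follows, assuming the sub-scores of one task are stored as samples × methods × factors arrays and taking the correlation across the compared methods for each sample (one plausible reading of the protocol); scipy.stats.pearsonr supplies the coefficient:

import numpy as np
from scipy.stats import pearsonr

# For one editing task:
#   lmm_sub[sample, method, factor]  - LMM sub-scores
#   user_sub[sample, method, factor] - averaged user-study sub-scores
# The shapes below (20 samples x 6 methods x 4 factors) are illustrative assumptions.
rng = np.random.default_rng(0)
lmm_sub = rng.uniform(1, 10, size=(20, 6, 4))
user_sub = lmm_sub + rng.normal(0, 1.0, size=lmm_sub.shape)

n_samples, _, n_factors = lmm_sub.shape
per_factor_corr = []
for f in range(n_factors):
    # Pearson coefficient across methods for each sample, then averaged over samples.
    coeffs = [pearsonr(lmm_sub[s, :, f], user_sub[s, :, f])[0] for s in range(n_samples)]
    per_factor_corr.append(float(np.mean(coeffs)))

print("Average per-factor correlation:", [f"{c:.2f}" for c in per_factor_corr])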

Figure 12: Comparison of Pearson correlation coefficients between LMM Score / CLIPScore and the user study.
Figure 13: Visual comparison on 7 selected editing types. Each row shows the source image followed by the results of the compared methods.
Object Addition (“Add a glass of milk next to the cookies”): InstructPix2Pix [156], InstructDiffusion [160], DiffusionDisentanglement [178], Imagic [187], DDPM Inversion [196], LEDITS++ [198], ProxEdit [202], AIDI [205].
Object Removal (“Remove the wooden house and dock”): Inst-Paint [163], InstructDiffusion [160], LEDITS++ [198], ProxEdit [202], AIDI [205].
Object Replacement (“Replace the llama toy with a floor lamp”): InstructPix2Pix [156], InstructDiffusion [160], DiffusionDisentanglement [178], Null-Text [175], Imagic [187], LEDITS++ [198], ProxEdit [202], AIDI [205].
Background Change (“Change the city street to a dense jungle”): InstructPix2Pix [156], DDS [183], Null-Text [175], SINE [190], EDICT [204], LEDITS++ [198], ProxEdit [202], AIDI [205].
Style Change (“Transform it to the Van Gogh style”): InstructPix2Pix [156], InstructDiffusion [160], Null-Text [175], Imagic [187], DDPM Inversion [196], LEDITS++ [198], ProxEdit [202], AIDI [205].
Texture Change (“Turn the horse into a statue”): InstructPix2Pix [156], InstructDiffusion [160], DiffusionDisentanglement [178], Null-Text [175], Imagic [187], PnP [214], EBMs [218], CycleDiffusion [206].
Action Change (“Let the bear raise its hand”): Forgedit [189], Imagic [187], MasaCtrl [85], TIC [209].

LMM Score vs. CLIPScore. To further evaluate the effectiveness of LMM Score, we compute the Pearson correlation coefficients between the commonly used metric CLIPScore [279] and the user scores for comparison. CLIPScore is calculated from the specified text prompt $t_{edit}$ of a source sample $I_{src}$ and the corresponding edited image $I^{i}_{tgt}$ generated by method $i$ within a specific editing task [279]. Once the CLIPScore values are obtained for all samples across the 7 selected tasks, we calculate the Pearson correlation coefficients between these scores and the overall user scores on all the samples. Unlike LMM Score, CLIPScore does not consider specific factors and is thus directly compared with $S_{User}$. These coefficients are then averaged to yield an overall correlation coefficient for each task. Concurrently, we compute the Pearson correlation coefficients between the overall LMM scores $S_{LMM}$ and the overall user scores on all the samples and average them per task as well. All the $S_{LMM}$ scores are normalized to [0, 1] to match the range of CLIPScore. Fig. 12 visualizes these comparisons and demonstrates that, for all these editing tasks, LMM Score maintains a closer correlation with user judgments than CLIPScore does, signifying a more accurate reflection of user evaluations and confirming the effectiveness of LMM Score in evaluating various image editing tasks and methods.
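For reference, CLIPScore can be approximated with a publicly available CLIP model as a scaled cosine similarity between the edited image and the editing prompt; the Hugging Face checkpoint name and the 2.5·max(cos, 0) formulation below follow the common CLIPScore definition and are assumptions rather than the exact setup of [279]:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """CLIPScore ~= 2.5 * max(cos(E_image, E_text), 0)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# Example usage (paths and prompt are placeholders):
# edited = Image.open("edited.png")
# print(clip_score(edited, "Change the city street to a dense jungle"))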

9 Challenges and Future Directions

Despite the success achieved in image editing with diffusion models, there are still limitations that need to be addressed in future work.

Fewer-step Model Inference. Most diffusion-based models require a large number of denoising steps to obtain the final image during inference, which is both time-consuming and computationally costly, posing challenges for model deployment and user experience. To improve inference efficiency, few-step or even one-step generation diffusion models have been studied [280, 281, 90]. Recent methods in this field reduce the number of steps by distilling knowledge from a strong pretrained diffusion model, so that the few-step model can mimic the behavior of the strong model. A more interesting yet challenging direction is to develop few-step models directly, without relying on pretrained models, such as consistency models [282, 283].
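To illustrate where the cost comes from, the sketch below runs a deterministic DDIM-style sampler whose step budget is a free parameter; reducing num_steps from, e.g., 50 to 4 is the kind of speed-up that distillation- and consistency-based methods aim to achieve without sacrificing quality. The toy noise predictor and noise schedule are assumptions for illustration only:

import torch

class ToyEpsNet(torch.nn.Module):
    """Stand-in for a trained noise-prediction U-Net (assumption)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 16)

    def forward(self, x, t):
        return self.net(x)  # predicts epsilon for the noisy input x at timestep t

@torch.no_grad()
def ddim_sample(eps_net, shape, num_steps, T=1000):
    """Deterministic DDIM sampling with num_steps << T."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    timesteps = torch.linspace(T - 1, 0, num_steps).long()

    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        a_t = alphas_bar[t]
        a_prev = alphas_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = eps_net(x, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x

sample_50 = ddim_sample(ToyEpsNet(), (1, 16), num_steps=50)  # standard budget
sample_4 = ddim_sample(ToyEpsNet(), (1, 16), num_steps=4)    # few-step budget
print(sample_50.shape, sample_4.shape)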

Efficient Models. Training a diffusion model that can generate realistic results is computationally intensive and requires a large amount of high-quality data. This complexity makes developing diffusion models for image editing very challenging. To reduce the training cost, recent works design more efficient network architectures as the backbones of diffusion models [284, 285]. Another important direction is to train only a portion of the parameters, or to freeze the original parameters and add a few new layers on top of the pretrained diffusion model [286, 113].
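A common realization of the latter direction is to freeze the pretrained weights and train only a small, zero-initialized adapter branch, so that training starts from the original model's behavior; the module shapes below are illustrative assumptions rather than any specific published architecture:

import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for one block of a pretrained diffusion U-Net (assumption)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

class Adapter(nn.Module):
    """Small trainable branch; zero-initialized output so training starts at identity."""
    def __init__(self, dim=64, hidden=8):
        super().__init__()
        self.down = nn.Conv2d(dim, hidden, 1)
        self.up = nn.Conv2d(hidden, dim, 1)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

backbone, adapter = FrozenBackbone(), Adapter()
for p in backbone.parameters():          # freeze all pretrained weights
    p.requires_grad_(False)

x = torch.randn(1, 64, 32, 32)
out = backbone(x) + adapter(x)           # only the adapter receives gradients

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable params: {trainable} / {total}")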

Complex Object Structure Editing. Existing works can synthesize realistic colors, styles, or textures when editing images. However, they still produce noticeable artifacts when dealing with complex structures, such as fingers, logos, and scene text. Attempts have been made to address these issues. Previous methods usually rely on negative prompting, such as “six fingers, bad leg, etc.”, to steer the model away from generating such artifacts, which is effective in certain cases but not robust enough [287]. Recent works start to use layouts, edges, or dense labels as guidance for editing the global or local structures of images [101, 103].
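Negative prompting is typically implemented through classifier-free guidance, where the unconditional branch is replaced by the noise prediction conditioned on the negative prompt, i.e., ε = ε_neg + s·(ε_pos − ε_neg); the sketch below shows this combination with dummy tensors standing in for the two U-Net predictions at one timestep:

import torch

def guided_noise(eps_pos: torch.Tensor, eps_neg: torch.Tensor, scale: float) -> torch.Tensor:
    """Classifier-free guidance with a negative prompt.

    eps_pos: noise predicted with the editing prompt
    eps_neg: noise predicted with the negative prompt
             (e.g. "six fingers, bad anatomy"), replacing the empty prompt
    """
    return eps_neg + scale * (eps_pos - eps_neg)

# Dummy predictions standing in for two U-Net forward passes at one timestep.
eps_pos = torch.randn(1, 4, 64, 64)
eps_neg = torch.randn(1, 4, 64, 64)
eps = guided_noise(eps_pos, eps_neg, scale=7.5)
print(eps.shape)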

Complex Lighting and Shadow Editing. It is still challenging to edit the lighting or shadow of an object, which requires an accurate estimation of the lighting conditions in the scene. Previous works such as Total Relighting [288] use a combination of networks to estimate the normal, albedo, and shading of the foreground object to obtain realistic relighting effects. Recently, diffusion-based models have been proposed to edit the lighting of faces (DiFaReli [289]). However, editing the lighting of portraits or general objects by leveraging the strong lighting prior of pretrained diffusion models remains an open area [290]. Similarly, ShadowDiffusion [62] explores diffusion-based shadow synthesis and can generate visually pleasing shadows of objects. However, accurately editing the shadow of an object under different background conditions using diffusion models is still an unsolved problem.

Lack of Robustness in Image Editing. Existing diffusion-based image editing models can synthesize realistic visual content for a portion of the given conditions. However, they still fail in many real-world scenarios [187]. The fundamental cause of this problem is that the model is not capable of accurately modeling all possible samples in the conditional distribution space. How to improve these models so that they consistently generate artifact-free content remains a challenge. There are several ways to mitigate this problem. First, scale up the training data to cover the challenging scenarios. This is an effective yet costly approach; in some situations, it is even very challenging to collect a sufficient amount of data, such as medical images, visual inspection data, etc. Second, adapt the model to accept more conditions, such as structural guidance [101], 3D-aware guidance [291], and textual guidance, for more controllable and deterministic content creation. Third, adopt iterative refinement or multi-stage training to progressively improve the initial results of the model [57, 266, 292].

Faithful Evaluation Metrics. Accurate evaluation is crucial for image editing to ensure that the edited content is well aligned with the given conditions. However, although some quantitative metrics such as FID [293], KID [294], LPIPS [295], CLIP Score [242], PSNR, and SSIM have been used for this task as a reference, most existing works still heavily rely on user studies to provide relatively accurate perceptual evaluation of visual results, which is neither efficient nor scalable. Faithful quantitative evaluation metrics are still an open problem. Recently, more accurate metrics for quantifying the perceptual similarity of objects have been proposed. DreamSim [296] measures the mid-level similarity of two images considering layout, pose, and semantic content, and outperforms LPIPS. Similarly, foreground feature averaging (FFA) [297] provides a simple yet effective method for measuring the similarity of objects regardless of their pose, viewpoint, lighting conditions, or background. In this paper, we also propose an effective image editing metric, LMM Score, with the help of an LMM.
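For completeness, the reference-based metrics mentioned above can be computed with standard packages; the sketch below uses scikit-image for PSNR/SSIM and the third-party lpips package for LPIPS, with random images standing in for a source/edited pair (the package choices are assumptions for illustration):

import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Random stand-ins for a source image and its edited result (H x W x 3, uint8).
rng = np.random.default_rng(0)
src = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
edit = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

psnr = peak_signal_noise_ratio(src, edit, data_range=255)
ssim = structural_similarity(src, edit, channel_axis=2, data_range=255)

# LPIPS expects NCHW tensors scaled to [-1, 1].
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
lpips_fn = lpips.LPIPS(net="alex")
dist = lpips_fn(to_tensor(src), to_tensor(edit)).item()

print(f"PSNR={psnr:.2f} dB, SSIM={ssim:.3f}, LPIPS={dist:.3f}")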

10 Conclusion

We have extensively overviewed diffusion model-based image editing methods, examining the field from multiple perspectives. Our analysis begins by categorizing over 100 methods into three main groups according to their learning strategies: training-based, test-time fine-tuning, and training and fine-tuning free approaches. We then classify image editing tasks into three distinct categories: semantic, stylistic, and structural editing, encompassing 12 specific types in total. We explore these methods and their contributions towards enhancing editing performance. An evaluation of 7 tasks alongside recent state-of-the-art methods is conducted within our image editing benchmark EditEval. Additionally, a new metric LMM Score is introduced for comparative analysis of these methods. Concluding our review, we highlight the broad potential within the image editing domain and suggest directions for future research.

References

  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, vol. 27, 2014.
  • [2] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, pp. 4401–4410, 2019.
  • [3] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, pp. 1125–1134, 2017.
  • [5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, pp. 2223–2232, 2017.
  • [6] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” in ICLR, 2019.
  • [7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [8] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in ICML, pp. 1747–1756, 2016.
  • [9] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in ICML, pp. 1530–1538, 2015.
  • [10] A. Van Den Oord, O. Vinyals, et al., “Neural discrete representation learning,” in NeurIPS, vol. 30, 2017.
  • [11] G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autoregressive flow for density estimation,” in NeurIPS, vol. 30, 2017.
  • [12] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, pp. 12873–12883, 2021.
  • [13] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML, pp. 8821–8831, 2021.
  • [14] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.
  • [15] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015.
  • [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, vol. 33, pp. 6840–6851, 2020.
  • [17] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML, pp. 8162–8171, 2021.
  • [18] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021.
  • [19] A. Hyvärinen and P. Dayan, “Estimation of non-normalized statistical models by score matching.,” JMLR, vol. 6, no. 4, 2005.
  • [20] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
  • [21] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in NeurIPS, vol. 32, 2019.
  • [22] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in ICLR, 2021.
  • [23] Y. Song and S. Ermon, “Improved techniques for training score-based generative models,” in NeurIPS, vol. 33, pp. 12438–12448, 2020.
  • [24] J. Shi, C. Wu, J. Liang, X. Liu, and N. Duan, “Divae: Photorealistic images synthesis with denoising diffusion decoder,” arXiv preprint arXiv:2206.00386, 2022.
  • [25] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  • [26] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
  • [27] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in NeurIPS, 2022.
  • [28] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [29] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, and C. Etmann, “Conditional image generation with score-based diffusion models,” arXiv preprint arXiv:2111.13606, 2021.
  • [30] F. Bao, C. Li, J. Zhu, and B. Zhang, “Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models,” in ICLR, 2022.
  • [31] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” JMLR, vol. 23, no. 1, pp. 2249–2281, 2022.
  • [32] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” in NeurIPS, vol. 34, pp. 8780–8794, 2021.
  • [33] X. Liu, D. H. Park, S. Azadi, G. Zhang, A. Chopikyan, Y. Hu, H. Shi, A. Rohrbach, and T. Darrell, “More control for free! image synthesis with semantic diffusion guidance,” in WACV, pp. 289–299, 2023.
  • [34] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” in ICML, pp. 16784–16804, 2022.
  • [35] R. Rombach, A. Blattmann, and B. Ommer, “Text-guided synthesis of artistic images with retrieval-augmented diffusion models,” arXiv preprint arXiv:2207.13038, 2022.
  • [36] A. Bansal, E. Borgnia, H.-M. Chu, J. S. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein, “Cold diffusion: Inverting arbitrary image transforms without noise,” arXiv preprint arXiv:2208.09392, 2022.  
  • [37]  C. Meng, R. Rombach, R. Gao, D. Kingma, S. Ermon, J. Ho, and T. Salimans, “On distillation of guided diffusion models,” in CVPR, pp. 14297–14306, 2023.  
  • [38]  H. Phung, Q. Dao, and A. Tran, “Wavelet diffusion models are fast and scalable image generators,” in CVPR, pp. 10199–10208, 2023.  
  • [39]  J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” in NeurIPS, 2022.  
  • [40]  U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al., “Make-a-video: Text-to-video generation without text-video data,” in ICLR, 2023.  
  • [41]  J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022.  
  • [42]  S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in ICCV, 2023.  
  • [43]  A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in CVPR, 2023.  
  • [44]  D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2022.  
  • [45]  S. Yin, C. Wu, H. Yang, J. Wang, X. Wang, M. Ni, Z. Yang, L. Li, S. Liu, F. Yang, et al., “Nuwa-xl: Diffusion over diffusion for extremely long video generation,” arXiv preprint arXiv:2303.12346, 2023.  
  • [46]  P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in ICCV, 2023.  
  • [47]  J. An, S. Zhang, H. Yang, S. Gupta, J.-B. Huang, J. Luo, and X. Yin, “Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation,” arXiv preprint arXiv:2304.08477, 2023.  
  • [48]  J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,” arXiv preprint arXiv:2308.06571, 2023.  
  • [49]  X. Li, W. Chu, Y. Wu, W. Yuan, F. Liu, Q. Zhang, F. Li, H. Feng, E. Ding, and J. Wang, “Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation,” arXiv preprint arXiv:2309.00398, 2023.  
  • [50]  Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, “Latent video diffusion models for high-fidelity video generation with arbitrary lengths,” arXiv preprint arXiv:2211.13221, 2022.
  • [51] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al., “Lavie: High-quality video generation with cascaded latent diffusion models,” arXiv preprint arXiv:2309.15103, 2023.
  • [52] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in ICCV, 2023.
  • [53] J. Lv, Y. Huang, M. Yan, J. Huang, J. Liu, Y. Liu, Y. Wen, X. Chen, and S. Chen, “Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,” arXiv preprint arXiv:2311.12631, 2023.
  • [54] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” arXiv preprint arXiv:2309.15818, 2023.
  • [55] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
  • [56] R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, “Emu video: Factorizing text-to-video generation by explicit image conditioning,” arXiv preprint arXiv:2311.10709, 2023.
  • [57] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE TPAMI, vol. 45, no. 4, pp. 4713–4726, 2022.
  • [58] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH, pp. 1–10, 2022.
  • [59] O. Özdenizci and R. Legenstein, “Restoring vision in adverse weather conditions with patch-based denoising diffusion models,” IEEE TPAMI, 2023.
  • [60] S. Shang, Z. Shan, G. Liu, and J. Zhang, “Resdiff: Combining cnn and diffusion model for image super-resolution,” arXiv preprint arXiv:2303.08714, 2023.
  • [61] S. Gao, X. Liu, B. Zeng, S. Xu, Y. Li, X. Luo, J. Liu, X. Zhen, and B. Zhang, “Implicit diffusion models for continuous super-resolution,” in CVPR, pp. 10021–10030, 2023.
  • [62] L. Guo, C. Wang, W. Yang, S. Huang, Y. Wang, H. Pfister, and B. Wen, “Shadowdiffusion: When degradation prior meets diffusion model for shadow removal,” in CVPR, pp. 14049–14058, 2023.
  • [63] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön, “Image restoration with mean-reverting stochastic differential equations,” in ICML, 2023.
  • [64] B. Xia, Y. Zhang, S. Wang, Y. Wang, X. Wu, Y. Tian, W. Yang, and L. Van Gool, “Diffir: Efficient diffusion model for image restoration,” arXiv preprint arXiv:2303.09472, 2023.
  • [65] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” in CVPR, pp. 14367–14376, 2021.
  • [66] B. Kawar, M. Elad, S. Ermon, and J. Song, “Denoising diffusion restoration models,” in NeurIPS, vol. 35, pp. 23593–23606, 2022.
  • [67] Y. Huang, J. Huang, J. Liu, M. Yan, Y. Dong, J. Lyu, C. Chen, and S. Chen, “Wavedm: Wavelet-based diffusion models for image restoration,” IEEE TMM, 2024.
  • [68] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” in ICLR, 2023.
  • [69] Z. Yue and C. C. Loy, “Difface: Blind face restoration with diffused error contraction,” arXiv preprint arXiv:2212.06512, 2022.
  • [70] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” in NeurIPS, vol. 35, pp. 25683–25696, 2022.
  • [71] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” in ICLR, 2023.
  • [72] A. Kazerouni, E. K. Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. Hacihaliloglu, and D. Merhof, “Diffusion models for medical image analysis: A comprehensive survey,” arXiv preprint arXiv:2211.07804, 2022.
  • [73] Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y.-G. Jiang, “A survey on video diffusion models,” arXiv preprint arXiv:2310.10647, 2023.
  • [74] X. Li, Y. Ren, X. Jin, C. Lan, X. Wang, W. Zeng, X. Wang, and Z. Chen, “Diffusion models for image restoration and enhancement–a comprehensive survey,” arXiv preprint arXiv:2308.09388, 2023.  
  • [75]  B. B. Moser, A. S. Shanbhag, F. Raue, S. Frolov, S. Palacio, and A. Dengel, “Diffusion models, image super-resolution and everything: A survey,” arXiv preprint arXiv:2401.00736, 2024.  
  • [76]  L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.  
  • [77]  H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion model,” arXiv preprint arXiv:2209.02646, 2022.  
  • [78]  F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE TPAMI, 2023.  
  • [79]  A. Ulhaq, N. Akhtar, and G. Pogrebna, “Efficient diffusion models for vision: A survey,” arXiv preprint arXiv:2210.09292, 2022.  
  • [80]  C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to-image diffusion model in generative ai: A survey,” arXiv preprint arXiv:2303.07909, 2023.  
  • [81]  T. Zhang, Z. Wang, J. Huang, M. M. Tasnim, and W. Shi, “A survey of diffusion based image generation models: Issues and their solutions,” arXiv preprint arXiv:2308.13142, 2023.  
  • [82]  R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa, et al., “State of the art on diffusion models for visual computing,” arXiv preprint arXiv:2310.07204, 2023.  
  • [83]  H. Koo and T. E. Kim, “A comprehensive survey on generative diffusion models for structured data,” arXiv preprint arXiv:2306.04139, 2023.
  • [84]  C. Meng, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Image synthesis and editing with stochastic differential equations,” in ICLR, 2022.  
  • [85]  M. Cao, X. Wang, Z. Qi, Y. Shan, X. Qie, and Y. Zheng, “Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” arXiv preprint arXiv:2304.08465, 2023.  
  • [86]  J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.  
  • [87]  C.-H. Chao, W.-F. Sun, B.-W. Cheng, Y.-C. Lo, C.-C. Chang, Y.-L. Liu, Y.-L. Chang, C.-P. Chen, and C.-Y. Lee, “Denoising likelihood score matching for conditional score-based data generation,” in ICLR, 2022.  
  • [88]  T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” in NeurIPS, vol. 35, pp. 26565–26577, 2022.  
  • [89]  C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” in NeurIPS, vol. 35, pp. 5775–5787, 2022.  
  • [90]  T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” in ICLR, 2022.  
  • [91]  S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in CVPR, pp. 10696–10706, 2022.  
  • [92]  D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023.  
  • [93]  J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al., “Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” arXiv preprint arXiv:2310.00426, 2023.
  • [94]  X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, et al., “Emu: Enhancing image generation models using photogenic needles in a haystack,” arXiv preprint arXiv:2309.15807, 2023.  
  • [95]  K. Sohn, N. Ruiz, K. Lee, D. C. Chin, I. Blok, H. Chang, J. Barber, L. Jiang, G. Entis, Y. Li, et al., “Styledrop: Text-to-image generation in any style,” arXiv preprint arXiv:2306.00983, 2023.  
  • [96]  Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro, et al., “ediffi: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint arXiv:2211.01324, 2022.  
  • [97]  S. Ge, T. Park, J.-Y. Zhu, and J.-B. Huang, “Expressive text-to-image generation with rich text,” in CVPR, pp. 7545–7556, 2023.  
  • [98] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, “Gligen: Open-set grounded text-to-image generation,” in CVPR, pp. 22511–22521, 2023.
  • [99] O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman, “Make-a-scene: Scene-based text-to-image generation with human priors,” in ECCV, pp. 89–106, 2022.
  • [100] O. Avrahami, T. Hayes, O. Gafni, S. Gupta, Y. Taigman, D. Parikh, D. Lischinski, O. Fried, and X. Yin, “Spatext: Spatio-textual representation for controllable image generation,” in CVPR, pp. 18370–18380, 2023.
  • [101] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, pp. 3836–3847, 2023.
  • [102] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-controlnet: All-in-one control to text-to-image diffusion models,” arXiv preprint arXiv:2305.16322, 2023.
  • [103] C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, et al., “Unicontrol: A unified diffusion model for controllable visual generation in the wild,” arXiv preprint arXiv:2305.11147, 2023.
  • [104] L. Huang, D. Chen, Y. Liu, Y. Shen, D. Zhao, and J. Zhou, “Composer: Creative and controllable image synthesis with composable conditions,” in ICML, 2023.
  • [105] C. Mou, X. Wang, L. Xie, J. Zhang, Z. Qi, Y. Shan, and X. Qie, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” arXiv preprint arXiv:2302.08453, 2023.
  • [106] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in ICLR, 2023.
  • [107] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in CVPR, pp. 22500–22510, 2023.
  • [108] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, “Multi-concept customization of text-to-image diffusion,” in CVPR, pp. 1931–1941, 2023.
  • [109] A. Voronov, M. Khoroshikh, A. Babenko, and M. Ryabinin, “Is this loss informative? faster text-to-image customization by tracking objective dynamics,” in NeurIPS, 2023.
  • [110] Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” arXiv preprint arXiv:2302.13848, 2023.
  • [111] Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, and Y. Cao, “Cones: Concept neurons in diffusion models for customized generation,” in ICML, 2023.
  • [112] W. Chen, H. Hu, Y. Li, N. Rui, X. Jia, M.-W. Chang, and W. W. Cohen, “Subject-driven text-to-image generation via apprenticeship learning,” arXiv preprint arXiv:2304.00186, 2023.
  • [113] J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized text-to-image generation without test-time finetuning,” arXiv preprint arXiv:2304.03411, 2023.
  • [114] Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, “Key-locked rank one editing for text-to-image personalization,” in ACM SIGGRAPH, pp. 1–11, 2023.
  • [115] H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, and W. Zhu, “Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation,” arXiv preprint arXiv:2305.03374, vol. 2, 2023.
  • [116] J. Gu, Y. Wang, N. Zhao, T.-J. Fu, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, et al., “Photoswap: Personalized subject swapping in images,” arXiv preprint arXiv:2305.18286, 2023.
  • [117] Z. Yuan, M. Cao, X. Wang, Z. Qi, C. Yuan, and Y. Shan, “Customnet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models,” arXiv preprint arXiv:2310.19784, 2023.
  • [118] H. Li, Y. Yang, M. Chang, H. Feng, Z. hai Xu, Q. Li, and Y. ting Chen, “Srdiff: Single image super-resolution with diffusion probabilistic models,” Neurocomputing, vol. 479, pp. 47–59, 2022.
  • [119] M. Özbey, S. U. Dar, H. A. Bedel, O. Dalmaz, Ş. Özturk, A. Güngör, and T. Çukur, “Unsupervised medical image translation with adversarial diffusion models,” arXiv preprint arXiv:2207.08208, 2022.
  • [120] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön, “Refusion: Enabling large-size realistic image restoration with latent-space diffusion models,” in CVPR, pp. 1680–1691, 2023.
  • [121] Z. Chen, Y. Zhang, D. Liu, B. Xia, J. Gu, L. Kong, and X. Yuan, “Hierarchical integration diffusion model for realistic image deblurring,” arXiv preprint arXiv:2305.12966, 2023.  
  • [122]  F. Guth, S. Coste, V. De Bortoli, and S. Mallat, “Wavelet score-based generative modeling,” in NeurIPS, vol. 35, 2022.  
  • [123]  J. Huang, Y. Liu, and S. Chen, “Bootstrap diffusion model curve estimation for high resolution low-light image enhancement,” arXiv preprint arXiv:2309.14709, 2023.  
  • [124]  J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” arXiv preprint arXiv:2305.07015, 2023.  
  • [125]  X. Lin, J. He, Z. Chen, Z. Lyu, B. Fei, B. Dai, W. Ouyang, Y. Qiao, and C. Dong, “Diffbir: Towards blind image restoration with generative diffusion prior,” arXiv preprint arXiv:2308.15070, 2023.  
  • [126]  H. Sun, W. Li, J. Liu, H. Chen, R. Pei, X. Zou, Y. Yan, and Y. Yang, “Coser: Bridging image and language for cognitive super-resolution,” arXiv preprint arXiv:2311.16512, 2023.
  • [127] F. Yu, J. Gu, Z. Li, J. Hu, X. Kong, X. Wang, J. He, Y. Qiao, and C. Dong, “Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild,” arXiv preprint arXiv:2401.13627, 2024.
  • [128] J. Song, A. Vahdat, M. Mardani, and J. Kautz, “Pseudoinverse-guided diffusion models for inverse problems,” in ICLR, 2022.
  • [129] J. Schwab, S. Antholzer, and M. Haltmeier, “Deep null space learning for inverse problems: convergence analysis and rates,” Inverse Problems, vol. 35, no. 2, p. 025008, 2019.
  • [130] Y. Wang, Y. Hu, J. Yu, and J. Zhang, “Gan prior based null-space learning for consistent super-resolution,” in AAAI, vol. 37, pp. 2724–2732, 2023.
  • [131] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in CVPR, pp. 2426–2435, 2022.
  • [132] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in ICLR, 2023.
  • [133] N. Starodubcev, D. Baranchuk, V. Khrulkov, and A. Babenko, “Towards real-time text-driven image manipulation with unconditional diffusion models,” arXiv preprint arXiv:2304.04344, 2023.
  • [134] N. Huang, Y. Zhang, F. Tang, C. Ma, H. Huang, W. Dong, and C. Xu, “Diffstyler: Controllable dual diffusion for text-driven image stylization,” IEEE TNNLS, 2024.
  • [135] Z. Wang, L. Zhao, and W. Xing, “Stylediffusion: Controllable disentangled style transfer via diffusion models,” in ICCV, pp. 7677–7689, 2023.
  • [136] H. Sasaki, C. G. Willcocks, and T. P. Breckon, “Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models,” arXiv preprint arXiv:2104.05358, 2021.
  • [137] S. Xu, Z. Ma, Y. Huang, H. Lee, and J. Chai, “Cyclenet: Rethinking cycle consistency in text-guided diffusion for image manipulation,” in NeurIPS, 2023.
  • [138] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable representation,” in CVPR, pp. 10619–10629, 2022.
  • [139] Z. Lu, C. Wu, X. Chen, Y. Wang, L. Bai, Y. Qiao, and X. Liu, “Hierarchical diffusion autoencoders and disentangled image manipulation,” in WACV, pp. 5374–5383, 2024.
  • [140] M. Zhao, F. Bao, C. Li, and J. Zhu, “Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations,” in NeurIPS, vol. 35, pp. 3609–3623, 2022.
  • [141] N. Matsunaga, M. Ishii, A. Hayakawa, K. Suzuki, and T. Narihira, “Fine-grained image editing by pixel-wise guidance using diffusion models,” in CVPR workshop, 2023.
  • [142] B. Yang, S. Gu, B. Zhang, T. Zhang, X. Chen, X. Sun, D. Chen, and F. Wen, “Paint by example: Exemplar-based image editing with diffusion models,” in CVPR, pp. 18381–18391, 2023.
  • [143] K. Kim, S. Park, J. Lee, and J. Choo, “Reference-based image composition with sketch via structure-aware diffusion model,” in CVPR workshop, 2023.
  • [144] Y. Song, Z. Zhang, Z. Lin, S. Cohen, B. Price, J. Zhang, S. Y. Kim, and D. Aliaga, “Objectstitch: Object compositing with diffusion model,” in CVPR, pp. 18310–18319, 2023.
  • [145] X. Zhang, J. Guo, P. Yoo, Y. Matsuo, and Y. Iwasawa, “Paste, inpaint and harmonize via denoising: Subject-driven image editing with pre-trained diffusion model,” arXiv preprint arXiv:2306.07596, 2023.
  • [146] S. Xie, Y. Zhao, Z. Xiao, K. C. Chan, Y. Li, Y. Xu, K. Zhang, and T. Hou, “Dreaminpainter: Text-guided subject-driven image inpainting with diffusion models,” arXiv preprint arXiv:2312.03771, 2023.
  • [147] X. Chen, L. Huang, Y. Liu, Y. Shen, D. Zhao, and H. Zhao, “Anydoor: Zero-shot object-level image customization,” in CVPR, 2024.
  • [148] X. Chen and S. Lathuilière, “Face aging via diffusion-based editing,” in BMCV, 2023.
  • [149] V. Goel, E. Peruzzo, Y. Jiang, D. Xu, N. Sebe, T. Darrell, Z. Wang, and H. Shi, “Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models,” arXiv preprint arXiv:2303.17546, 2023.
  • [150] S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in CVPR, pp. 22428–22437, 2023.
  • [151] Z. Zhang, J. Zheng, Z. Fang, and B. A. Plummer, “Text-to-image editing by image information removal,” in WACV, pp. 5232–5241, 2024.
  • [152] J. Zhuang, Y. Zeng, W. Liu, C. Yuan, and K. Chen, “A task is worth one word: Learning with task prompts for high-quality versatile image inpainting,” arXiv preprint arXiv:2312.03594, 2023.  
  • [153]  S. Wang, C. Saharia, C. Montgomery, J. Pont-Tuset, S. Noy, S. Pellegrini, Y. Onoe, S. Laszlo, D. J. Fleet, R. Soricut, et al., “Imagen editor and editbench: Advancing and evaluating text-guided image inpainting,” in CVPR, pp. 18359–18369, 2023.  
  • [154]  J. Singh, J. Zhang, Q. Liu, C. Smith, Z. Lin, and L. Zheng, “Smartmask: Context aware high-fidelity mask generation for fine-grained object insertion and layout control,” in CVPR, 2024.  
  • [155]  S. Yang, X. Chen, and J. Liao, “Uni-paint: A unified framework for multimodal image inpainting with pretrained diffusion model,” in ACM MM, pp. 3190–3199, 2023.  
  • [156]  T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in CVPR, pp. 18392–18402, 2023.  
  • [157]  S. Li, C. Chen, and H. Lu, “Moecontroller: Instruction-based arbitrary image manipulation with mixture-of-expert controllers,” arXiv preprint arXiv:2309.04372, 2023.  
  • [158]  Q. Guo and T. Lin, “Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation,” in CVPR, 2024.  
  • [159]  T. Chakrabarty, K. Singh, A. Saakyan, and S. Muresan, “Learning to follow object-centric image editing instructions faithfully,” in EMNLP, pp. 9630–9646, 2023.  
  • [160]  Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Hu, D. Chen, and B. Guo, “Instructdiffusion: A generalist modeling interface for vision tasks,” in CVPR, 2024.  
  • [161]  S. Sheynin, A. Polyak, U. Singer, Y. Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y. Taigman, “Emu edit: Precise image editing via recognition and generation tasks,” arXiv preprint arXiv:2311.10089, 2023.  
  • [162]  J. Wei, S. Wu, X. Jiang, and Y. Wang, “Dialogpaint: A dialog-based image editing model,” arXiv preprint arXiv:2303.10073, 2023.  
  • [163]  A. B. Yildirim, V. Baday, E. Erdem, A. Erdem, and A. Dundar, “Inst-inpaint: Instructing to remove objects with diffusion models,” arXiv preprint arXiv:2304.03246, 2023.  
  • [164]  S. Zhang, X. Yang, Y. Feng, C. Qin, C.-C. Chen, N. Yu, Z. Chen, H. Wang, S. Savarese, S. Ermon, et al., “Hive: Harnessing human feedback for instructional visual editing,” in CVPR, 2024.  
  • [165]  S. Yasheng, Y. Yang, H. Peng, Y. Shen, Y. Yang, H. Hu, L. Qiu, and H. Koike, “Imagebrush: Learning visual in-context instructions for exemplar-based image manipulation,” in NeurIPS, 2023.  
  • [166]  S. Li, H. Singh, and A. Grover, “Instructany2pix: Flexible visual editing via multimodal instruction following,” arXiv preprint arXiv:2312.06738, 2023.  
  • [167]  T.-J. Fu, W. Hu, X. Du, W. Y. Wang, Y. Yang, and Z. Gan, “Guiding instruction-based image editing via multimodal large language models,” in ICLR, 2024.  
  • [168]  Y. Huang, L. Xie, X. Wang, Z. Yuan, X. Cun, Y. Ge, J. Zhou, C. Dong, R. Huang, R. Zhang, and Y. Shan, “Smartedit: Exploring complex instruction-based image editing with multimodal large language models,” in CVPR, 2024.  
  • [169]  R. Bodur, E. Gundogdu, B. Bhattarai, T.-K. Kim, M. Donoser, and L. Bazzani, “iedit: Localised text-guided image editing with weak supervision,” arXiv preprint arXiv:2305.05947, 2023.  
  • [170]  Y. Lin, Y.-W. Chen, Y.-H. Tsai, L. Jiang, and M.-H. Yang, “Text-driven image editing via learnable regions,” in CVPR, 2024.  
  • [171]  D. Yue, Q. Guo, M. Ning, J. Cui, Y. Zhu, and L. Yuan, “Chatface: Chat-guided real face editing via diffusion latent space manipulation,” arXiv preprint arXiv:2305.14742, 2023.  
  • [172]  D. Valevski, M. Kalman, Y. Matias, and Y. Leviathan, “Unitune: Text-driven image editing by fine tuning an image generation model on a single image,” ACM TOG, 2023.  
  • [173]  J. Choi, Y. Choi, Y. Kim, J. Kim, and S. Yoon, “Custom-edit: Text-guided image editing with customized diffusion models,” in CVPR workshop, 2023.  
  • [174]  J. Huang, Y. Liu, J. Qin, and S. Chen, “Kv inversion: Kv embeddings learning for text-conditioned real image action editing,” arXiv preprint arXiv:2309.16608, 2023.  
  • [175]  R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in CVPR, pp. 6038–6047, 2023.  
  • [176]  K. Wang, F. Yang, S. Yang, M. A. Butt, and J. van de Weijer, “Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing,” in NeurIPS, 2023.  
  • [177]  Q. Wu, Y. Liu, H. Zhao, A. Kale, T. Bui, T. Yu, Z. Lin, Y. Zhang, and S. Chang, “Uncovering the disentanglement capability in text-to-image diffusion models,” in CVPR, pp. 1900–1910, June 2023.  
  • [178] W. Dong, S. Xue, X. Duan, and S. Han, “Prompt tuning inversion for text-driven image editing using diffusion models,” in ICCV, 2023.
  • [179] S. Li, J. van de Weijer, T. Hu, F. S. Khan, Q. Hou, Y. Wang, and J. Yang, “Stylediffusion: Prompt-embedding inversion for text-based editing,” arXiv preprint arXiv:2303.15649, 2023.
  • [180] Y. Zhang, N. Huang, F. Tang, H. Huang, C. Ma, W. Dong, and C. Xu, “Inversion-based style transfer with diffusion models,” in CVPR, pp. 10146–10156, June 2023.
  • [181] C. Mou, X. Wang, J. Song, Y. Shan, and J. Zhang, “Dragondiffusion: Enabling drag-style manipulation on diffusion models,” in ICLR, 2024.
  • [182] Y. Shi, C. Xue, J. Pan, W. Zhang, V. Y. Tan, and S. Bai, “Dragdiffusion: Harnessing diffusion models for interactive point-based image editing,” in CVPR, 2024.
  • [183] A. Hertz, K. Aberman, and D. Cohen-Or, “Delta denoising score,” in ICCV, pp. 2328–2337, 2023.
  • [184] G. Kwon and J. C. Ye, “Diffusion-based image translation using disentangled style and content representation,” in ICLR, 2023.
  • [185] H. Nam, G. Kwon, G. Y. Park, and J. C. Ye, “Contrastive denoising score for text-guided latent diffusion image editing,” in CVPR, 2024.
  • [186] S. Yang, L. Zhang, L. Ma, Y. Liu, J. Fu, and Y. He, “Magicremover: Tuning-free text-guided image inpainting with diffusion models,” arXiv preprint arXiv:2310.02848, 2023.
  • [187] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in CVPR, pp. 6007–6017, June 2023.
  • [188] P. Li, Q. Huang, Y. Ding, and Z. Li, “Layerdiffusion: Layered controlled image editing with diffusion models,” in SIGGRAPH Asia 2023 Technical Communications, pp. 1–4, 2023.
  • [189] S. Zhang, S. Xiao, and W. Huang, “Forgedit: Text guided image editing via learning and forgetting,” arXiv preprint arXiv:2309.10556, 2023.
  • [190] Z. Zhang, L. Han, A. Ghosh, D. N. Metaxas, and J. Ren, “Sine: Single image editing with text-to-image diffusion models,” in CVPR, pp. 6027–6037, 2023.
  • [191] H. Ravi, S. Kelkar, M. Harikumar, and A. Kale, “Preditor: Text guided image editing with diffusion prior,” arXiv preprint arXiv:2302.07979, 2023.
  • [192] Y. Lin, S. Zhang, X. Yang, X. Wang, and Y. Shi, “Regeneration learning of diffusion models with rich prompts for zero-shot image translation,” arXiv preprint arXiv:2305.04651, 2023.
  • [193] S. Kim, W. Jang, H. Kim, J. Kim, Y. Choi, S. Kim, and G. Lee, “User-friendly image editing with minimal text input: Leveraging captioning and injection techniques,” arXiv preprint arXiv:2306.02717, 2023.
  • [194] Q. Wang, B. Zhang, M. Birsak, and P. Wonka, “Instructedit: Improving automatic masks for diffusion-based image editing with user instructions,” arXiv preprint arXiv:2305.18047, 2023.
  • [195] A. Elarabawy, H. Kamath, and S. Denton, “Direct inversion: Optimization-free text-driven real image editing with diffusion models,” arXiv preprint arXiv:2211.07825, 2022.
  • [196] I. Huberman-Spiegelglas, V. Kulikov, and T. Michaeli, “An edit friendly ddpm noise space: Inversion and manipulations,” in CVPR, 2024.
  • [197] S. Nie, H. A. Guo, C. Lu, Y. Zhou, C. Zheng, and C. Li, “The blessing of randomness: Sde beats ode in general diffusion-based image editing,” in ICLR, 2024.
  • [198] M. Brack, F. Friedrich, K. Kornmeier, L. Tsaban, P. Schramowski, K. Kersting, and A. Passos, “Ledits++: Limitless image editing using text-to-image models,” in CVPR, 2024.
  • [199] S. Chen and J. Huang, “Fec: Three finetuning-free methods to enhance consistency for real image editing,” arXiv preprint arXiv:2309.14934, 2023.
  • [200] K. Joseph, P. Udhayanan, T. Shukla, A. Agarwal, S. Karanam, K. Goswami, and B. V. Srinivasan, “Iterative multi-granular image editing using diffusion models,” in WACV, pp. 8107–8116, 2024.
  • [201] D. Miyake, A. Iohara, Y. Saito, and T. Tanaka, “Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models,” arXiv preprint arXiv:2305.16807, 2023.  
  • [202]  L. Han, S. Wen, Q. Chen, Z. Zhang, K. Song, M. Ren, R. Gao, A. Stathopoulos, X. He, Y. Chen, et al., “Proxedit: Improving tuning-free real image editing with proximal guidance,” in WACV, pp. 4291–4301, 2024.  
  • [203]  J. Zhao, H. Zheng, C. Wang, L. Lan, W. Huang, and W. Yang, “Null-text guidance in diffusion models is secretly a cartoon-style creator,” in ACM MM, pp. 5143–5152, 2023.  
  • [204]  B. Wallace, A. Gokul, and N. Naik, “Edict: Exact diffusion inversion via coupled transformations,” in CVPR, pp. 22532–22541, 2023.  
  • [205]  Z. Pan, R. Gherardi, X. Xie, and S. Huang, “Effective real image editing with accelerated iterative diffusion inversion,” in ICCV, pp. 15912–15921, 2023.  
  • [206]  C. H. Wu and F. De la Torre, “A latent space of stochastic diffusion models for zero-shot image editing and guidance,” in ICCV, pp. 7378–7387, 2023.  
  • [207]  J. Jeong, M. Kwon, and Y. Uh, “Training-free content injection using h-space in diffusion models,” in WACV, pp. 5151–5161, 2024.  
  • [208]  B. Meiri, D. Samuel, N. Darshan, G. Chechik, S. Avidan, and R. Ben-Ari, “Fixed-point inversion for text-to-image diffusion models,” arXiv preprint arXiv:2312.12540, 2023.  
  • [209]  X. Duan, S. Cui, G. Kang, B. Zhang, Z. Fei, M. Fan, and J. Huang, “Tuning-free inversion-enhanced control for consistent image editing,” arXiv preprint arXiv:2312.14611, 2023.  
  • [210]  P. Gholami and R. Xiao, “Diffusion brush: A latent diffusion model-based editing tool for ai-generated images,” arXiv preprint arXiv:2306.00219, 2023.  
  • [211]  D. Epstein, A. Jabri, B. Poole, A. A. Efros, and A. Holynski, “Diffusion self-guidance for controllable image generation,” in NeurIPS, 2023.  
  • [212]  A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-or, “Prompt-to-prompt image editing with cross-attention control,” in ICLR, 2023.  
  • [213] G. Parmar, K. Kumar Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, “Zero-shot image-to-image translation,” in ACM SIGGRAPH, pp. 1–11, 2023.
  • [214] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in CVPR, pp. 1921–1930, 2023.
  • [215] S. Lu, Y. Liu, and A. W.-K. Kong, “Tf-icon: Diffusion-based training-free cross-domain image composition,” in ICCV, pp. 2294–2305, 2023.
  • [216] O. Patashnik, D. Garibi, I. Azuri, H. Averbuch-Elor, and D. Cohen-Or, “Localizing object-level shape variations with text-to-image diffusion models,” in ICCV, pp. 23051–23061, 2023.
  • [217] H. Lee, M. Kang, and B. Han, “Conditional score guidance for text-driven image-to-image translation,” in NeurIPS, 2023.
  • [218] G. Y. Park, J. Kim, B. Kim, S. W. Lee, and J. C. Ye, “Energy-based cross attention for bayesian context update in text-to-image diffusion models,” in NeurIPS, 2023.
  • [219] D. H. Park, G. Luo, C. Toste, S. Azadi, X. Liu, M. Karalashvili, A. Rohrbach, and T. Darrell, “Shape-guided diffusion with inside-outside attention,” in WACV, pp. 4198–4207, 2024.
  • [220] H. Manukyan, A. Sargsyan, B. Atanyan, Z. Wang, S. Navasardyan, and H. Shi, “Hd-painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models,” arXiv preprint arXiv:2312.14091, 2023.
  • [221] Z. Yu, H. Li, F. Fu, X. Miao, and B. Cui, “Fisedit: Accelerating text-to-image editing via cache-enabled sparse diffusion inference,” in AAAI, 2024.
  • [222] O. Avrahami, O. Fried, and D. Lischinski, “Blended latent diffusion,” ACM TOG, vol. 42, no. 4, pp. 1–11, 2023.
  • [223] W. Huang, S. Tu, and L. Xu, “Pfb-diff: Progressive feature blending diffusion for text-driven image editing,” arXiv preprint arXiv:2306.16894, 2023.
  • [224] G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “Diffedit: Diffusion-based semantic image editing with mask guidance,” in ICLR, 2023.
  • [225] N. Huang, F. Tang, W. Dong, T.-Y. Lee, and C. Xu, “Region-aware diffusion for zero-shot text-driven image editing,” arXiv preprint arXiv:2302.11797, 2023.
  • [226] Z. Liu, F. Zhang, J. He, J. Wang, Z. Wang, and L. Cheng, “Text-guided mask-free local image retouching,” in ICME, pp. 2783–2788, 2023.
  • [227] E. Levin and O. Fried, “Differential diffusion: Giving each pixel its strength,” arXiv preprint arXiv:2306.00950, 2023.
  • [228] A. Mirzaei, T. Aumentado-Armstrong, M. A. Brubaker, J. Kelly, A. Levinshtein, K. G. Derpanis, and I. Gilitschenski, “Watch your steps: Local image and scene editing by text instructions,” arXiv preprint arXiv:2308.08947, 2023.
  • [229] O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” in CVPR, pp. 18208–18218, 2022.
  • [230] S. Li, B. Zeng, Y. Feng, S. Gao, X. Liu, J. Liu, L. Lin, X. Tang, Y. Hu, J. Liu, and B. Zhang, “Zone: Zero-shot instruction-guided local editing,” in CVPR, 2024.
  • [231] T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen, “Inpaint anything: Segment anything meets image inpainting,” arXiv preprint arXiv:2304.06790, 2023.
  • [232] M. Brack, P. Schramowski, F. Friedrich, D. Hintersdorf, and K. Kersting, “The stable artist: Steering semantics in diffusion latent space,” arXiv preprint arXiv:2212.06013, 2022.
  • [233] M. Brack, F. Friedrich, D. Hintersdorf, L. Struppek, P. Schramowski, and K. Kersting, “Sega: Instructing text-to-image models using semantic guidance,” in NeurIPS, 2023.
  • [234] L. Tsaban and A. Passos, “Ledits: Real image editing with ddpm inversion and semantic guidance,” arXiv preprint arXiv:2307.00522, 2023.
  • [235] Z. Yang, D. Gui, W. Wang, H. Chen, B. Zhuang, and C. Shen, “Object-aware inversion and reassembly for image editing,” in ICLR, 2024.  
  • [236]  T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in ICLR, 2018.  
  • [237]  Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in CVPR, pp. 8188–8197, 2020.  
  • [238]  F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
  • [239] WikiArt volunteer team, “WikiArt dataset.” https://www.wikiart.org/.
  • [240] R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Stylegan-nada: Clip-guided domain adaptation of image generators,” ACM TOG, vol. 41, no. 4, pp. 1–13, 2022.
  • [241] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in ICCV, pp. 2085–2094, 2021.
  • [242] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, pp. 8748–8763, 2021.
  • [243] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of gans for semantic face editing,” in CVPR, pp. 9243–9252, 2020.
  • [244] R. Abdal, Y. Qin, and P. Wonka, “Image2stylegan++: How to edit the embedded images?,” in CVPR, pp. 8296–8305, 2020.
  • [245] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” in NeurIPS, vol. 33, pp. 1877–1901, 2020.
  • [246] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick, “Segment anything,” in ICCV, pp. 4015–4026, 2023.
  • [247] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, vol. 35, pp. 24824–24837, 2022.
  • [248] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023.
  • [249] K. Zhang, L. Mo, W. Chen, H. Sun, and Y. Su, “Magicbrush: A manually annotated dataset for instruction-guided image editing,” arXiv preprint arXiv:2306.10012, 2023.
  • [250] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [251] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language model with self generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
  • [252] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y.-L. Boureau, et al., “Recipes for building an open-domain chatbot,” in ACL, pp. 300–325, 2021.
  • [253] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR, pp. 6700–6709, 2019.
  • [254] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2.” https://github.com/facebookresearch/detectron2, 2019.
  • [255] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in ECCV, 2022.
  • [256] Y. Zeng, Z. Lin, H. Lu, and V. M. Patel, “Cr-fill: Generative image inpainting with auxiliary contextual reconstruction,” in ICCV, pp. 14164–14173, 2021.
  • [257] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” in CVPR, pp. 15180–15190, 2023.
  • [258] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  • [259] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, vol. 36, 2023.
  • [260] T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in CVPR, pp. 7086–7096, 2022.
  • [261]  H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in ICLR, 2023.  
  • [262]  X. Pan, A. Tewari, T. Leimkühler, L. Liu, A. Meka, and C. Theobalt, “Drag your gan: Interactive point-based manipulation on the generative image manifold,” in ACM SIGGRAPH, pp. 1–11, 2023.  
  • [263]  J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, pp. 12888–12900, 2022.  
  • [264]  M. A. Chan, S. I. Young, and C. A. Metzler, “Sud²: Supervision by denoising diffusion models for image reconstruction,” arXiv preprint arXiv:2303.09642, 2023.  
  • [265]  S. I. Young, A. V. Dalca, E. Ferrante, P. Golland, C. A. Metzler, B. Fischl, and J. E. Iglesias, “Supervision by denoising,” IEEE TPAMI, 2023.  
  • [266]  A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in CVPR, pp. 11461–11471, 2022.  
  • [267]  A. Grechka, G. Couairon, and M. Cord, “Gradpaint: Gradient-guided inpainting with diffusion models,” arXiv preprint arXiv:2309.09614, 2023.  
  • [268]  H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” in ICLR, 2023.  
  • [269]  J. Song, A. Vahdat, M. Mardani, and J. Kautz, “Pseudoinverse-guided diffusion models for inverse problems,” in ICLR, 2023.  
  • [270]  B. Fei, Z. Lyu, L. Pan, J. Zhang, W. Yang, T. Luo, B. Zhang, and B. Dai, “Generative diffusion prior for unified image restoration and enhancement,” in CVPR, pp. 9935–9946, 2023.  
  • [271]  G. Zhang, J. Ji, Y. Zhang, M. Yu, T. Jaakkola, and S. Chang, “Towards coherent image inpainting using denoising diffusion implicit models,” in ICML, pp. 41164–41193, 2023.  
  • [272]  Z. Fabian, B. Tinaz, and M. Soltanolkotabi, “Diracdiffusion: Denoising and incremental reconstruction with assured data-consistency,” arXiv preprint arXiv:2303.14353, 2023.  
  • [273]  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, pp. 4510–4520, 2018.  
  • [274]  J. Xu, S. Motamed, P. Vaddamanu, C. H. Wu, C. Haene, J.-C. Bazin, and F. De la Torre, “Personalized face inpainting with diffusion models by parallel visual attention,” in WACV, pp. 5432–5442, 2024.  
  • [275]  R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in WACV, pp. 2149–2159, 2022.  
  • [276]  S. Basu, M. Saberi, S. Bhardwaj, A. M. Chegini, D. Massiceti, M. Sanjabi, S. X. Hu, and S. Feizi, “Editval: Benchmarking diffusion based text-guided image editing methods,” arXiv preprint arXiv:2310.02426, 2023.  
  • [277]  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, pp. 740–755, 2014.  
  • [278]  J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., “Improving image generation with better captions,” Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, no. 3, 2023.  
  • [279]  J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” in EMNLP, pp. 7514–7528, 2021.
  • [280] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” arXiv preprint arXiv:2311.17042, 2023.
  • [281] Z. Geng, A. Pokle, and J. Z. Kolter, “One-step diffusion distillation via deep equilibrium models,” in NeurIPS, vol. 36, 2023.
  • [282] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” in ICML, 2023.
  • [283] S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao, “Latent consistency models: Synthesizing high-resolution images with few-step inference,” arXiv preprint arXiv:2310.04378, 2023.
  • [284] Y. Zhao, Y. Xu, Z. Xiao, and T. Hou, “Mobilediffusion: Subsecond text-to-image generation on mobile devices,” arXiv preprint arXiv:2311.16567, 2023.
  • [285] Y. Li, H. Wang, Q. Jin, J. Hu, P. Chemerys, Y. Fu, Y. Wang, S. Tulyakov, and J. Ren, “Snapfusion: Text-to-image diffusion model on mobile devices within two seconds,” in NeurIPS, vol. 36, 2023.
  • [286] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [287] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,” in CVPR, pp. 22522–22531, 2023.
  • [288] R. Pandey, S. O. Escolano, C. Legendre, C. Haene, S. Bouaziz, C. Rhemann, P. Debevec, and S. Fanello, “Total relighting: learning to relight portraits for background replacement,” ACM TOG, vol. 40, no. 4, pp. 1–21, 2021.
  • [289] P. Ponglertnapakorn, N. Tritrong, and S. Suwajanakorn, “Difareli: Diffusion face relighting,” arXiv preprint arXiv:2304.09479, 2023.
  • [290] M. Ren, W. Xiong, J. S. Yoon, Z. Shu, J. Zhang, H. Jung, G. Gerig, and H. Zhang, “Relightful harmonization: Lighting-aware portrait background replacement,” arXiv preprint arXiv:2312.06886, 2023.
  • [291] J. Xiang, J. Yang, B. Huang, and X. Tong, “3d-aware image generation using 2d diffusion models,” arXiv preprint arXiv:2303.17905, 2023.
  • [292] J. Ackermann and M. Li, “High-resolution image editing via multi-stage blended diffusion,” arXiv preprint arXiv:2210.12965, 2022.
  • [293] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, vol. 30, 2017.
  • [294] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying mmd gans,” arXiv preprint arXiv:1801.01401, 2018.
  • [295] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, pp. 586–595, 2018.
  • [296] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola, “Dreamsim: Learning new dimensions of human visual similarity using synthetic data,” arXiv preprint arXiv:2306.09344, 2023.
  • [297] K. Kotar, S. Tian, H.-X. Yu, D. Yamins, and J. Wu, “Are these the same apple? comparing images based on object intrinsics,” in NeurIPS, vol. 36, 2023.