
License: arXiv.org perpetual non-exclusive license
arXiv:2402.17525v2 [cs.CV] 16 Mar 2024

Diffusion Model-Based Image Editing: A Survey

Yi Huang*, Jiancheng Huang*, Yifan Liu*, Mingfu Yan*, Jiaxi Lv*, Jianzhuang Liu*,
Wei Xiong, He Zhang, Shifeng Chen, and Liangliang Cao

Y. Huang, J. Huang, M. Yan, and J. Lv are with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, and also with University of Chinese Academy of Sciences, Beijing, China. (E-mail: yi.huang@siat.ac.cn) Y. Liu is with Southern University of Science and Technology, Shenzhen, China, and also with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. J. Liu and S. Chen are with Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. (E-mail: jz.liu@siat.ac.cn, shifeng.chen@siat.ac.cn) W. Xiong and H. Zhang are with Adobe Inc, San Jose, USA. L. Cao is with Apple Inc, Cupertino, USA. * denotes equal contributions. S. Chen is the corresponding author.
Abstract

Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve into a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, and explore both earlier traditional context-driven and current multimodal conditional methods, offering a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we address current limitations and envision some potential directions for future research. The accompanying repository is released at https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.

Index Terms:
Diffusion Model, Image Editing, AIGC
Figure 1: Statistical overview of research publications in diffusion model-based image editing. Top: learning strategies. Middle: input conditions. Bottom: editing tasks.

1 Introduction

In the realm of AI-generated content (AIGC), where artificial intelligence is leveraged to create and modify digital content, image editing is recognized as a significant area of innovation and practical applications. Different from image generation that creates new images from minimal inputs, image editing involves altering the appearance, structure, or content of an image, encompassing a range of changes from subtle adjustments to major transformations. This research is fundamental in various areas, including digital media, advertising, and scientific research, where altering visual content is essential. The evolution of image editing has reflected the advancements in digital technology, progressing from manual, labor-intensive processes to advanced digital techniques driven by learning-based algorithms. A pivotal advancement in this evolution is the introduction of Generative Adversarial Networks (GANs) [1, 2, 3, 4, 5, 6], which significantly enhance the possibilities for creative image manipulation. More recently, diffusion models have emerged in AIGC [7, 8, 9, 10, 11, 12, 13, 14, 1, 15], leading to notable breakthroughs in visual generation tasks. Diffusion models, inspired by principles from non-equilibrium thermodynamics [15], work by gradually adding noise to data and then learning to reverse this process from random noise until generating desired data matching the source data distribution. They can be roughly classified into denoising diffusion based [15, 16, 17, 18] and score-matching based [19, 20, 21, 22, 23]. Their adaptability and effectiveness have led to widespread applications across various tasks such as image generation [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], video generation [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56], image restoration [57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], and image editing.

The application of diffusion models in image editing has experienced a surge in interest, evidenced by the significant number of research publications in this field over recent years. This growing attention highlights the potential and versatility of diffusion models in improving image editing performance compared to previous works. Given this significant advancement, it is essential to systematically review and summarize these contributions. However, existing survey literature on diffusion models concentrates on other specific visual tasks [72, 73, 74, 75], such as video applications [73] or image restoration and enhancement [74, 75]. Some surveys that mention image editing tend to provide only a cursory overview [76, 77, 78, 79, 80, 81, 82, 83], missing a detailed and focused exploration of the methods. To bridge this gap, we conduct a survey to offer an in-depth and comprehensive analysis that is distinctly focused on image editing. We delve deeply into the methodologies, input conditions, and a wide range of editing tasks achieved by diffusion models in this field. The survey critically reviews over 100 research papers, organizing them into three primary classes in terms of learning strategies: training-based approaches, testing-time finetuning approaches, and training and finetuning free approaches. Each class is further divided based on its core techniques, with detailed discussions presented in Sections 45, and 6, respectively. We also explore 10 distinct types of input conditions used in these methods, including text, mask, reference (Ref.) image, class, layout, pose, sketch, segmentation (Seg.) map, audio, and dragging points to show the adaptability of diffusion models in diverse image editing scenarios. Furthermore, our survey presents a new classification of image editing tasks into three broad categories, semantic editing, stylistic editing, and structural editing, covering 12 specific types. Fig. 1 visually represents the statistical distributions of research across learning strategies, input conditions, and editing task categories. In addition, we pay special attention to inpainting and outpainting, which together stand out as a unique type of editing. We explore both earlier traditional and current multimodal conditional methods, offering a comprehensive analysis of their methodologies in Section 7. We also introduce EditEval, a benchmark designed to evaluate text-guided image editing algorithms, as detailed in Section 8. In particular, an effective evaluation metric, LMM Score, is proposed by leveraging the advanced visual-language understanding capabilities of large multimodal models (LMMs). Finally, we present some current challenges and potential future trends as an outlook in Section 9.

In summary, this survey aims to systematically categorize and critically assess the extensive body of research in diffusion model-based image editing. Our goal is to provide a comprehensive resource that not only synthesizes current findings but also guides future research directions in this rapidly advancing field.

2 Background

2.1 Diffusion Models

Diffusion models have exerted a profound influence on the field of generative AI, giving rise to a plethora of approaches falling under their umbrella. Essentially, these models are grounded in a pivotal principle known as diffusion, which gradually adds noise to data samples of some distribution, transforming them into a predefined, typically simple, distribution such as a Gaussian; this process is then reversed iteratively to generate data matching the original distribution. What distinguishes diffusion models from earlier generative models is their dynamic execution across iterative time steps, covering both forward and backward movements in time.

For each time step $t$, the noisy latent $\mathbf{z}_t$ delineates the current state. The time step $t$ increments progressively during the forward diffusion process and decreases towards 0 during the reverse diffusion process. Notably, the literature lacks a distinct differentiation between $\mathbf{z}_t$ in forward diffusion and $\mathbf{z}_t$ in reverse diffusion. In the context of forward diffusion, let $\mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{z}_{t-1})$, and in reverse diffusion, let $\mathbf{z}_{t-1} \sim p(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$. Herein, we denote $T$ with $0 < t \leq T$ as the maximal time step for finite cases. The initial data distribution at $t = 0$ is represented by $\mathbf{z}_0 \sim q(\mathbf{z}_0)$, which is slowly contaminated by additive noise. Diffusion models gradually eliminate the noise via a parameterized model $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ in the reverse time direction. This model approximates the ideal, albeit unattainable, denoised distribution $p(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$. Denoising Diffusion Probabilistic Models (DDPMs), as introduced in Ho et al. [16], effectively utilize Markov chains to facilitate both the forward and backward processes over a finite series of time steps.

Forward Diffusion Process. This process transforms the data distribution into a predefined distribution, such as the Gaussian distribution. The transformation is represented as:

$q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t \mid \sqrt{1-\beta_t}\,\mathbf{z}_{t-1}, \beta_t\mathbf{I})$,   (1)

where the set of hyper-parameters $0 < \beta_{1:T} < 1$ indicates the noise variance introduced at each sequential step. This diffusion process can be briefly expressed via a single-step equation:

$q(\mathbf{z}_t \mid \mathbf{z}_0) = \mathcal{N}(\mathbf{z}_t \mid \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0, (1-\bar{\alpha}_t)\mathbf{I})$,   (2)

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, as elaborated in Sohl-Dickstein et al. [15]. As a result, bypassing the need to consider intermediate time steps, $\mathbf{z}_t$ can be directly sampled by:

$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.   (3)
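
The single-step form of Eq. 3 is what makes training practical: $\mathbf{z}_t$ for any $t$ can be drawn without simulating the intermediate chain. The following is a minimal PyTorch sketch of this sampling, assuming a linear $\beta$ schedule; the schedule bounds and helper names are illustrative rather than taken from any specific paper.

    import torch

    def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
        # Linear beta schedule (an assumed choice); returns alpha_bar_t for t = 1..T.
        betas = torch.linspace(beta_start, beta_end, T)
        return torch.cumprod(1.0 - betas, dim=0)

    def forward_diffuse(z0, t, alpha_bar):
        # Sample z_t directly from z_0 via Eq. 3, bypassing intermediate steps.
        eps = torch.randn_like(z0)                    # eps ~ N(0, I)
        a = alpha_bar[t - 1]                          # alpha_bar_t
        zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps
        return zt, eps                                # eps is reused as the target in Eq. 8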

Reverse Diffusion Process. The primary objective here is to learn the reverse of the forward diffusion process, with the aim of generating a distribution that closely aligns with the original unaltered data samples $\mathbf{z}_0$. In the context of image editing, $\mathbf{z}_0$ represents the edited image. Practically, this is achieved using a UNet architecture to learn a parameterized version of $p$. Given that the forward diffusion process is approximated by $q(\mathbf{z}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$, the learnable transition is formulated as:

$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \mathcal{N}(\mathbf{z}_{t-1} \mid \mu_\theta(\mathbf{z}_t, \bar{\alpha}_t), \Sigma_\theta(\mathbf{z}_t, \bar{\alpha}_t))$,   (4)

where $\mu_\theta$ and $\Sigma_\theta$ are learnable functions. In addition, for the conditional formulation $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{c})$, which is conditioned on an external variable $\mathbf{c}$ (in image editing, $\mathbf{c}$ can be the source image), the model becomes $\mu_\theta(\mathbf{z}_t, \mathbf{c}, \bar{\alpha}_t)$ and $\Sigma_\theta(\mathbf{z}_t, \mathbf{c}, \bar{\alpha}_t)$.
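
For intuition, one reverse step of Eq. 4 can be sketched as follows, assuming the standard noise-prediction parameterization of $\mu_\theta$ from Ho et al. [16] and the common fixed choice $\Sigma_\theta = \beta_t\mathbf{I}$; model is a hypothetical network predicting $\epsilon$.

    import torch

    def ddpm_sample_step(model, zt, t, betas, alpha_bar):
        # One ancestral step of Eq. 4, with mu_theta expressed through the
        # noise prediction epsilon_theta and variance Sigma_theta = beta_t * I.
        beta_t = betas[t - 1]
        ab_t = alpha_bar[t - 1]
        mean = (zt - beta_t / (1.0 - ab_t).sqrt() * model(zt, t)) / (1.0 - beta_t).sqrt()
        if t == 1:
            return mean                               # no noise is added at the last step
        return mean + beta_t.sqrt() * torch.randn_like(zt)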

Optimization. The optimization strategy for training the reverse diffusion to undo the forward process involves minimizing the Kullback-Leibler (KL) divergence between the joint distributions of the forward and reverse sequences. These are mathematically defined as:

$p_\theta(\mathbf{z}_0, \ldots, \mathbf{z}_T) = p(\mathbf{z}_T) \prod_{t=1}^{T} p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$,   (5)
$q(\mathbf{z}_0, \ldots, \mathbf{z}_T) = q(\mathbf{z}_0) \prod_{t=1}^{T} q(\mathbf{z}_t \mid \mathbf{z}_{t-1})$,   (6)

leading to the minimization of:

$\mathrm{KL}\big(q(\mathbf{z}_0, \ldots, \mathbf{z}_T)\,\|\,p_\theta(\mathbf{z}_0, \ldots, \mathbf{z}_T)\big) \geq \mathbb{E}_{q(\mathbf{z}_0)}\big[-\log p_\theta(\mathbf{z}_0)\big] + c$,   (7)

which is detailed in Ho et al. [16]; the constant $c$ is irrelevant for optimizing $\theta$. The KL divergence in Eq. 7 gives a variational bound on the log-likelihood of the data, $\log p_\theta(\mathbf{z}_0)$. This KL divergence serves as the loss minimized in DDPMs. Practically, Ho et al. [16] adopt a reweighted version of this loss as a simpler denoising loss:

$\mathbb{E}_{t \sim \mathcal{U}(1,T),\, \mathbf{z}_0 \sim q(\mathbf{z}_0),\, \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})}\big[\lambda(t)\,\|\epsilon - \epsilon_\theta(\mathbf{z}_t, t)\|^2\big]$,   (8)

where $\lambda(t) > 0$ denotes a weighting function, $\mathbf{z}_t$ is obtained using Eq. 3, and $\epsilon_\theta$ represents a network designed to predict the noise $\epsilon$ based on $\mathbf{z}_t$ and $t$.
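
A minimal training-step sketch of Eq. 8, assuming $\lambda(t) = 1$ (the simplified objective of [16]); model is again a hypothetical noise-prediction network and z0 a batch of images.

    import torch
    import torch.nn.functional as F

    def ddpm_loss(model, z0, alpha_bar):
        # Simple denoising loss of Eq. 8 with lambda(t) = 1.
        T = alpha_bar.shape[0]
        t = torch.randint(1, T + 1, (z0.shape[0],))   # t ~ U(1, T)
        a = alpha_bar[t - 1].view(-1, 1, 1, 1)        # broadcast over image dims
        eps = torch.randn_like(z0)                    # eps ~ N(0, I)
        zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps   # Eq. 3
        return F.mse_loss(model(zt, t), eps)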

DDIM Sampling and Inversion. When working with a real image $\mathbf{z}_0$, prevalent editing methods [84, 85] initially invert this $\mathbf{z}_0$ into a corresponding $\mathbf{z}_T$ utilizing a specific inversion scheme. Subsequently, the sampling begins from this $\mathbf{z}_T$, employing some editing strategy to produce an edited outcome $\tilde{\mathbf{z}}_0$. In an ideal scenario, direct sampling from $\mathbf{z}_T$, without any edits, should yield a $\tilde{\mathbf{z}}_0$ that closely resembles $\mathbf{z}_0$. A significant deviation of $\tilde{\mathbf{z}}_0$ from $\mathbf{z}_0$, termed reconstruction failure, indicates the inability of the edited image to maintain the integrity of unaltered regions in $\mathbf{z}_0$. Therefore, it is crucial to use an inversion method that ensures $\tilde{\mathbf{z}}_0 \approx \mathbf{z}_0$.

The DDIM sampling equation [18] is:

$\mathbf{z}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_t, t)$,   (9)

which is alternatively expressed as:

$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\frac{\mathbf{z}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_t, t)}{\sqrt{\bar{\alpha}_{t-1}}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, t)$.   (10)

Although Eq. 10 appears to provide an ideal inversion from $\mathbf{z}_{t-1}$ to $\mathbf{z}_t$, the problem arises from the unknown nature of $\mathbf{z}_t$, which is also required as an input to the network $\epsilon_\theta(\mathbf{z}_t, t)$. To address this, DDIM Inversion [18] operates under the assumption that $\mathbf{z}_{t-1} \approx \mathbf{z}_t$ and replaces $\mathbf{z}_t$ on the right-hand side of Eq. 10 with $\mathbf{z}_{t-1}$, leading to the following approximation:

$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\frac{\mathbf{z}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{z}_{t-1}, t)}{\sqrt{\bar{\alpha}_{t-1}}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_{t-1}, t)$.   (11)
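
The sampling step of Eq. 9 and the approximate inversion step of Eq. 11 are mirror images; the only change is which latent the network is evaluated on. A sketch under the same assumptions as above, where alpha_bar[t - 1] stores $\bar{\alpha}_t$ and $\bar{\alpha}_0 = 1$:

    import torch

    def ddim_step(model, zt, t, alpha_bar):
        # Deterministic DDIM sampling (Eq. 9): z_t -> z_{t-1}.
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        eps = model(zt, t)
        z0_pred = (zt - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps

    def ddim_invert_step(model, z_prev, t, alpha_bar):
        # DDIM inversion (Eq. 11): z_{t-1} -> z_t, using the approximation
        # epsilon_theta(z_t, t) ~ epsilon_theta(z_{t-1}, t).
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        eps = model(z_prev, t)
        z0_pred = (z_prev - (1.0 - a_prev).sqrt() * eps) / a_prev.sqrt()
        return a_t.sqrt() * z0_pred + (1.0 - a_t).sqrt() * eps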

Text Condition and Classifier-Free Guidance. Text-conditional diffusion models are designed to synthesize outcomes from random noise $\mathbf{z}_T$ guided by a text prompt $P$. During inference in the sampling process, the noise estimation network $\epsilon_\theta(\mathbf{z}_t, t, C)$ is utilized to predict the noise $\epsilon$, where $C = \psi(P)$ represents the text embedding. This process methodically removes noise from $\mathbf{z}_t$ across $T$ steps until the final result $\mathbf{z}_0$ is obtained.

In the realm of text-conditional image generation, it is vital to ensure substantial textual influence and control over the generated output. To this end, Ho et al. [86] introduce the concept of classifier-free guidance, a technique that amalgamates conditional and unconditional predictions. More specifically, let $\varnothing = \psi(``")$ denote the null text embedding, i.e., the embedding of the empty prompt (in practice, this placeholder is also used for negative prompts to prevent certain attributes from manifesting in the generated image). When combined with a guidance scale $w$, the classifier-free guidance prediction is formalized as:

$\epsilon_\theta(\mathbf{z}_t, t, C, \varnothing) = w\,\epsilon_\theta(\mathbf{z}_t, t, C) + (1-w)\,\epsilon_\theta(\mathbf{z}_t, t, \varnothing)$.   (12)

In this formulation, $\epsilon_\theta(\mathbf{z}_t, t, C, \varnothing)$ replaces $\epsilon_\theta(\mathbf{z}_t, t)$ in the sampling Eq. 9. The value of $w$, typically ranging over $[1, 7.5]$ as suggested in [26, 27], dictates the extent of textual control. A higher $w$ value correlates with a stronger text-driven influence in the generation process.
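
A sketch of the guided prediction in Eq. 12, with text_emb and null_emb standing in for $C$ and $\varnothing$ (the names are illustrative):

    def cfg_noise(model, zt, t, text_emb, null_emb, w=7.5):
        # Classifier-free guidance (Eq. 12): blend conditional and
        # unconditional noise predictions with guidance scale w.
        eps_cond = model(zt, t, text_emb)             # epsilon_theta(z_t, t, C)
        eps_uncond = model(zt, t, null_emb)           # epsilon_theta(z_t, t, null)
        return w * eps_cond + (1.0 - w) * eps_uncond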

2.2 Related Tasks

2.2.1 Conditional Image Generation

While our primary focus is on diffusion models in image editing, it is important to acknowledge related areas like conditional image generation. Unlike image editing, which involves altering certain parts of an existing image, conditional image generation is about creating new images from scratch, guided by specified conditions. Early efforts [32, 31, 87, 88, 89, 90] typically involve class-conditioned image generation, which usually incorporates the class-induced gradients via an additional pretrained classifier during sampling. However, Ho et al. [86] introduce classifier-free guidance, which does not rely on an external classifier and allows for more versatile conditions, e.g., text, as guidance.

Text-to-Image (T2I) Generation. GLIDE [34] is the first work that uses text to guide image generation directly at the high-dimensional pixel level, replacing the label in class-conditioned diffusion models. Similarly, Imagen [27] uses a cascaded framework to generate high-resolution images more efficiently in pixel space. A different line of research first projects the image into a lower-dimensional space and then applies diffusion models in this latent space. Representative works include Stable Diffusion (SD) [26], VQ-Diffusion [91], and DALL-E 2 [25]. Following these pioneering studies, a large number of works [37, 92, 93, 94, 95, 96, 97] have been proposed, advancing this field over the past two years.

Additional Conditions. Beyond text, more specific conditions are also used to achieve higher fidelity and more precise control in image synthesis. GLIGEN [98] inserts a gated self-attention layer between the original self-attention and cross-attention layers in each block of a pretrained T2I diffusion model for generating images conditioned on grounding boxes. Make-A-Scene [99] and SpaText [100] use segmentation masks to guide the generation process. In addition to segmentation maps, ControlNet [101] can also incorporate other types of input, such as depth maps, normal maps, canny edges, pose, and sketches as conditions. Other methods like UniControlNet [102], UniControl [103], Composer [104], and T2I-Adapter [105] integrate diverse conditional inputs and include additional layers, enhancing the generative process controlled by these conditions.

Customized Image Generation. Closely related to image editing within conditional image generation is the task of creating personalized images. This task focuses on generating images that maintain a certain identity, typically guided by a few reference images of the same subject. Two early approaches addressing this customized generation through few-shot images are Textual Inversion [106] and DreamBooth [107]. Specifically, Textual Inversion learns a unique identifier word to represent a new subject and incorporates this word into the text encoder's dictionary. DreamBooth, on the other hand, binds a new, rare word with a specific subject by finetuning the entire Imagen [27] model with a few reference images. To combine multiple new concepts effectively, CustomDiffusion [108] only optimizes the cross-attention parameters in Stable Diffusion [26], representing new concepts and conducting joint training for the multiple-concept combination. Following these foundational works, subsequent methods [109, 110, 111, 112, 113, 114, 115, 116, 117] provide more refined control over the generated images, enhancing the precision and accuracy of the output.

2.2.2 Image Restoration and Enhancement

Image restoration (IR) stands as a pivotal task in low-level vision, aiming to enhance the quality of images contaminated by diverse degradations. Recent progress in diffusion models prompts researchers to investigate their potential for image restoration. Pioneering efforts integrate diffusion models into this task, surpassing previous GAN-based methods.

Input Image as a Condition. Generative models have significantly contributed to diverse image restoration tasks, such as super-resolution (SR) and deblurring [118, 29, 12, 13, 119]. Super-Resolution via Repeated Refinement (SR3) [57] utilizes DDPM for conditional image generation through a stochastic, iterative denoising process. Cascaded Diffusion Model [31] adopts multiple diffusion models in sequence, each generating images of increasing resolution. SRDiff [118] is a close realization of SR3; the main distinction between them is that SR3 predicts the target image directly, whereas SRDiff predicts the difference between the input and output images.

Restoration in Non-Spatial Spaces. Some diffusion model-based IR methods operate in other spaces. For example, Refusion [120, 63] employs a mean-reverting Image Restoration (IR)-SDE to transform target images into their degraded counterparts, leveraging an autoencoder to compress the input image into its latent representation, with skip connections for accessing multi-scale details. Chen et al. [121] take a similar approach, proposing a two-stage strategy called the Hierarchical Integration Diffusion Model. The conversion from the spatial to the wavelet domain is lossless and offers significant advantages. For instance, WaveDM [67] modifies the low-frequency band, whereas WSGM [122] and ResDiff [60] condition the high-frequency bands relative to the low-resolution image. BDCE [123] designs a bootstrap diffusion model in the depth curve space for high-resolution low-light image enhancement.

T2I Prior Usage. The incorporation of T2I information proves advantageous as it allows the usage of pretrained T2I models. These models can be finetuned by adding specific layers or encoders tailored to the IR task. Wang et al. put this concept into practice with StableSR [124]. Central to StableSR is a time-aware encoder trained in tandem with a frozen Stable Diffusion model [26]. This setup seamlessly integrates trainable spatial feature transform layers, enabling conditioning based on the input image. DiffBIR [125] uses pretrained T2I diffusion models for blind image restoration, with a two-stage pipeline and a controllable module. CoSeR [126] introduces Cognitive Super-Resolution, merging image appearance and language understanding. SUPIR [127] leverages generative priors, model scaling, and a dataset of 20 million images for advanced restoration guided by textual prompts, featuring negative-quality prompts and a restoration-guided sampling method.

Projection-Based Methods. These methods aim to extract inherent structures or textures from input images to complement the generated images at each step and to ensure data consistency. ILVR [65] projects the low-frequency information from the input image onto the output image, ensuring data consistency and establishing an improved condition. To further enhance data consistency, some recent works [71, 70, 128] take a different approach, aiming to estimate the posterior distribution via the Bayes theorem. An ILVR-style projection is sketched below.
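
As an illustration of the projection idea, the per-step refinement can be sketched as replacing the low-frequency band of the current sample with that of a correspondingly noised reference; phi below is a hypothetical low-pass operator (e.g., downsample then upsample), so this is a sketch of the mechanism rather than the exact algorithm of [65].

    def ilvr_project(z, y_noised, phi):
        # Keep the sample's high frequencies but force its low frequencies
        # to match the noised reference image.
        return z - phi(z) + phi(y_noised)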

Decomposition-Based Methods. These methods view IR tasks as a linear inverse problem. Denoising Diffusion Restoration Models (DDRM) [66] utilizes a pretrained denoising diffusion generative model to address linear inverse problems, showcasing versatility across super-resolution, deblurring, inpainting, and colorization under varying levels of measurement noise. Denoising Diffusion Null-space Model (DDNM) [68] represents another decomposition-based zero-shot approach applicable to a broad range of linear IR problems beyond image SR, such as colorization, inpainting, and deblurring. It leverages the range-null space decomposition methodology [129, 130] to tackle diverse IR challenges effectively.
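
The range-null space idea admits a compact statement: for a linear degradation $\mathbf{y} = \mathbf{A}\mathbf{x}$, any estimate decomposes as $\mathbf{x} = \mathbf{A}^\dagger\mathbf{A}\mathbf{x} + (\mathbf{I} - \mathbf{A}^\dagger\mathbf{A})\mathbf{x}$, where the range-space term is pinned down by the measurement and only the null-space term is left for the generative prior to fill in. A toy sketch, assuming a known degradation matrix A acting on a flattened image:

    import torch

    def range_null_split(x, A):
        # Decompose x into a range-space part (determined by A) and a
        # null-space part (free for the diffusion prior to choose).
        A_pinv = torch.linalg.pinv(A)
        x_range = A_pinv @ (A @ x)
        x_null = x - x_range
        return x_range, x_null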

TABLE I: Comprehensive categorization of diffusion model-based image editing methods from multiple perspectives. The methods are color-rendered according to training, testing-time finetuning, and training & finetuning free. Input conditions include text, class, reference (Ref.) image, segmentation (Seg.) map, pose, mask, layout, sketch, dragging points, and audio. Task capabilities—semantic, stylistic, and structural—are marked with checkmarks (✓) according to the experimental results provided in the source papers.
Editing task columns in the source table: Semantic Editing (Obj. Add., Obj. Remo., Obj. Repl., Bg. Chg., Emo. Expr. Mod.); Stylistic Editing (Color Chg., Style Chg., Text. Chg.); Structural Editing (Obj. Move., Obj. Size Chg., Obj. Act. Chg., Persp./View. Chg.).

Method | Condition(s)
DiffusionCLIP [131] | Text, Class
Asyrp [132] | Text, Class
EffDiff [133] | Class
DiffStyler [134] | Text, Class
StyleDiffusion [135] | Ref. Image, Class
UNIT-DDPM [136] | Class
CycleNet [137] | Class
Diffusion Autoencoders [138] | Class
HDAE [139] | Class
EGSDE [140] | Class
Pixel-Guided Diffusion [141] | Seg. Map, Class
PbE [142] | Ref. Image
RIC [143] | Ref. Image, Sketch
ObjectStitch [144] | Ref. Image
PhD [145] | Ref. Image, Layout
DreamInpainter [146] | Ref. Image, Mask, Text
Anydoor [147] | Ref. Image, Mask
FADING [148] | Text
PAIR Diffusion [149] | Ref. Image, Text
SmartBrush [150] | Text, Mask
IIR-Net [151] | Text
PowerPaint [152] | Text, Mask
Imagen Editor [153] | Text, Mask
SmartMask [154] | Text
Uni-paint [155] | Text, Mask, Ref. Image
InstructPix2Pix [156] | Text
MoEController [157] | Text
FoI [158] | Text
LOFIE [159] | Text
InstructDiffusion [160] | Text
Emu Edit [161] | Text
DialogPaint [162] | Text
Inst-Inpaint [163] | Text
HIVE [164] | Text
ImageBrush [165] | Ref. Image, Seg. Map, Pose
InstructAny2Pix [166] | Ref. Image, Audio, Text
MGIE [167] | Text
SmartEdit [168] | Text
iEdit [169] | Text
TDIELR [170] | Text
ChatFace [171] | Text
UniTune [172] | Text
Custom-Edit [173] | Text, Ref. Image
KV-Inversion [174] | Text
Null-Text Inversion [175] | Text
DPL [176] | Text
DiffusionDisentanglement [177] | Text
Prompt Tuning Inversion [178] | Text
StyleDiffusion [179] | Text
InST [180] | Text
DragonDiffusion [181] | Dragging Points
DragDiffusion [182] | Dragging Points
DDS [183] | Text
DiffuseIT [184] | Text
CDS [185] | Text
MagicRemover [186] | Text
Imagic [187] | Text
LayerDiffusion [188] | Mask, Text
Forgedit [189] | Text
SINE [190] | Text
PRedItOR [191] | Text
ReDiffuser [192] | Text
Captioning and Injection [193] | Text
InstructEdit [194] | Text
Direct Inversion [195] | Text
DDPM Inversion [196] | Text
SDE-Drag [197] | Layout
LEDITS++ [198] | Text
FEC [199] | Text
EMILIE [200] | Text
Negative Inversion [201] | Text
ProxEdit [202] | Text
Null-Text Guidance [203] | Ref. Image, Text
EDICT [204] | Text
AIDI [205] | Text
CycleDiffusion [206] | Text
InjectFusion [207] | Ref. Image
Fixed-point inversion [208] | Text
TIC [209] | Text, Mask
Diffusion Brush [210] | Text, Mask
Self-guidance [211] | Layout
P2P [212] | Text
Pix2Pix-Zero [213] | Text