Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
Abstract
We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at sliders.baulab.info
1 Introduction
Artistic users of text-to-image diffusion models [4, 37, 36, 9, 19] often need finer control over the visual attributes and concepts expressed in a generated image than currently possible. Using only text prompts, it can be challenging to precisely modulate continuous attributes such as a person’s age or the intensity of the weather, and this limitation hinders creators’ ability to adjust images to match their vision [43]. In this paper, we address these needs by introducing interpretable Concept Sliders that allow nuanced editing of concepts within diffusion models. Our method empowers creators with high-fidelity control over the generative process as well as image editing. Our code and trained sliders will be open sourced.
Concept Sliders solve several problems that are not well-addressed by previous methods. Direct prompt modification can control many image attributes, but changing the prompt often drastically alters overall image structure due to the sensitivity of outputs to the prompt-seed combination [38, 44, 22]. Post-hoc techniques such as PromptToPrompt [13] and Pix2Video [3] enable editing visual concepts in an image by inverting the diffusion process and modifying cross-attentions. However, those methods require separate inference passes for each new concept and can support only a limited set of simultaneous edits. They require engineering a prompt suitable for an individual image rather than learning a simple generalizable control, and if not carefully prompted, they can introduce entanglement between concepts, such as altering race when modifying age (see Appendix). In contrast, Concept Sliders provide lightweight plug-and-play adaptors applied to pre-trained models that enable precise, continuous control over desired concepts in a single inference pass, with efficient composition (Figure 6) and minimal entanglement (Figure 11).
Each Concept Slider is a low-rank modification of the diffusion model. We find that the low-rank constraint is a vital aspect of precision control over concepts: while finetuning without low-rank regularization reduces precision and generative image quality, low-rank training identifies the minimal concept subspace and results in controlled, high-quality, disentangled editing (Figure 11). Post-hoc image editing methods that act on single images rather than model parameters cannot benefit from this low-rank framework.
Concept Sliders also allow editing of visual concepts that cannot be captured by textual descriptions; this distinguishes it from prior concept editing methods that rely on text [7, 8]. While image-based model customization methods [25, 38, 6] can add new tokens for new image-based concepts, those are difficult to use for image editing. In contrast, Concept Sliders allow an artist to provide a handful of paired images to define a desired concept; the Concept Slider then generalizes the visual concept and applies it to other images, even in cases where it would be infeasible to describe the transformation in words.
Other generative image models, such as GANs, have previously exhibited latent spaces that provide highly disentangled control over generated outputs. In particular, it has been observed that StyleGAN [20] stylespace neurons offer detailed control over many meaningful aspects of images that would be difficult to describe in words [45]. To further demonstrate the capabilities of our approach, we show that it is possible to create Concept Sliders that transfer latent directions from StyleGAN’s style space trained on FFHQ face images [20] into diffusion models. Notably, despite originating from a face dataset, our method successfully adapts these latents to enable nuanced style control over diverse image generation. This showcases how diffusion models can capture the complex visual concepts represented in GAN latents, even those that may not correspond to any textual description.
We demonstrate that the expressiveness of Concept Sliders is powerful enough to address two particularly practical applications—enhancing realism and fixing hand distortions. While generative models have made significant progress in realistic image synthesis, the latest generation of diffusion models such as Stable Diffusion XL [36] are still prone to synthesizing distorted hands with anatomically implausible extra or missing fingers [31], as well as warped faces, floating objects, and distorted perspectives. Through a perceptual user study, we validate that a Concept Slider for “realistic image” as well as another for “fixed hands” both create a statistically significant improvement in perceived realism without altering image content.
Concept Sliders are modular and composable. We find that over 50 unique sliders can be composed without degrading output quality. This versatility gives artists a new universe of nuanced image control that allows them to blend countless textual, visual, and GAN-defined Concept Sliders. Because our method bypasses standard prompt token limits, it empowers more complex editing than achievable through text alone.
2 Related Works
Image Editing
Recent methods propose different approaches for single image editing in text-to-image diffusion models. They mainly focus on manipulation of cross-attentions of a source image and a target prompt [13, 22, 35], or use a conditional input to guide the image structure [30].
Unlike those methods that are applied to a single image, our model creates a semantic change defined by a small set of text pairs or image pairs, applied to the entire model. Analyzing diffusion models through Riemannian geometry, Park et al. [33] discovered local latent bases that enable semantic editing by traversing the latent space. Their analysis also revealed the evolving geometric structure over timesteps across prompts, requiring per-image latent basis optimization. In contrast, we identify generalizable parameter directions, without needing custom optimization for each image. Instruct-pix2pix [1] finetunes a diffusion model to condition image generation on both an input image and text prompt. This enables a wide range of text-guided editing, but lacks fine-grained control over edit strength or visual concepts not easily described textually.
Guidance Based Methods
Ho et al. [14] introduce classifier-free guidance, which improves image quality and text-image alignment by driving the data distribution towards the prompt and away from the unconditional output. Liu et al. [28] present an inference-time guidance formulation to enhance concept composition and negation in diffusion models. By adding guidance terms during inference, their method improves on the limited inherent compositionality of diffusion models. SLD [40] proposes using guidance to moderate unsafe concepts in diffusion models. They propose a safe prompt which is used to guide the output away from unsafe content during inference.
Model Editing
Our method can be seen as a model editing approach, where by applying a low-rank adaptor, we single out a semantic attribute and allow for continuous control with respect to the attribute. To personalize the models for adding new concepts, customization methods based on finetuning exist [38, 25, 6]. Custom Diffusion [25] proposes a way to incorporate new visual concepts into pretrained diffusion models by finetuning only the cross-attention layers. On the other hand, Textual Inversion [6] introduces new textual concepts by optimizing an embedding vector to activate desired model capabilities.
Previous works [7, 24, 23, 12, 46] proposed gradient-based fine-tuning methods for the permanent erasure of a concept in a model. Ryu et al. [39] proposed adapting LoRA [16] for diffusion model customization. Recent work [47] developed low-rank implementations of erasing concepts [7], allowing the strength of erasure in an image to be adjusted. [17] implemented image-based control of concepts by merging two overfitted LoRAs to capture an edit direction. Similarly, [8, 32] proposed closed-form solutions for debiasing, redacting, or moderating concepts within the model's cross-attention weights. Our method does not modify the underlying text-to-image diffusion model and can be applied as a plug-and-play module easily stacked across different attributes.
Semantic Directions in Generative Models
In Generative Adversarial Networks (GANs), manipulation of semantic attributes has been widely studied. Latent space trajectories have been found in a self-supervised manner [18]. PCA has been used to identify semantic directions in the latent or feature spaces [11]. Latent subspaces corresponding to detailed face attributes have been analyzed [42]. For diffusion models, semantic latent spaces have been suggested to exist in the middle layers of the U-Net architecture [26, 34]. It has been shown that principal directions in diffusion model latent spaces (h-spaces) capture global semantics [10]. Our method directly trains low-rank subspaces corresponding to semantic attributes. By optimizing for specific global directions using text or image pairs as supervision, we obtain precise and localized editing directions. Recent work [49] introduced a low-rank representation adapter, which employs a contrastive loss to fine-tune LoRA to achieve fine-grained control of concepts in language models.
3 Background
3.1 Diffusion Models
Diffusion models are a subclass of generative models that operationalize the concept of reversing a diffusion process to synthesize data. Initially, the forward diffusion process gradually adds noise to the data, transitioning it from an organized state $x_0$ to complete Gaussian noise $x_T$. At any timestep $t$, the noised image $x_t$ is modelled as:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \qquad (1)$$
Where $\epsilon$ is a randomly sampled Gaussian noise with zero mean and unit variance, and $\bar{\alpha}_t$ denotes the cumulative noise schedule at timestep $t$. Diffusion models aim to reverse this diffusion process by sampling a random Gaussian noise $x_T$ and gradually denoising it to generate an image $x_0$. In practice [15, 29], the objective of the diffusion model is simplified to predicting the true noise $\epsilon$ from Eq. 1 when $x_t$ is fed as input along with additional inputs like the timestep $t$ and conditioning $c$.
$$\mathbb{E}_{x_t, c, t, \epsilon}\left[\, \big\| \epsilon - \epsilon_\theta(x_t, c, t) \big\|_2^2 \,\right] \qquad (2)$$
Where $\epsilon_\theta(x_t, c, t)$ is the noise predicted by the diffusion model conditioned on $c$ at timestep $t$. In this work, we work with Stable Diffusion [37] and Stable Diffusion XL [36], which are latent diffusion models that improve efficiency by operating in the lower-dimensional latent space of a pre-trained variational autoencoder. They convert images to a latent space and run the diffusion training as discussed above. Finally, they decode the latent through the VAE decoder to get the final image.
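For concreteness, here is a minimal PyTorch sketch of the forward noising step in Eq. 1 and the simplified objective in Eq. 2; the `unet` callable stands in for the noise predictor $\epsilon_\theta$ and its signature is an assumption, not the actual Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def noise_latents(x0, t, alphas_cumprod):
    """Forward process of Eq. 1: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps

def diffusion_loss(unet, x0, cond, t, alphas_cumprod):
    """Simplified training objective of Eq. 2: predict the noise that was added."""
    xt, eps = noise_latents(x0, t, alphas_cumprod)
    eps_pred = unet(xt, t, cond)   # epsilon_theta(x_t, c, t); call signature is illustrative
    return F.mse_loss(eps_pred, eps)
```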
3.2 Low-Rank Adaptors
The Low-Rank Adaptation (LoRA) [16] method enables efficient adaptation of large pre-trained language models to downstream tasks by decomposing the weight update $\Delta W$ during fine-tuning. Given a pre-trained model layer with weights $W_0 \in \mathbb{R}^{d \times k}$, where $d$ is the input dimension and $k$ the output dimension, LoRA decomposes $\Delta W$ as
$$\Delta W = B A \qquad (3)$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$ being a small rank that constrains the update to a low-dimensional subspace. By freezing $W_0$ and only optimizing the smaller matrices $B$ and $A$, LoRA achieves massive reductions in trainable parameters.
During inference, $\Delta W$ can be merged into $W_0$ with no overhead, scaled by a LoRA scaling factor $\alpha$:
$$W = W_0 + \alpha\, \Delta W \qquad (4)$$
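As a concrete illustration of Eqs. 3-4, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer with an adjustable scale; the class and attribute names are our own and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a scaled low-rank update alpha * (B A), as in Eqs. 3-4."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W0 stays frozen
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d) * 0.01)   # A: r x d (PyTorch weight convention)
        self.B = nn.Parameter(torch.zeros(k, rank))          # B: k x r, zero-init so the update starts at 0
        self.alpha = 1.0                                     # LoRA scale, adjustable at inference

    def forward(self, x):
        delta = (x @ self.A.T) @ self.B.T                    # input mapped through the rank-r update
        return self.base(x) + self.alpha * delta
```

Setting `alpha = 0` recovers the pre-trained model exactly; scaling it up or down at inference time is what later serves as the slider knob.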
4 Method
Concept Sliders are a method for fine-tuning LoRA adaptors on a diffusion model to enable concept-targeted image control as shown in Figure 2. Our method learns low-rank parameter directions that increase or decrease the expression of specific attributes when conditioned on a target concept. Given a target concept $c_t$ and model $\theta$, our goal is to obtain $\theta^*$ that modifies the likelihood of attributes $c_+$ and $c_-$ in image $X$ when conditioned on $c_t$: increasing the likelihood of attribute $c_+$ and decreasing the likelihood of attribute $c_-$.
$$P_{\theta^*}(X \mid c_t) \leftarrow P_{\theta}(X \mid c_t) \left( \frac{P_{\theta}(c_+ \mid X)}{P_{\theta}(c_- \mid X)} \right)^{\eta} \qquad (5)$$
Where $P_{\theta}(X \mid c_t)$ represents the distribution generated by the original model when conditioned on $c_t$. Expanding $P(c_+ \mid X) / P(c_- \mid X)$, the gradient of the log probability $\nabla \log P_{\theta^*}(X \mid c_t)$ would be proportional to:
$$\nabla \log P_{\theta}(X \mid c_t) + \eta \left( \nabla \log P_{\theta}(X \mid c_+) - \nabla \log P_{\theta}(X \mid c_-) \right) \qquad (6)$$
Based on Tweedie’s formula [5] and the reparametrization trick of [15], we can introduce a time-varying noising process and express each score (gradient of log probability) as a denoising prediction $\epsilon(X, c, t)$. Thus Eq. 6 becomes:
$$\epsilon_{\theta^*}(X, c_t, t) \leftarrow \epsilon_{\theta}(X, c_t, t) + \eta \left( \epsilon_{\theta}(X, c_+, t) - \epsilon_{\theta}(X, c_-, t) \right) \qquad (7)$$
The proposed score function in Eq. 7 shifts the distribution of the target concept $c_t$ to exhibit more attributes of $c_+$ and fewer attributes of $c_-$. In practice, we notice that a single prompt pair can sometimes identify a direction that is entangled with other undesired attributes. We therefore incorporate a set of preservation concepts $p \in \mathcal{P}$ (for example, race names while editing age) to constrain the optimization. Instead of simply increasing $P_{\theta}(c_+ \mid X)$, we aim to increase, for every $p$, $P_{\theta}((c_+, p) \mid X)$, and reduce $P_{\theta}((c_-, p) \mid X)$. This leads to the disentanglement objective:
$$\epsilon_{\theta^*}(X, c_t, t) \leftarrow \epsilon_{\theta}(X, c_t, t) + \eta \sum_{p \in \mathcal{P}} \left( \epsilon_{\theta}(X, (c_+, p), t) - \epsilon_{\theta}(X, (c_-, p), t) \right) \qquad (8)$$
The disentanglement objective in Equation 8 finetunes the Concept Slider modules while keeping pre-trained weights fixed. Crucially, the LoRA formulation in Equation 4 introduces a scaling factor $\alpha$ that can be modified at inference time. This scaling parameter $\alpha$ allows adjusting the strength of the edit, as shown in Figure 1. Increasing $\alpha$ makes the edit stronger without retraining the model. The previous model editing method [7] suggests obtaining a stronger edit by retraining with increased guidance $\eta$ in Eq. 8. However, simply scaling $\alpha$ at inference time produces the same effect of strengthening the edit, without costly retraining.
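To make the training procedure concrete, below is a hedged sketch of how the objective in Eq. 8 can be realized as a regression loss: the frozen model produces a guided target and only the LoRA parameters are updated to match it. The names `frozen_unet`, `lora_unet`, and `guidance_pairs` are placeholders, not the released code.

```python
import torch
import torch.nn.functional as F

def slider_loss(frozen_unet, lora_unet, xt, t, c_target, guidance_pairs, eta=1.0):
    """Eq. 8 as a regression target: shift the frozen prediction for c_t toward (c_+, p)
    and away from (c_-, p) for every preservation concept p, then fit the adapted model to it.
    guidance_pairs holds pre-computed embeddings of the prompts "(c_+, p)" and "(c_-, p)"."""
    with torch.no_grad():
        target = frozen_unet(xt, t, c_target)
        for c_pos_p, c_neg_p in guidance_pairs:            # one pair per preservation concept p
            target = target + eta * (frozen_unet(xt, t, c_pos_p) - frozen_unet(xt, t, c_neg_p))
    pred = lora_unet(xt, t, c_target)                      # only the LoRA weights receive gradients
    return F.mse_loss(pred, target)
```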
4.1 Learning Visual Concepts from Image Pairs
We propose sliders to control nuanced visual concepts that are harder to specify using text prompts. We leverage small paired before/after image datasets to train sliders for these concepts. The sliders learn to capture the visual concept through the contrast between image pairs ($x^A$, $x^B$).
Our training process optimizes the LoRA applied in both the negative and positive directions. We shall write $\epsilon_{\theta_+}$ for the application of the positive LoRA and $\epsilon_{\theta_-}$ for the negative case. Then we minimize the following loss:
$$\left\| \epsilon_{\theta_-}(x^A_t, t) - \epsilon^A \right\|^2 + \left\| \epsilon_{\theta_+}(x^B_t, t) - \epsilon^B \right\|^2 \qquad (9)$$
This has the effect of causing the LoRA to align to a direction that causes the visual effect of A in the negative direction and B in the positive direction. Defining directions visually in this way not only allows an artist to define a Concept Slider through custom artwork; it is also the same method we use to transfer latents from other generative models such as StyleGAN.
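A matching sketch of the paired-image loss in Eq. 9, reusing the `noise_latents` helper from the Section 3 sketch; the `set_lora_scale` toggle is an assumed interface for flipping the adaptor between its negative and positive directions.

```python
import torch.nn.functional as F

def visual_slider_loss(lora_unet, xA0, xB0, t, alphas_cumprod, empty_cond):
    """Eq. 9: at scale -1 the adaptor should denoise toward the 'before' image A,
    at scale +1 toward the 'after' image B."""
    xAt, epsA = noise_latents(xA0, t, alphas_cumprod)   # noise_latents from the earlier sketch
    xBt, epsB = noise_latents(xB0, t, alphas_cumprod)

    lora_unet.set_lora_scale(-1.0)                      # assumed helper on the adapted model
    loss_A = F.mse_loss(lora_unet(xAt, t, empty_cond), epsA)

    lora_unet.set_lora_scale(+1.0)
    loss_B = F.mse_loss(lora_unet(xBt, t, empty_cond), epsB)
    return loss_A + loss_B
```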
5 Experiments
We evaluate our approach primarily on Stable Diffusion XL [36], a high-resolution 1024-pixel model, and we conduct additional experiments on SD v1.4 [37]. All models are trained for 500 epochs. We demonstrate generalization by testing sliders on diverse prompts - for example, we evaluate our "person" slider on prompts like "doctor", "man", "woman", and "barista". For inference, we follow the SDEdit technique of Meng et al. [30]: to maintain structure and semantics, we use the original pre-trained model for the first $t$ steps, setting the LoRA adaptor multipliers to 0 and retaining the pre-trained model priors. We then turn on the LoRA adaptor for the remaining steps.
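A sketch of this inference schedule, assuming a diffusers-style scheduler and a slider object that exposes its LoRA multiplier; the names and call signatures are illustrative.

```python
import torch

@torch.no_grad()
def generate_with_slider(unet, scheduler, slider, latents, cond, scale=2.0, switch_step=10):
    """Keep the slider multiplier at 0 for the first `switch_step` denoising steps to preserve
    structure and semantics, then enable it at the requested scale for the remaining steps."""
    for i, t in enumerate(scheduler.timesteps):
        slider.set_scale(0.0 if i < switch_step else scale)   # assumed toggle on the LoRA adaptor
        noise_pred = unet(latents, t, cond)                   # epsilon prediction (signature simplified)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```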
5.1 Textual Concept Sliders
We validate the efficacy of our slider method on a diverse set of 30 text-based concepts, with full examples in the Appendix. Table 1 compares our method against two baselines: an approach we propose inspired by SDEdit [30] and Liu et al. [28] that uses a pretrained model with the standard prompt for $t$ timesteps, then starts composing by adding prompts to steer the image, and Prompt-to-Prompt [13], which leverages cross-attention for image editing after generating reference images. While the former baseline is novel, all three enable finer control but differ in how edits are applied. Our method directly generates 2500 edited images per concept, like "image of a person", by setting the scale parameter $\alpha$ at inference. In contrast, the baselines require additional inference passes for each new concept (e.g., "old person"), adding computational overhead. Our method consistently achieves higher CLIP scores and lower LPIPS versus the original, indicating greater coherence while enabling precise control. The baselines are also more prone to entanglement between concepts. We provide further analysis and details about the baselines in the Appendix.
Figure 3 shows typical qualitative examples, which maintain good image structure while enabling fine-grained editing of the specified concept.
| | Prompt2Prompt CLIP | Prompt2Prompt LPIPS | Our Method CLIP | Our Method LPIPS | Composition CLIP | Composition LPIPS |
|---|---|---|---|---|---|---|
| Age | 1.10 | 0.15 | 3.93 | 0.06 | 3.14 | 0.13 |
| Hair | 3.45 | 0.15 | 5.59 | 0.10 | 5.14 | 0.15 |
| Sky | 0.43 | 0.15 | 1.56 | 0.13 | 1.55 | 0.14 |
| Rusty | 7.67 | 0.25 | 7.60 | 0.09 | 6.67 | 0.18 |
5.2 Visual Concept Sliders
Some visual concepts like precise eyebrow shapes or eye sizes are challenging to control through text prompts alone. To enable sliders for these granular attributes, we leverage paired image datasets combined with optional text guidance. As shown in Figure 4, we create sliders for "eyebrow shape" and "eye size" using image pairs capturing the desired transformations. We can further refine the eyebrow slider by providing the text "eyebrows" so the direction focuses on that facial region. Using image pairs with different scales, like the eye sizes from Ostris [2], we can create sliders with stepwise control over the target attribute.
We quantitatively evaluate the eye size slider by detecting faces using FaceNet [41], cropping the area, and employing a face parser [48] to measure the eye region across the slider range. Traversing the slider smoothly increases the average eye area 2.75x, enabling precise control as shown in Table 2. Compared to customization techniques like textual inversion [6], which learns a new token, and custom diffusion [25], which fine-tunes cross-attentions, our slider provides more targeted editing without unwanted changes. When model editing methods [25, 6] are used to incorporate new visual concepts, they memorize the training subjects rather than generalizing the contrast between pairs. We provide more details in the Appendix.
| | Training Data | Custom Diffusion | Textual Inversion | Our Method |
|---|---|---|---|---|
| Eye size change | 1.84 | 0.97 | 0.81 | 1.75 |
| LPIPS | 0.03 | 0.23 | 0.21 | 0.06 |
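A hedged outline of this measurement pipeline; `detect_face` and `parse_face` are placeholders for a FaceNet-style detector and an off-the-shelf face parser, and the eye label ids are assumptions about the parser's label map.

```python
import numpy as np

EYE_LABEL_IDS = (4, 5)   # assumed ids of the left/right eye classes in the face-parsing label map

def eye_area_fraction(image, detect_face, parse_face):
    """Crop the detected face, segment it, and return the fraction of pixels labeled as eyes."""
    x0, y0, x1, y1 = detect_face(image)     # placeholder detector returning one bounding box
    face_crop = image[y0:y1, x0:x1]
    labels = parse_face(face_crop)          # placeholder parser returning a per-pixel label map
    return float(np.isin(labels, EYE_LABEL_IDS).mean())

# Averaging eye_area_fraction over generations at each slider scale gives the kind of
# eye-size measurement summarized in Table 2.
```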
5.3 Sliders transferred from StyleGAN
Figure 5 demonstrates sliders transferred from the StyleGAN-v3 [21] style space trained on the FFHQ [20] dataset. We use the method of [45] to explore the StyleGAN-v3 style space and identify neurons that control hard-to-describe facial features. By scaling these neurons, we collect images to train image-based sliders. We find that Stable Diffusion’s latent space can effectively learn these StyleGAN style neurons, enabling structured facial editing. This lets users control nuanced concepts that are difficult to describe in words, and StyleGAN makes it easy to generate the paired dataset.
5.4 Composing Sliders
A key advantage of our low-rank slider directions is composability: users can combine multiple sliders for nuanced control rather than being limited to one concept at a time. For example, in Figure 6 we show blending "cooked" and "fine dining" food sliders to traverse this 2D concept space. Since our sliders are lightweight LoRA adaptors, they are easy to share and overlay on diffusion models. By downloading interesting slider sets, users can adjust multiple knobs simultaneously to steer complex generations. In Figure 7 we qualitatively show the effects of composing multiple sliders progressively, up to 50 sliders at a time. Creating these 50 sliders requires far more than 77 tokens (the current context limit of SDXL [36]). This showcases the power of our method, which allows control beyond what is possible through prompt-based methods alone. We further validate multi-slider composition in the appendix.
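A minimal sketch of how several sliders can be overlaid on one frozen layer by summing their independently scaled low-rank updates; the container class is illustrative, not the released code.

```python
import torch
import torch.nn as nn

class ComposedSliders(nn.Module):
    """A frozen base linear layer plus any number of independently scaled low-rank sliders."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        self.sliders = []                           # entries of (A, B, scale), kept as plain tensors

    def add_slider(self, A: torch.Tensor, B: torch.Tensor, scale: float):
        self.sliders.append((A, B, scale))          # A: r x d, B: k x r

    def forward(self, x):
        out = self.base(x)
        for A, B, scale in self.sliders:
            out = out + scale * ((x @ A.T) @ B.T)   # each slider contributes its own scaled update
        return out
```

Because each update is low-rank and additive, changing one slider's scale does not require touching the others, which is what makes overlaying dozens of sliders practical.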
6 Concept Sliders to Improve Image Quality
One of the most interesting aspects of a large-scale generative model such as Stable Diffusion XL is that, although its image output can often suffer from distortions such as warped or blurry objects, the parameters of the model contain a latent capability to generate higher-quality output with fewer distortions than produced by default. Concept Sliders can unlock these abilities by identifying low-rank parameter directions that repair common distortions.
Fixing Hands
Generating realistic-looking hands is a persistent challenge for diffusion models: for example, hands are typically generated with missing, extra, or misplaced fingers. Yet the tendency to distort hands can be directly controlled by a Concept Slider: Figure 9 shows the effect of a "fix hands" Concept Slider that lets users smoothly adjust images to have more realistic, properly proportioned hands. This parameter direction is found using a complex prompt pair boosting “realistic hands, five fingers, 8k hyper-realistic hands” and suppressing “poorly drawn hands, distorted hands, misplaced fingers”. This slider allows hand quality to be improved with a simple tweak rather than manual prompt engineering.
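As an illustration, such a slider can be specified entirely by a prompt pair plus a handful of training hyperparameters; the field names and values below are illustrative, not the released training config format.

```python
# Illustrative specification of the "fix hands" slider described above (not the official config schema).
fix_hands_slider = {
    "target": "hands",                                                  # concept the slider is conditioned on
    "positive": "realistic hands, five fingers, 8k hyper-realistic hands",
    "negative": "poorly drawn hands, distorted hands, misplaced fingers",
    "preserve": [],                                                     # optional preservation concepts (Eq. 8)
    "rank": 4,                                                          # LoRA rank (illustrative value)
    "eta": 4.0,                                                         # guidance strength in Eq. 8 (illustrative value)
}
```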
To measure the “fix hands" slider, we conduct a user study on Amazon Mechanical Turk. We present 300 random images with hands to raters—half generated by Stable Diffusion XL and half by XL with our slider applied (same seeds and prompts). Raters are asked to assess if the hands appear distorted or not. Across 150 SDXL images, raters find 62% have distorted hands, confirming it as a prevalent problem. In contrast, only 22% of the 150 slider images are rated as having distorted hands.
Repair Slider
In addition to controlling specific concepts like hands, we also demonstrate the use of Concept Sliders to guide generations towards overall greater realism. We identify a single low-rank parameter direction that shifts images away from common quality issues like distorted subjects, unnatural object placement, and inconsistent shapes. As shown in Figures 8 and 10, traversing this “repair” slider noticeably fixes many errors and imperfections.
Through a perceptual study, we evaluate the realism of 250 pairs of slider-adjusted and original SD images. A majority of participants rate the slider images as more realistic in 80.39% of pairs, indicating our method enhances realism. However, FID scores do not align with this human assessment, echoing prior work on perceptual judgment gaps [27]. Instead, distorting images along the opposite slider direction improves FID, though users still prefer the realism-enhancing direction. We provide more details about the user studies in the appendix.
7 Ablations
We analyze the two key components of our method to verify that they are both necessary: (1) the disentanglement formulation and (2) low-rank adaptation. Table 3 shows quantitative measures on 2500 images, and Figure 11 shows qualitative differences. In both quantitative and qualitative measures, we find that the disentanglement objective from Eq. 8 succeeds in isolating the edit from unwanted attributes (Fig. 11.c); for example, without this objective we see undesired changes in gender when asking for age, as reflected in the Interference metric in Table 3, which measures the percentage of samples with changed race/gender when making the edit. The low-rank constraint is also helpful: it has the effect of precisely capturing the edit direction with better generalization (Fig. 11.d); for example, note how the background and the clothing are better preserved in Fig. 11.b. Since LoRA is parameter-efficient, it also has the advantage of enabling lightweight modularity. We also note that the SDEdit-inspired inference technique allows us to use a wider range of alpha values, increasing editing capacity without losing image structure: it expands the usable range of alpha before coherence declines relative to the original image. We provide more details in the Appendix.
| | Ours | w/o Disentanglement | w/o Low Rank |
|---|---|---|---|
| CLIP | 3.93 | 3.39 | 3.18 |
| LPIPS | 0.06 | 0.17 | 0.23 |
| Interference | 0.10 | 0.36 | 0.19 |

Table 3: The significantly lower Interference metric, which measures the percentage of samples whose gender/race changes compared to the original image, shows that the disentanglement formulation enables precise control over the age direction. Using the LoRA adaptor, the slider achieves finer editing in both structure and edit direction, as evidenced by the improved LPIPS and Interference. Concept strength is maintained, with similar scores across the ablations.
8 Limitations
While the disentanglement formulation reduces unwanted interference between edits, we still observe some residual effects as shown in Table 3 for our sliders. This highlights the need for more careful selection of the latent directions to preserve, preferably an automated method, in order to further reduce edit interference. Further study is required to determine the optimal set of directions that minimizes interference while retaining edit fidelity. We also observe that while the inference SDEdit technique helps preserve image structure, it can reduce edit intensity compared to the inference-time method, as shown in Table 1. The SDEdit approach appears to trade off edit strength for improved structural coherence. Further work is needed to determine if the edit strength can be improved while maintaining high fidelity to the original image.
9 Conclusion
Concept Sliders are a simple and scalable new paradigm for interpretable control of diffusion models. By learning precise semantic directions in latent space, sliders enable intuitive and generalized control over image concepts. The approach provides a new level of flexibility beyond text-driven, image-specific diffusion model editing methods, because Concept Sliders allow continuous, single-pass adjustments without extra inference. Their modular design further enables overlaying many sliders simultaneously, unlocking complex multi-concept image manipulation.
We have demonstrated the versatility of Concept Sliders by measuring their performance on Stable Diffusion XL and Stable Diffusion 1.4. We have found that sliders can be created from textual descriptions alone to control abstract concepts with minimal interference with unrelated concepts, outperforming previous methods. We have demonstrated and measured the efficacy of sliders for nuanced visual concepts that are difficult to describe by text, derived from small artist-created image datasets. We have shown that Concept Sliders can be used to transfer StyleGAN latents into diffusion models. Finally, we have conducted a human study that verifies the high quality of Concept Sliders that enhance and correct hand distortions. Our code and data will be made publicly available.
Acknowledgments
We thank Jaret Burkett (aka Ostris) for the continued discussion on the image slider method and for sharing their eye size dataset. RG and DB are supported by Open Philanthropy.
Code
Our methods are available as open-source code. Source code, trained sliders, and data sets for reproducing our results can be found at sliders.baulab.info and at https://github.com/rohitgandikota/sliders.
References
- Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
- Burkett [2023] Jarret Burkett. Ostris/ai-toolkit: Various ai scripts. mostly stable diffusion stuff., 2023.
- Ceylan et al. [2023] Duygu Ceylan, Chun-Hao Huang, and Niloy J. Mitra. Pix2video: Video editing using image diffusion. In International Conference on Computer Vision (ICCV), 2023.
- Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- Efron [2011] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- Gandikota et al. [2023] Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
- Gandikota et al. [2024] Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, and David Bau. Unified concept editing in diffusion models. IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.
- Google [2022] Google. Imagen, unprecedented photorealism x deep level of language understanding, 2022.
- Haas et al. [2023] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. arXiv preprint arXiv:2303.11073, 2023.
- Härkönen et al. [2020] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems, 33:9841–9850, 2020.
- Heng and Soh [2023] Alvin Heng and Harold Soh. Selective amnesia: A continual learning approach to forgetting in deep generative models. arXiv preprint arXiv:2305.10120, 2023.
- Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Inui [2023] Norm Inui. Sd/sdxl tricks beneath the papers and codes, 2023.
- Jahanian et al. [2019] Ali Jahanian, Lucy Chai, and Phillip Isola. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171, 2019.
- Betker et al. [2023] James Betker et al. Improving image generation with better captions. OpenAI Reports, 2023.
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- Karras et al. [2021] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863, 2021.
- Kawar et al. [2022] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
- Kim et al. [2023] Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, and Juho Lee. Towards safe self-distillation of internet-scale text-to-image diffusion models. arXiv preprint arXiv:2307.05977, 2023.
- Kumari et al. [2023a] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In International Conference on Computer Vision (ICCV), 2023a.
- Kumari et al. [2023b] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
- Kwon et al. [2022] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960, 2022.
- Kynkäänniemi et al. [2022] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. arXiv preprint arXiv:2203.06026, 2022.
- Liu et al. [2022] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.
- Luo [2022] Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
- Meng et al. [2021] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Mothrider [2022] Mothrider. “can an ai draw hands?”, 2022.
- Orgad et al. [2023] Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. Editing implicit assumptions in text-to-image diffusion models. In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
- Park et al. [2023a] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. arXiv preprint arXiv:2307.12868, 2023a.
- Park et al. [2023b] Yong-Hyun Park, Mingi Kwon, Junghyo Jo, and Youngjung Uh. Unsupervised discovery of semantic latent directions in diffusion models. arXiv preprint arXiv:2302.12469, 2023b.
- Parmar et al. [2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
- Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
- Ryu [2023] Simo Ryu. Cloneofsimo/lora: Using low-rank adaptation to quickly fine-tune diffusion models, 2023.
- Schramowski et al. [2022] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. arXiv preprint arXiv:2211.05105, 2022.
- Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
- Shen et al. [2020] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Staffell [2023] Staffell. The sheer number of options and sliders using stable diffusion is overwhelming., 2023.
- Wu et al. [2023] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2023.
- Wu et al. [2021] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12863–12872, 2021.
- Zhang et al. [2023] Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. arXiv preprint arXiv:2303.17591, 2023.
- Zhou [2023] Tingrui Zhou. Github - p1atdev/leco: Low-rank adaptation for erasing concepts from diffusion models., 2023.
- Zllrunning [2019] Zllrunning. Using modified bisenet for face parsing in pytorch, 2019.
- Zou et al. [2023] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.
Supplementary Material
10 Disentanglement Formulation
We visualize the rationale behind our disentangled formulation for sliders. When training sliders on a single pair of prompts, the directions are sometimes entangled with unintended directions. For example, as we show in Figure 11, controlling age can interfere with gender or race. We therefore propose using multiple paired prompts for finding a disentangled direction. As shown in Figure 12, we explicitly define the preservation directions (dotted blue lines) to find a new edit direction (solid blue line) invariant to the preserved features.
11 SDEdit Analysis
We ablate SDEdit’s contribution by fixing the slider scale while varying the SDEdit timestep over 2,500 images. Figure 13 shows inverse trends between LPIPS and CLIP distances as the SDEdit time increases. Using more SDEdit timesteps maintains structure, as evidenced by a lower LPIPS score, while keeping the CLIP score lower. This enables larger slider scales before risking structural changes. We notice that, on average, timesteps 750-850 offer the best of both worlds, with spatial structure preservation and increased efficacy.
12 Textual Concepts Sliders
We quantify slider efficacy and control via CLIP score change and LPIPS distance over 15 sliders at 12 scales in Figure 14. CLIP score change validates concept modification strength. Tighter LPIPS distributions demonstrate precise spatial manipulation without distortion across scales. We show additional qualitative examples for textual concept sliders in Figures 27-32.
12.1 Baseline Details
We compare our method against Prompt-to-Prompt and a novel inference-time prompt composition method. For Prompt-to-Prompt we use the official implementation code (https://github.com/google/prompt-to-prompt/). We use the Refinement strategy they propose, where a new token is added to the existing prompt for image editing. For example, for the images in Figure 15, we add the token “old” to the original prompt “picture of person” to make it “picture of old person”. For the composition method, we use the principles from Liu et al. (https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/). Specifically, we compose the score functions coming from both “picture of person” and “old person” through additive guidance. We also utilize the SDEdit technique for this method to allow finer image editing.
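For reference, a sketch of the additive-guidance composition used in this baseline, layered on top of standard classifier-free guidance; the function signature and guidance scales are illustrative.

```python
import torch

@torch.no_grad()
def composed_noise_pred(unet, latents, t, c_base, c_attr, c_uncond, cfg_scale=7.5, attr_scale=3.0):
    """Compose score functions additively: steer the base-prompt prediction toward the attribute prompt."""
    e_uncond = unet(latents, t, c_uncond)
    e_base = unet(latents, t, c_base)      # e.g. embedding of "picture of person"
    e_attr = unet(latents, t, c_attr)      # e.g. embedding of "old person"
    return e_uncond + cfg_scale * (e_base - e_uncond) + attr_scale * (e_attr - e_uncond)
```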
12.2 Entanglement
The baselines are sometimes prone to interference between concepts when editing a particular concept. Table 4 shows a quantitative analysis of interference, while Figure 15 shows some qualitative examples. We find that Prompt-to-Prompt and inference-time composition can sometimes change race/gender when editing age. Our sliders with the disentanglement objective of Eq. 8 show minimal interference, as seen in the Interference metric, which reports the percentage of samples with race or gender changed out of the 2500 images we tested. We also find through the LPIPS metric that our method shows finer editing capabilities. We reach similar conclusions from the qualitative samples in Figure 15: P2P and composition can alter gender, race, or both when controlling age.
| | P2P | Composition | Ours |
|---|---|---|---|
| CLIP | 1.10 | 3.14 | 3.93 |
| LPIPS | 0.15 | 0.13 | 0.06 |
| Interference | 0.33 | 0.38 | 0.10 |

Table 4: The significantly lower Interference metric, which measures the percentage of samples whose gender/race changes compared to the original image, shows that our disentanglement formulation enables precise control over the age direction. Using the LoRA adaptor, our sliders achieve finer editing, as evidenced by the improved LPIPS and Interference.
13 Visual Concept
13.1 Baseline Details
We compare our method to two image customization baselines: Custom Diffusion (https://github.com/adobe-research/custom-diffusion) and Textual Inversion (https://github.com/rinongal/textual_inversion). For fair comparison, we use the official implementations of both, modifying Textual Inversion to support SDXL. These baselines learn concepts from concept-labeled image sets. However, this approach risks entangling concepts with irrelevant attributes (e.g. hair, skin tone) that correlate spuriously in the dataset, limiting diversity.
13.2 Precise Concept Capturing
Figure 16 shows non-cherry-picked customization samples from all methods trained on the large-eyes Ostris dataset (https://github.com/ostris/ai-toolkit). While exhibiting some diversity, samples frequently include irrelevant attributes correlated with large eyes in the dataset, e.g. blonde hair in Custom Diffusion, blue eyes in Textual Inversion. In contrast, our paired image training isolates concepts by exposing only local attribute changes, avoiding spurious correlation learning.
14 Composing Sliders
We show a 2-dimensional slider by composing “cooked” and “fine dining” food sliders in Figure 17. Next, we show the progressive composition of sliders one by one in Figures 18 and 19. Starting from the top-left image (original SDXL), we progressively generate images by composing an additional slider at each step. We show how our sliders provide semantic control over images.
15 Editing Real Images
Concept Sliders can also be used to edit real images. Manually engineering a prompt to generate an image similar to the real image is very difficult. We use null-text inversion (https://null-text-inversion.github.io), which finetunes the unconditional text embedding in the classifier-free guidance during inference. This allows us to find the right setup to treat the real image as a diffusion-model-generated image. Figure 20 shows Concept Sliders used on real images to precisely control attributes in them.
16 Sliders to Improve Image Quality
We provide more qualitative examples for the "fix hands" slider in Figure 21. We also show additional examples for the "repair" slider in Figures 22-24.
16.1 Details about User Studies
We conduct two human evaluations analyzing our “repair” and “fix hands” sliders. For “fix hands”, we generate 150 images each from SDXL and from our slider using matched seeds and prompts. We randomly show each image to an odd number of users and have them select issues with the hands: 1) misplaced/distorted fingers, 2) incorrect number of fingers, 3) none, as shown in Figure 25.
62% of the 150 SDXL images have hand issues as rated by a majority of users. In contrast, only 22% of our method’s images have hand issues, validating effectiveness of our fine-grained control.
We conduct an A/B test to evaluate the efficacy of our proposed “repair” slider. The test set consists of 300 image pairs (Fig. 26), where each pair contains an original image alongside the output of our method when applied to that image with the same random seed. The left/right placement of these two images is randomized. Through an online user study, we task raters to select the image in each pair that exhibits fewer flaws or distortions, and to describe the reasoning behind their choice as a sanity check. For example, one rater selected the original image in Fig. 22.a, commenting that “The left side image is not realistic because the chair is distorted.” Similarly, a user commented “Giraffes heads are separate unlikely in other image” for Fig. 23.c. Across all 300 pairs, our “repair” slider output is preferred as having fewer artifacts by 80.39% of raters. This demonstrates that the slider effectively reduces defects relative to the original. We manually filter out responses with generic comments (e.g., “more realistic”), as the sanity check prompts raters for specific reasons. After this filtering, 250 pairs remain for analysis.