
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

Rohit Gandikota1       Joanna Materzyńska2       Tingrui Zhou3      Antonio Torralba2       David Bau1

1Northeastern University  2Massachusetts Institute of Technology  3 Independent Researcher
Abstract

We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at sliders.baulab.info

Figure 1: Given a small set of text prompts or paired image data, our method identifies low-rank directions in diffusion parameter space for targeted concept control with minimal interference to other attributes. These directions can be derived from pairs of opposing textual concepts or artist-created images, and they are composable for complex multi-attribute control. We demonstrate the effectiveness of our method by fixing distorted hands in Stable Diffusion outputs and transferring disentangled StyleGAN latents into diffusion models.

1 Introduction

Artistic users of text-to-image diffusion models [4, 37, 36, 9, 19] often need finer control over the visual attributes and concepts expressed in a generated image than currently possible. Using only text prompts, it can be challenging to precisely modulate continuous attributes such as a person’s age or the intensity of the weather, and this limitation hinders creators’ ability to adjust images to match their vision [43]. In this paper, we address these needs by introducing interpretable Concept Sliders that allow nuanced editing of concepts within diffusion models. Our method empowers creators with high-fidelity control over the generative process as well as image editing. Our code and trained sliders will be open sourced.

1 [gandikota.ro, davidbau]@northeastern.edu   2 [jomat, torralba]@mit.edu   3 shu_teiei@outlook.jp

Concept Sliders solve several problems that are not well-addressed by previous methods. Direct prompt modification can control many image attributes, but changing the prompt often drastically alters overall image structure due to the sensitivity of outputs to the prompt-seed combination [38, 44, 22]. Post-hoc techniques such as Prompt-to-Prompt [13] and Pix2Video [3] enable editing visual concepts in an image by inverting the diffusion process and modifying cross-attentions. However, those methods require separate inference passes for each new concept and can support only a limited set of simultaneous edits. They require engineering a prompt suitable for an individual image rather than learning a simple generalizable control, and if not carefully prompted, they can introduce entanglement between concepts, such as altering race when modifying age (see Appendix). In contrast, Concept Sliders provide lightweight plug-and-play adaptors applied to pre-trained models that enable precise, continuous control over desired concepts in a single inference pass, with efficient composition (Figure 6) and minimal entanglement (Figure 11).

Each Concept Slider is a low-rank modification of the diffusion model. We find that the low-rank constraint is a vital aspect of precision control over concepts: while finetuning without low-rank regularization reduces precision and generative image quality, low-rank training identifies the minimal concept subspace and results in controlled, high-quality, disentangled editing (Figure 11). Post-hoc image editing methods that act on single images rather than model parameters cannot benefit from this low-rank framework.

Concept Sliders also allow editing of visual concepts that cannot be captured by textual descriptions; this distinguishes them from prior concept editing methods that rely on text [7, 8]. While image-based model customization methods [25, 38, 6] can add new tokens for new image-based concepts, those are difficult to use for image editing. In contrast, Concept Sliders allow an artist to provide a handful of paired images to define a desired concept; the Concept Slider then generalizes the visual concept and applies it to other images, even in cases where it would be infeasible to describe the transformation in words.

Other generative image models, such as GANs, have previously exhibited latent spaces that provide highly disentangled control over generated outputs. In particular, it has been observed that StyleGAN [20] stylespace neurons offer detailed control over many meaningful aspects of images that would be difficult to describe in words [45]. To further demonstrate the capabilities of our approach, we show that it is possible to create Concept Sliders that transfer latent directions from StyleGAN’s style space trained on FFHQ face images [20] into diffusion models. Notably, despite originating from a face dataset, our method successfully adapts these latents to enable nuanced style control over diverse image generation. This showcases how diffusion models can capture the complex visual concepts represented in GAN latents, even those that may not correspond to any textual description.

We demonstrate that the expressiveness of Concept Sliders is powerful enough to address two particularly practical applications—enhancing realism and fixing hand distortions. While generative models have made significant progress in realistic image synthesis, the latest generation of diffusion models such as Stable Diffusion XL [36] are still prone to synthesizing distorted hands with anatomically implausible extra or missing fingers [31], as well as warped faces, floating objects, and distorted perspectives. Through a perceptual user study, we validate that a Concept Slider for “realistic image” as well as another for “fixed hands” both create a statistically significant improvement in perceived realism without altering image content.

Concept Sliders are modular and composable. We find that over 50 unique sliders can be composed without degrading output quality. This versatility gives artists a new universe of nuanced image control that allows them to blend countless textual, visual, and GAN-defined Concept Sliders. Because our method bypasses standard prompt token limits, it empowers more complex editing than achievable through text alone.

2 Related Works

Image Editing

Recent methods propose different approaches for single image editing in text-to-image diffusion models. They mainly focus on manipulation of cross-attentions of a source image and a target prompt [13, 22, 35], or use a conditional input to guide the image structure [30]. Unlike those methods that are applied to a single image, our model creates a semantic change defined by a small set of text pairs or image pairs, applied to the entire model. Analyzing diffusion models through Riemannian geometry, Park et al. [33] discovered local latent bases that enable semantic editing by traversing the latent space. Their analysis also revealed the evolving geometric structure over timesteps across prompts, requiring per-image latent basis optimization. In contrast, we identify generalizable parameter directions, without needing custom optimization for each image. Instruct-pix2pix [1] finetunes a diffusion model to condition image generation on both an input image and text prompt. This enables a wide range of text-guided editing, but lacks fine-grained control over edit strength or visual concepts not easily described textually.

Guidance Based Methods

Ho et al. [14] introduce classifier-free guidance, which improves image quality and text-image alignment by driving the data distribution towards the prompt and away from the unconditional output. Liu et al. [28] present an inference-time guidance formulation to enhance concept composition and negation in diffusion models. By adding guidance terms during inference, their method improves on the limited inherent compositionality of diffusion models. SLD [40] proposes using guidance to moderate unsafe concepts in diffusion models. They propose a safe prompt which is used to guide the output away from unsafe content during inference.

Model Editing

Our method can be seen as a model editing approach: by applying a low-rank adaptor, we single out a semantic attribute and allow for continuous control with respect to that attribute. To personalize models with new concepts, customization methods based on finetuning exist [38, 25, 6]. Custom Diffusion [25] proposes a way to incorporate new visual concepts into pretrained diffusion models by finetuning only the cross-attention layers. On the other hand, Textual Inversion [6] introduces new textual concepts by optimizing an embedding vector to activate desired model capabilities. Previous works [7, 24, 23, 12, 46] proposed gradient-based fine-tuning methods for the permanent erasure of a concept from a model. Ryu et al. [39] proposed adapting LoRA [16] for diffusion model customization. Recent work [47] developed low-rank implementations of concept erasure [7], allowing the strength of erasure in an image to be adjusted. [17] implemented image-based control of concepts by merging two overfitted LoRAs to capture an edit direction. Similarly, [8, 32] proposed closed-form solutions for debiasing, redacting, or moderating concepts within the model's cross-attention weights. Our method does not modify the underlying text-to-image diffusion model and can be applied as a plug-and-play module easily stacked across different attributes.

Semantic Direction in Generative models

In Generative Adversarial Networks (GANs), manipulation of semantic attributes has been widely studied. Latent space trajectories have been found in a self-supervised manner [18]. PCA has been used to identify semantic directions in the latent or feature spaces [11]. Latent subspaces corresponding to detailed face attributes have been analyzed [42]. For diffusion models, semantic latent spaces have been suggested to exist in the middle layers of the U-Net architecture [26, 34]. It has been shown that principal directions in diffusion model latent spaces (h-spaces) capture global semantics [10]. Our method directly trains low-rank subspaces corresponding to semantic attributes. By optimizing for specific global directions using text or image pairs as supervision, we obtain precise and localized editing directions. Recent work [49] introduced a low-rank representation adapter, which employs a contrastive loss to fine-tune LoRA to achieve fine-grained control of concepts in language models.

3 Background

3.1 Diffusion Models

Diffusion models are a subclass of generative models that operationalize the concept of reversing a diffusion process to synthesize data. Initially, the forward diffusion process gradually adds noise to the data, transitioning it from an organized state $x_{0}$ to complete Gaussian noise $x_{T}$. At any timestep $t$, the noised image is modelled as:

$x_{t}\leftarrow\sqrt{1-\beta_{t}}\,x_{0}+\sqrt{\beta_{t}}\,\epsilon$   (1)

where $\epsilon$ is randomly sampled Gaussian noise with zero mean and unit variance. Diffusion models aim to reverse this diffusion process by sampling random Gaussian noise $x_{T}$ and gradually denoising it to generate an image $x_{0}$. In practice [15, 29], the objective of the diffusion model is simplified to predicting the true noise $\epsilon$ from Eq. 1 when $x_{t}$ is fed as input along with additional inputs such as the timestep $t$ and conditioning $c$:

$\nabla_{\theta}\left\|\epsilon-\epsilon_{\theta}(x_{t},c,t)\right\|^{2}$   (2)

where $\epsilon_{\theta}(x_{t},c,t)$ is the noise predicted by the diffusion model conditioned on $c$ at timestep $t$. In this work, we work with Stable Diffusion [37] and Stable Diffusion XL [36], which are latent diffusion models that improve efficiency by operating in the lower-dimensional latent space $z$ of a pre-trained variational autoencoder. They convert the images to a latent space and run the diffusion training as discussed above. Finally, they decode the latent $z_{0}$ through the VAE decoder to get the final image $x_{0}$.
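The following PyTorch sketch illustrates the noising process of Eq. 1 and the simplified training objective of Eq. 2. The UNet call signature, the beta schedule, and the helper names are illustrative assumptions rather than the Stable Diffusion implementation.

import torch

def noise_image(x0, t, betas):
    """Produce x_t from x_0 at per-sample timesteps t, following the form of Eq. 1."""
    beta_t = betas[t].view(-1, 1, 1, 1)           # broadcast over (B, C, H, W)
    eps = torch.randn_like(x0)                    # zero-mean, unit-variance Gaussian noise
    x_t = torch.sqrt(1.0 - beta_t) * x0 + torch.sqrt(beta_t) * eps
    return x_t, eps

def diffusion_loss(unet, x0, cond, betas):
    """Simplified denoising objective of Eq. 2: predict the injected noise."""
    t = torch.randint(0, len(betas), (x0.shape[0],), device=x0.device)
    x_t, eps = noise_image(x0, t, betas)
    eps_pred = unet(x_t, cond, t)                 # assumed signature: eps_theta(x_t, c, t)
    return torch.mean((eps - eps_pred) ** 2)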

3.2 Low-Rank Adaptors

The Low-Rank Adaptation (LoRA) [16] method enables efficient adaptation of large pre-trained language models to downstream tasks by decomposing the weight update $\Delta W$ during fine-tuning. Given a pre-trained model layer with weights $W_{0}\in\mathbb{R}^{d\times k}$, where $d$ is the input dimension and $k$ the output dimension, LoRA decomposes $\Delta W$ as

$\Delta W=BA$   (3)

where $B\in\mathbb{R}^{d\times r}$ and $A\in\mathbb{R}^{r\times k}$, with $r\ll\min(d,k)$ a small rank that constrains the update to a low-dimensional subspace. By freezing $W_{0}$ and only optimizing the smaller matrices $A$ and $B$, LoRA achieves massive reductions in trainable parameters. During inference, $\Delta W$ can be merged into $W_{0}$ with no overhead by a LoRA scaling factor $\alpha$:

$W=W_{0}+\alpha\Delta W$   (4)
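As a concrete illustration, the sketch below wraps a frozen linear layer with a low-rank update following Eq. 3, exposing the scale $\alpha$ of Eq. 4 as an attribute that can be changed at inference time. The rank, the initialization, and the choice not to merge weights are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W_0 plus a trainable low-rank update alpha * (B A)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                                        # frozen W_0
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.alpha = 1.0                                        # slider scale, adjustable at inference

    def forward(self, x):
        # Eq. 4 applied without merging: W_0 x + alpha * (B A) x
        return self.base(x) + self.alpha * (x @ self.A.t() @ self.B.t())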
Figure 2: Concept Sliders are created by fine-tuning LoRA adaptors using a guided score that enhances attribute $c_{+}$ while suppressing attribute $c_{-}$ from the target concept $c_{t}$. The slider model generates samples $x_{t}$ by partially denoising Gaussian noise over time steps 1 to $t$, conditioned on the target concept $c_{t}$.

4 Method

Concept Sliders are a method for fine-tuning LoRA adaptors on a diffusion model to enable concept-targeted image control, as shown in Figure 2. Our method learns low-rank parameter directions that increase or decrease the expression of specific attributes when conditioned on a target concept. Given a target concept $c_{t}$ and model $\theta$, our goal is to obtain $\theta^{*}$ that modifies the likelihood of attributes $c_{+}$ and $c_{-}$ in image $X$ when conditioned on $c_{t}$: increase the likelihood of attribute $c_{+}$ and decrease the likelihood of attribute $c_{-}$.

$P_{\theta^{*}}(X|c_{t})\leftarrow P_{\theta}(X|c_{t})\left(\frac{P_{\theta}(c_{+}|X)}{P_{\theta}(c_{-}|X)}\right)^{\eta}$   (5)

where $P_{\theta}(X|c_{t})$ represents the distribution generated by the original model when conditioned on $c_{t}$. Expanding $P(c_{+}|X)=\frac{P(X|c_{+})P(c_{+})}{P(X)}$, the gradient of the log probability $\nabla\log P_{\theta^{*}}(X|c_{t})$ would be proportional to:

$\nabla\log P_{\theta}(X|c_{t})+\eta\left(\nabla\log P_{\theta}(X|c_{+})-\nabla\log P_{\theta}(X|c_{-})\right)$   (6)

Based on Tweedie’s formula [5] and the reparametrization trick of [15], we can introduce a time-varying noising process and express each score (gradient of log probability) as a denoising prediction $\epsilon(X,c_{t},t)$. Thus Eq. 6 becomes:

$\epsilon_{\theta^{*}}(X,c_{t},t)\leftarrow\epsilon_{\theta}(X,c_{t},t)+\eta\left(\epsilon_{\theta}(X,c_{+},t)-\epsilon_{\theta}(X,c_{-},t)\right)$   (7)

The proposed score function in Eq. 7 shifts the distribution of the target concept $c_{t}$ to exhibit more attributes of $c_{+}$ and fewer attributes of $c_{-}$. In practice, we notice that a single prompt pair can sometimes identify a direction that is entangled with other undesired attributes. We therefore incorporate a set of preservation concepts $p\in\mathcal{P}$ (for example, race names while editing age) to constrain the optimization. Instead of simply increasing $P_{\theta}(c_{+}|X)$, we aim to increase, for every $p$, $P_{\theta}((c_{+},p)|X)$, and reduce $P_{\theta}((c_{-},p)|X)$. This leads to the disentanglement objective:

$\epsilon_{\theta^{*}}(X,c_{t},t)\leftarrow\epsilon_{\theta}(X,c_{t},t)+\eta\sum_{p\in\mathcal{P}}\left(\epsilon_{\theta}(X,(c_{+},p),t)-\epsilon_{\theta}(X,(c_{-},p),t)\right)$   (8)

The disentanglement objective in Equation 8 finetunes the Concept Slider modules while keeping pre-trained weights fixed. Crucially, the LoRA formulation in Equation 4 introduces a scaling factor $\alpha$ that can be modified at inference time. This scaling parameter $\alpha$ allows adjusting the strength of the edit, as shown in Figure 1. Increasing $\alpha$ makes the edit stronger without retraining the model. The previous model editing method [7] obtains a stronger edit by retraining with increased guidance $\eta$ in Eq. 8. However, simply scaling $\alpha$ at inference time produces the same strengthening effect, without costly retraining.
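A hedged sketch of one training step under the disentanglement objective of Eq. 8 follows. Here frozen_eps denotes the frozen pre-trained UNet and slider_eps the same UNet with the LoRA adaptor enabled; their call signatures and the way a preservation concept p is combined with the attribute prompts are simplifying assumptions.

import torch

def slider_training_loss(frozen_eps, slider_eps, x_t, t, c_target, c_plus, c_minus,
                         preserve_concepts, eta=1.0):
    """Match the LoRA-equipped model to the guided score of Eq. 8."""
    with torch.no_grad():
        target = frozen_eps(x_t, c_target, t)
        for p in preserve_concepts:               # e.g. race terms when training an age slider
            target = target + eta * (frozen_eps(x_t, (c_plus, p), t)
                                     - frozen_eps(x_t, (c_minus, p), t))
    pred = slider_eps(x_t, c_target, t)           # only the LoRA parameters receive gradients
    return torch.mean((pred - target) ** 2)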

4.1 Learning Visual Concepts from Image Pairs

We propose sliders to control nuanced visual concepts that are harder to specify using text prompts. We leverage small paired before/after image datasets to train sliders for these concepts. The sliders learn to capture the visual concept through the contrast between image pairs ($x^{A}$, $x^{B}$).

Our training process optimizes the LoRA applied in both the negative and positive directions. We write $\epsilon_{\theta_{+}}$ for the application of the positive LoRA and $\epsilon_{\theta_{-}}$ for the negative case. Then we minimize the following loss:

$\left\|\epsilon_{\theta_{-}}(x^{A}_{t},\text{‘ ’},t)-\epsilon\right\|^{2}+\left\|\epsilon_{\theta_{+}}(x^{B}_{t},\text{‘ ’},t)-\epsilon\right\|^{2}$   (9)

This causes the LoRA to align to a direction that produces the visual effect of A in the negative direction and of B in the positive direction. Defining directions visually in this way not only allows an artist to define a Concept Slider through custom artwork; it is also the same method we use to transfer latents from other generative models such as StyleGAN.
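A minimal sketch of this paired-image objective (Eq. 9) follows. The lora.set_scale hook for flipping the adaptor between the negative and positive directions is an assumed convenience; the noised latents and the sampled noise are passed in precomputed.

import torch

def visual_slider_loss(unet_with_lora, lora, xA_t, xB_t, eps, t, empty_cond):
    """xA_t / xB_t: the 'before'/'after' images noised with the same eps at timestep t."""
    lora.set_scale(-1.0)                          # negative direction should reproduce concept A
    loss_A = torch.mean((unet_with_lora(xA_t, empty_cond, t) - eps) ** 2)

    lora.set_scale(+1.0)                          # positive direction should reproduce concept B
    loss_B = torch.mean((unet_with_lora(xB_t, empty_cond, t) - eps) ** 2)
    return loss_A + loss_B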

Figure 3: Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure. Traversing the sliders towards the negative direction produces an opposing effect on the attributes.

5 Experiments

We evaluate our approach primarily on Stable Diffusion XL [36], a high-resolution 1024-pixel model, and we conduct additional experiments on SD v1.4 [37]. All models are trained for 500 epochs. We demonstrate generalization by testing sliders on diverse prompts - for example, we evaluate our "person" slider on prompts like "doctor", "man", "woman", and "barista". For inference, we follow the SDEdit technique of Meng et al. [30]: to maintain structure and semantics, we use the original pre-trained model for the first $t$ steps, setting the LoRA adaptor multipliers to 0 and retaining the pre-trained model priors. We then turn on the LoRA adaptor for the remaining steps.
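The sketch below illustrates this inference schedule: the slider is kept at scale 0 for the first steps so the pre-trained priors fix the structure, and is then switched on for the remaining steps. The sampler-step call, the latent shape, and the lora.set_scale hook are illustrative assumptions rather than the diffusers API.

import torch

@torch.no_grad()
def sample_with_slider(unet, lora, sampler, cond, timesteps, slider_scale, switch_step):
    x = torch.randn(1, 4, 128, 128)                        # assumed SDXL latent shape
    for i, t in enumerate(timesteps):
        lora.set_scale(0.0 if i < switch_step else slider_scale)
        eps = unet(x, cond, t)                             # noise prediction at this step
        x = sampler.step(eps, t, x)                        # one denoising update
    return x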

5.1 Textual Concept Sliders

We validate the efficacy of our slider method on a diverse set of 30 text-based concepts, with full examples in the Appendix. Table 1 compares our method against two baselines: an approach we propose inspired by SDEdit [30] and Liu et al. [28], which uses a pretrained model with the standard prompt for $t$ timesteps and then starts composing by adding prompts to steer the image, and Prompt-to-Prompt [13], which leverages cross-attention for image editing after generating reference images. While the former baseline is novel, all three enable finer control but differ in how edits are applied. Our method directly generates 2500 edited images per concept, like "image of a person", by setting the scale parameter at inference. In contrast, the baselines require additional inference passes for each new concept (e.g. "old person"), adding computational overhead. Our method consistently achieves higher CLIP scores and lower LPIPS versus the original, indicating greater coherence while enabling precise control. The baselines are also more prone to entanglement between concepts. We provide further analysis and details about the baselines in the Appendix.
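For reference, a hedged sketch of the two metrics reported here follows: the change in CLIP score for the edit prompt (ΔCLIP) and the LPIPS distance to the unedited image. The model checkpoints and preprocessing are standard choices, not necessarily the exact configuration behind the numbers above.

import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lpips_fn = lpips.LPIPS(net="alex")

def clip_score(pil_image, text):
    inputs = clip_proc(text=[text], images=pil_image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip_model(**inputs).logits_per_image.item()   # scaled image-text similarity

def evaluate_edit(orig_pil, edit_pil, orig_tensor, edit_tensor, edit_prompt):
    """orig_tensor / edit_tensor: (1, 3, H, W) images scaled to [-1, 1] for LPIPS."""
    delta_clip = clip_score(edit_pil, edit_prompt) - clip_score(orig_pil, edit_prompt)
    with torch.no_grad():
        distance = lpips_fn(orig_tensor, edit_tensor).item()
    return delta_clip, distance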

Figure 3 shows typical qualitative examples, which maintain good image structure while enabling fine-grained editing of the specified concept.

         Prompt2Prompt     Our Method        Composition
         ΔCLIP   LPIPS     ΔCLIP   LPIPS     ΔCLIP   LPIPS
Age      1.10    0.15      3.93    0.06      3.14    0.13
Hair     3.45    0.15      5.59    0.10      5.14    0.15
Sky      0.43    0.15      1.56    0.13      1.55    0.14
Rusty    7.67    0.25      7.60    0.09      6.67    0.18
Table 1: Compared to Prompt-to-Prompt [13], our method achieves comparable efficacy in terms of ΔCLIP score while inducing finer edits as measured by LPIPS distance to the original image. The ΔCLIP metric measures the change in CLIP score between the original and edited images when evaluated on the text prompt describing the desired edit. Results are shown for a single positive scale of the trained slider.

5.2 Visual Concept Sliders

Some visual concepts like precise eyebrow shapes or eye sizes are challenging to control through text prompts alone. To enable sliders for these granular attributes, we leverage paired image datasets combined with optional text guidance. As shown in Figure 4, we create sliders for "eyebrow shape" and "eye size" using image pairs capturing the desired transformations. We can further refine the eyebrow slider by providing the text "eyebrows" so the direction focuses on that facial region. Using image pairs with different scales, like the eye sizes from Ostris [2], we can create sliders with stepwise control over the target attribute.

Figure 4: Controlling fine-grained attributes like eyebrow shape and eye size using image pair-driven concept sliders with optional text guidance. The eye size slider scales from small to large eyes using the Ostris dataset [2].

We quantitatively evaluate the eye size slider by detecting faces using FaceNet [41], cropping the area, and employing a face parser [48] to measure the eye region across the slider range. Traversing the slider smoothly increases the average eye area 2.75x, enabling precise control as shown in Table 2. Compared to customization techniques like textual inversion [6], which learns a new token, and custom diffusion [25], which fine-tunes cross-attentions, our slider provides more targeted editing without unwanted changes. When model editing methods [25, 6] are used to incorporate new visual concepts, they memorize the training subjects rather than generalizing the contrast between pairs. We provide more details in the Appendix.
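One possible implementation of this measurement is sketched below: detect the face with MTCNN from facenet-pytorch, crop it, segment it with a face parser, and count eye pixels. The parse_face helper and the eye label ids stand in for the BiSeNet-based parser [48] and are assumptions.

import numpy as np
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)
EYE_LABELS = [4, 5]                                # assumed left/right-eye class ids in the parser

def eye_area_ratio(original_img, edited_img, parse_face):
    """Ratio of eye pixels in the edited face crop to those in the original crop."""
    def eye_pixels(img):
        boxes, _ = mtcnn.detect(img)               # face bounding boxes in xyxy format
        x1, y1, x2, y2 = [int(v) for v in boxes[0]]
        crop = np.asarray(img)[y1:y2, x1:x2]
        seg = parse_face(crop)                     # per-pixel label map (hypothetical helper)
        return float(np.isin(seg, EYE_LABELS).sum())
    return eye_pixels(edited_img) / eye_pixels(original_img)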

        Training   Custom      Textual     Our
        Data       Diffusion   Inversion   Method
Δ_eye   1.84       0.97        0.81        1.75
LPIPS   0.03       0.23        0.21        0.06
Table 2: Our results demonstrate the effectiveness of our sliders for intuitive image editing based on visual concepts. The metric Δ_eye represents the ratio of change in eye size compared to the original image. Our method achieves targeted editing of eye size while maintaining similarity to the original image distribution, as measured by LPIPS.

5.3 Sliders transferred from StyleGAN

Figure 5 demonstrates sliders transferred from the StyleGAN-v3 [21] style space trained on the FFHQ [20] dataset. We use the method of [45] to explore the StyleGAN-v3 style space and identify neurons that control hard-to-describe facial features. By scaling these neurons, we collect images to train image-based sliders. We find that Stable Diffusion’s latent space can effectively learn these StyleGAN style neurons, enabling structured facial editing. This enables users to control nuanced concepts that are difficult to describe in words, and StyleGAN makes it easy to generate the paired dataset.

Figure 5: We demonstrate transferring StyleGAN style space latents to the diffusion latent space. We identify three neurons that edit facial structure: neuron 77 controls cheekbone structure, neuron 646 selectively adjusts the left side face width, and neuron 847 edits inter-ocular distance. We transfer these StyleGAN latents to the diffusion model to enable structured facial editing.

5.4 Composing Sliders

A key advantage of our low-rank slider directions is composability - users can combine multiple sliders for nuanced control rather than being limited to one concept at a time. For example, in Figure 6 we show blending "cooked" and "fine dining" food sliders to traverse this 2D concept space. Since our sliders are lightweight LoRA adaptors, they are easy to share and overlay on diffusion models. By downloading interesting slider sets, users can adjust multiple knobs simultaneously to steer complex generations. In Figure 7 we qualitatively show the effects of composing multiple sliders progressively, up to 50 sliders at a time. Creating these 50 sliders involves far more than 77 tokens (the current prompt context limit of SDXL [36]). This showcases the power of our method, which allows control beyond what is possible through prompt-based methods alone. We further validate multi-slider composition in the appendix.
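A sketch of how several sliders can be merged into a single weight follows, extending Eq. 4 to multiple adaptors with independent scales. The (B, A, alpha) bookkeeping is an illustrative convention, not the released implementation.

import torch

def merge_sliders(W0: torch.Tensor, sliders):
    """W0: frozen weight of shape (d, k); sliders: iterable of (B, A, alpha) with B: (d, r), A: (r, k)."""
    W = W0.clone()
    for B, A, alpha in sliders:
        W = W + alpha * (B @ A)                    # each slider contributes its own scaled Delta W
    return W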

Figure 6: Composing two text-based sliders results in a complex control over food images. We show the effect of applying both the "cooked" slider and "fine-dining" slider to a generated image. These sliders can be used in both positive and negative directions.
Figure 7: We show composition capabilities of concept sliders. We progressively compose multiple sliders in each row from left to right, enabling nuanced traversal of high-dimensional concept spaces. We demonstrate composing sliders trained from text prompts, image datasets, and transferred from GANs.
Figure 8: The repair slider enables the model to generate images that are more realistic and undistorted. The parameters under the control of this slider help the model correct some of the flaws in its generated outputs, like distorted humans and pets in (a, b), unnatural objects in (b, c, d), and blurry natural images in (b, c).

6 Concept Sliders to Improve Image Quality

One of the most interesting aspects of large-scale generative models such as Stable Diffusion XL is that, although their image output can often suffer from distortions such as warped or blurry objects, the parameters of the model contain a latent capability to generate higher-quality output with fewer distortions than produced by default. Concept Sliders can unlock these abilities by identifying low-rank parameter directions that repair common distortions.

Fixing Hands

Generating realistic-looking hands is a persistent challenge for diffusion models: for example, hands are typically generated with missing, extra, or misplaced fingers. Yet the tendency to distort hands can be directly controlled by a Concept Slider: Figure 9 shows the effect of a "fix hands" Concept Slider that lets users smoothly adjust images to have more realistic, properly proportioned hands. This parameter direction is found using a complex prompt pair boosting “realistic hands, five fingers, 8k hyper-realistic hands” and suppressing “poorly drawn hands, distorted hands, misplaced fingers”. This slider allows hand quality to be improved with a simple tweak rather than manual prompt engineering.
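For concreteness, the prompt pair quoted above can be written as a slider specification like the one below; only the prompt texts come from the paper, while the field names and hyperparameter values are illustrative assumptions.

fix_hands_slider = {
    "positive_prompt": "realistic hands, five fingers, 8k hyper-realistic hands",   # boosted
    "negative_prompt": "poorly drawn hands, distorted hands, misplaced fingers",    # suppressed
    "preserve_concepts": [],        # no preservation terms are stated for this slider
    "rank": 4,                      # assumed LoRA rank
    "eta": 1.0,                     # assumed guidance strength in Eq. 8
}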

Figure 9: We demonstrate a slider for fixing hands in stable diffusion. We find a direction to steer hands to be more realistic and away from "poorly drawn hands".

To measure the “fix hands" slider, we conduct a user study on Amazon Mechanical Turk. We present 300 random images with hands to raters—half generated by Stable Diffusion XL and half by XL with our slider applied (same seeds and prompts). Raters are asked to assess if the hands appear distorted or not. Across 150 SDXL images, raters find 62% have distorted hands, confirming it as a prevalent problem. In contrast, only 22% of the 150 slider images are rated as having distorted hands.

Repair Slider

In addition to controlling specific concepts like hands, we also demonstrate the use of Concept Sliders to guide generations towards overall greater realism. We identify a single low-rank parameter direction that shifts images away from common quality issues like distorted subjects, unnatural object placement, and inconsistent shapes. As shown in Figures 8 and 10, traversing this “repair" slider noticeably fixes many errors and imperfections.

Figure 10: We demonstrate the effect of our “repair” slider on fine details: it improves the rendering of densely arranged objects, it straightens architectural lines, and it avoids blurring and distortions at the edges of complex shapes.

Through a perceptual study, we evaluate the realism of 250 pairs of slider-adjusted and original SD images. A majority of participants rate the slider images as more realistic in 80.39% of pairs, indicating our method enhances realism. However, FID scores do not align with this human assessment, echoing prior work on perceptual judgment gaps [27]. Instead, distorting images along the opposite slider direction improves FID, though users still prefer the realism-enhancing direction. We provide more details about the user studies in the appendix.

7 Ablations

We analyze the two key components of our method to verify that they are both necessary: (1) the disentanglement formulation and (2) low-rank adaptation. Table 3 shows quantitative measures on 2500 images, and Figure 11 shows qualitative differences. In both quantitative and qualitative measures, we find that the disentanglement objective from Eq. 8 succeeds in isolating the edit from unwanted attributes (Fig. 11c); for example, without this objective we see undesired changes in gender when editing age, as reflected in the Interference metric in Table 3, which measures the percentage of samples with changed race/gender when making the edit. The low-rank constraint is also helpful: it has the effect of precisely capturing the edit direction with better generalization (Fig. 11d); for example, note how the background and the clothing are better preserved in Fig. 11b. Since LoRA is parameter-efficient, it also has the advantage of enabling lightweight modularity. We also note that the SDEdit-inspired inference technique expands the usable range of alpha values, increasing editing capacity before coherence declines relative to the original image, without losing image structure. We provide more details in the Appendix.
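The paper does not specify the classifier behind the Interference metric; one simple way to compute such a metric, sketched below, is zero-shot CLIP classification of a protected attribute before and after the edit. The label prompts are illustrative.

import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
GENDER_LABELS = ["a photo of a man", "a photo of a woman"]   # assumed label set

def predicted_attribute(pil_image, labels=GENDER_LABELS):
    inputs = clip_proc(text=labels, images=pil_image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image       # (1, num_labels)
    return int(logits.argmax(dim=-1))

def interference(original_images, edited_images):
    """Fraction of pairs whose predicted attribute changes after the edit."""
    changed = sum(predicted_attribute(o) != predicted_attribute(e)
                  for o, e in zip(original_images, edited_images))
    return changed / len(original_images)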

               Ours    w/o Disentanglement    w/o Low Rank
Δ_CLIP         3.93    3.39                   3.18
LPIPS          0.06    0.17                   0.23
Interference   0.10    0.36                   0.19
Table 3: The disentanglement formulation enables precise control over the age direction, as shown by the significant reduction in the Interference metric, which measures the percentage of samples with gender/race change compared to the original images. By using LoRA adaptors, sliders achieve finer editing in terms of both structure and edit direction, as evidenced by improvements in LPIPS and Interference. Concept strength is maintained, with similar Δ_CLIP scores across ablations.
Figure 11: The disentanglement objective (Eq. 8) helps avoid undesired attribute changes like change in race or gender when editing age. The low-rank constraint enables a precise edit.

8 Limitations

While the disentanglement formulation reduces unwanted interference between edits, we still observe some residual effects as shown in Table 3 for our sliders. This highlights the need for more careful selection of the latent directions to preserve, preferably an automated method, in order to further reduce edit interference. Further study is required to determine the optimal set of directions that minimizes interference while retaining edit fidelity. We also observe that while the inference SDEdit technique helps preserve image structure, it can reduce edit intensity compared to the inference-time method, as shown in Table 1. The SDEdit approach appears to trade off edit strength for improved structural coherence. Further work is needed to determine if the edit strength can be improved while maintaining high fidelity to the original image.

9 Conclusion

Concept Sliders are a simple and scalable new paradigm for interpretable control of diffusion models. By learning precise semantic directions in latent space, sliders enable intuitive and generalized control over image concepts. The approach provides a new level of flexibility beyond text-driven, image-specific diffusion model editing methods, because Concept Sliders allow continuous, single-pass adjustments without extra inference. Their modular design further enables overlaying many sliders simultaneously, unlocking complex multi-concept image manipulation.

We have demonstrated the versatility of Concept Sliders by measuring their performance on Stable Diffusion XL and Stable Diffusion 1.4. We have found that sliders can be created from textual descriptions alone to control abstract concepts with minimal interference with unrelated concepts, outperforming previous methods. We have demonstrated and measured the efficacy of sliders for nuanced visual concepts that are difficult to describe by text, derived from small artist-created image datasets. We have shown that Concept Sliders can be used to transfer StyleGAN latents into diffusion models. Finally, we have conducted a human study that verifies the high quality of Concept Sliders that enhance and correct hand distortions. Our code and data will be made publicly available.

Acknowledgments

We thank Jaret Burkett (aka Ostris) for the continued discussion on the image slider method and for sharing their eye size dataset. RG and DB are supported by Open Philanthropy.

Code

Our methods are available as open-source code. Source code, trained sliders, and data sets for reproducing our results can be found at sliders.baulab.info and at https://github.com/rohitgandikota/sliders.

References

  • Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  • Burkett [2023] Jarret Burkett. Ostris/ai-toolkit: Various ai scripts. mostly stable diffusion stuff., 2023.
  • Ceylan et al. [2023] Duygu Ceylan, Chun-Hao Huang, and Niloy J. Mitra. Pix2video: Video editing using image diffusion. In International Conference on Computer Vision (ICCV), 2023.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Efron [2011] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  • Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • Gandikota et al. [2023] Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
  • Gandikota et al. [2024] Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, and David Bau. Unified concept editing in diffusion models. IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.
  • Google [2022] Google. Imagen, unprecedented photorealism x deep level of language understanding, 2022.
  • Haas et al. [2023] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. arXiv preprint arXiv:2303.11073, 2023.
  • Härkönen et al. [2020] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems, 33:9841–9850, 2020.
  • Heng and Soh [2023] Alvin Heng and Harold Soh. Selective amnesia: A continual learning approach to forgetting in deep generative models. arXiv preprint arXiv:2305.10120, 2023.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Inui [2023] Norm Inui. Sd/sdxl tricks beneath the papers and codes, 2023.
  • Jahanian et al. [2019] Ali Jahanian, Lucy Chai, and Phillip Isola. On the" steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171, 2019.
  • Betker et al. [2023] James Betker et al. Improving image generation with better captions. OpenAI Reports, 2023.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Karras et al. [2021] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863, 2021.
  • Kawar et al. [2022] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
  • Kim et al. [2023] Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, and Juho Lee. Towards safe self-distillation of internet-scale text-to-image diffusion models. arXiv preprint arXiv:2307.05977, 2023.
  • Kumari et al. [2023a] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In International Conference on Computer Vision (ICCV), 2023a.
  • Kumari et al. [2023b] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  • Kwon et al. [2022] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960, 2022.
  • Kynkäänniemi et al. [2022] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. arXiv preprint arXiv:2203.06026, 2022.
  • Liu et al. [2022] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.
  • Luo [2022] Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
  • Meng et al. [2021] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • Mothrider [2022] Mothrider. “can an ai draw hands?”, 2022.
  • Orgad et al. [2023] Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. Editing implicit assumptions in text-to-image diffusion models. In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
  • Park et al. [2023a] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. arXiv preprint arXiv:2307.12868, 2023a.
  • Park et al. [2023b] Yong-Hyun Park, Mingi Kwon, Junghyo Jo, and Youngjung Uh. Unsupervised discovery of semantic latent directions in diffusion models. arXiv preprint arXiv:2302.12469, 2023b.
  • Parmar et al. [2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  • Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  • Ryu [2023] Simo Ryu. Cloneofsimo/lora: Using low-rank adaptation to quickly fine-tune diffusion models, 2023.
  • Schramowski et al. [2022] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. arXiv preprint arXiv:2211.05105, 2022.
  • Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • Shen et al. [2020] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Staffell [2023] Staffell. The sheer number of options and sliders using stable diffusion is overwhelming., 2023.
  • Wu et al. [2023] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2023.
  • Wu et al. [2021] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12863–12872, 2021.
  • Zhang et al. [2023] Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. arXiv preprint arXiv:2303.17591, 2023.
  • Zhou [2023] Tingrui Zhou. Github - p1atdev/leco: Low-rank adaptation for erasing concepts from diffusion models., 2023.
  • Zllrunning [2019] Zllrunning. Using modified bisenet for face parsing in pytorch, 2019.
  • Zou et al. [2023] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.

Supplementary Material

10 Disentanglement Formulation

We visualize the rationale behind our disentangled formulation for sliders. When training sliders on a single pair of prompts, the directions are sometimes entangled with unintended directions. For example, as we show in Figure 11, controlling age can interfere with gender or race. We therefore propose using multiple paired prompts for finding a disentangled direction. As shown in Figure 12, we explicitly define the preservation directions (dotted blue lines) to find a new edit direction (solid blue line) invariant to the preserved features.

Figure 12: In this schematic we illustrate how multiple preservation concepts are used to disentangle a direction. For the sake of clarity in the figure, we show examples for just two races. In practice, we preserve a diversity of several protected attribute directions.

11 SDEdit Analysis

We ablate SDEdit’s contribution by fixing the slider scale while varying SDEdit timesteps over 2,500 images. Figure 13 shows inverse trends between LPIPS and CLIP distances as SDEdit time increases. Using more SDEdit timesteps maintains structure, as evidenced by a lower LPIPS score, while yielding a lower CLIP score change. This enables larger slider scales before risking structural changes. We notice that on average, timesteps 750-850 offer the best of both worlds, preserving spatial structure while increasing efficacy.

Figure 13: The plot examines CLIP score change and LPIPS distance when applying the same slider scale but with increasing SDEdit times. Higher timesteps enhance concept attributes considerably per CLIP while increased LPIPS demonstrates change in spatial stability. On the x-axis, 0 corresponds to no slider application while 1000 represents switching from start.

12 Textual Concept Sliders

We quantify slider efficacy and control via CLIP score change and LPIPS distance over 15 sliders at 12 scales in Figure 14. CLIP score change validates concept modification strength. Tighter LPIPS distributions demonstrate precise spatial manipulation without distortion across scales. We show additional qualitative examples for textual concept sliders in Figures 27-32.

Figure 14: Analyzing attribute isolation efficacy vs stylistic variation for 15 slider types across 12 scales. We divide our figure into two columns. The left column contains concepts that have words for antonyms (e.g. expensive - cheap), showing symmetric CLIP score deltas up/down. The right column shows harder-to-negate sliders (e.g. no glasses), causing a clipped negative range. We also note that certain sliders have higher LPIPS, such as the “cluttered” room slider, which intuitively makes sense.

12.1 Baseline Details

We compare our method against Prompt-to-Prompt and a novel inference-time prompt composition method. For Prompt-to-Prompt we use the official implementation code (https://github.com/google/prompt-to-prompt/). We use the Refinement strategy they propose, where a new token is added to the existing prompt for image editing. For example, for the images in Figure 15, we add the token “old” to the original prompt “picture of person” to make it “picture of old person”. For the composition method, we use the principles from Liu et al. (https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/). Specifically, we compose the score functions coming from both “picture of person” and “old person” through additive guidance. We also utilize the SDEdit technique for this method to allow finer image editing.
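A hedged sketch of this inference-time composition baseline follows: the scores from the base prompt and the attribute prompt are added on top of the unconditional score at every denoising step, in the spirit of Liu et al. [28]. The UNet call signature and the guidance weights are assumptions for illustration.

import torch

@torch.no_grad()
def composed_noise_prediction(unet, x_t, t, c_uncond, c_base, c_attr, g_base=7.5, g_attr=3.0):
    eps_u = unet(x_t, c_uncond, t)                 # unconditional prediction
    eps_b = unet(x_t, c_base, t)                   # e.g. "picture of person"
    eps_a = unet(x_t, c_attr, t)                   # e.g. "old person"
    return eps_u + g_base * (eps_b - eps_u) + g_attr * (eps_a - eps_u)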

12.2 Entanglement

The baselines are sometimes prone to interference with other concepts when editing a particular concept. Table 4 shows a quantitative analysis of interference, while Figure 15 shows some qualitative examples. We find that Prompt-to-Prompt and inference-time composition can sometimes change the race or gender when editing age. Our sliders with the disentanglement objective (Eq. 8) show minimal interference, as measured by the Interference metric, which reports the percentage of samples with race or gender changed out of the 2,500 images we tested. The LPIPS metric also shows that our method offers finer editing capabilities. The qualitative samples in Figure 15 lead to similar conclusions: P2P and composition can alter gender, race, or both when controlling age.
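The Interference metric counts how often a protected attribute flips after the edit. The sketch below shows one way such flips could be detected with a CLIP zero-shot classifier; the choice of classifier and the prompt sets are assumptions for illustration, not a specification of the exact attribute detector behind Table 4.

```python
import torch, open_clip
from PIL import Image

clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

GENDER = ["a photo of a man", "a photo of a woman"]
RACE = ["a photo of a white person", "a photo of a black person",
        "a photo of an asian person", "a photo of an indian person"]

def zero_shot_label(image: Image.Image, prompts) -> int:
    with torch.no_grad():
        img = clip_model.encode_image(preprocess(image).unsqueeze(0))
        txt = clip_model.encode_text(tokenizer(prompts))
        img, txt = img / img.norm(dim=-1, keepdim=True), txt / txt.norm(dim=-1, keepdim=True)
        return int((img @ txt.T).argmax())

def interference(pairs) -> float:
    """Fraction of (original, edited) pairs whose predicted gender or race label
    flips after the age edit."""
    flips = sum(zero_shot_label(a, GENDER) != zero_shot_label(b, GENDER)
                or zero_shot_label(a, RACE) != zero_shot_label(b, RACE)
                for a, b in pairs)
    return flips / len(pairs)
```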

                 P2P     Composition   Ours
ΔCLIP            1.10    3.14          3.93
LPIPS            0.15    0.13          0.06
Interference     0.33    0.38          0.10
Table 4: The disentanglement formulation enables precise control over the age direction, as shown by the significant reduction in the Interference metric, which measures the percentage of samples with a gender/race change compared to the original images. By using LoRA adaptors, sliders achieve finer editing in terms of both structure and edit direction, as evidenced by improvements in LPIPS and Interference. Concept strength is maintained, with similar ΔCLIP scores across ablations.
Figure 15: Concept Sliders demonstrate minimal entanglement when controlling a concept. Prompt-to-prompt and inference-time textual composition sometimes tend to alter race/gender when editing age.

13 Visual Concept

13.1 Baseline Details

We compare our method to two image customization baselines: Custom Diffusion (https://github.com/adobe-research/custom-diffusion) and Textual Inversion (https://github.com/rinongal/textual_inversion). For a fair comparison, we use the official implementations of both, modifying Textual Inversion to support SDXL. These baselines learn concepts from concept-labeled image sets. However, this approach risks entangling concepts with irrelevant attributes (e.g. hair, skin tone) that correlate spuriously in the dataset, limiting diversity.

13.2 Precise Concept Capturing

Figure 16 shows non-cherry-picked customization samples from all methods trained on the large-eyes Ostris dataset (https://github.com/ostris/ai-toolkit). While exhibiting some diversity, the samples frequently include irrelevant attributes that correlate with large eyes in the dataset, e.g. blonde hair in Custom Diffusion and blue eyes in Textual Inversion. In contrast, our paired-image training isolates concepts by exposing only local attribute changes, avoiding spurious correlation learning.
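The sketch below gives one plausible paired-image training loop consistent with this idea: the LoRA is asked to denoise the edited image when applied at +1 and the original image at -1, so that only the local difference between the pair is captured. The encode and set_scale helpers and the exact loss pairing are assumptions for illustration, not the verbatim training recipe.

```python
import torch
import torch.nn.functional as F

def train_visual_slider(unet_lora, scheduler, encode, pairs, set_scale, steps=1000, lr=1e-4):
    """pairs: list of (before_image, after_image, text_condition) tuples.
    `encode` maps an image to VAE latents; `set_scale` switches the multiplier
    applied to the LoRA delta (both assumed helpers)."""
    params = [p for p in unet_lora.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for step in range(steps):
        before, after, cond = pairs[step % len(pairs)]
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
        noise = torch.randn_like(encode(after))
        loss = 0.0
        for scale, image in ((+1.0, after), (-1.0, before)):
            set_scale(unet_lora, scale)                      # positive scale targets the edited image
            noisy = scheduler.add_noise(encode(image), noise, t)
            pred = unet_lora(noisy, t, encoder_hidden_states=cond).sample
            loss = loss + F.mse_loss(pred, noise)
        opt.zero_grad(); loss.backward(); opt.step()
```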

Figure 16: Concept Sliders demonstrate more diverse outputs while also being effective at learning the new concepts. Customization methods can sometimes tend to learn unintended concepts like hair and eye colors.

14 Composing Sliders

We show a two-dimensional slider by composing the “cooked” and “fine dining” food sliders in Figure 17. Next, we show the progressive composition of sliders, one at a time, in Figures 18 and 19. Starting from the top-left image (original SDXL), we progressively generate images by composing an additional slider at each step. This shows how our sliders provide semantic control over images.
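The sketch below shows how such a stack can be assembled at inference by activating several LoRA adapters at once; the slider file names and weights are illustrative, and we assume the diffusers multi-adapter interface applies each LoRA delta scaled by its weight, with the sign selecting the direction.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical slider files, each loaded as a named adapter and composed with per-slider weights.
pipe.load_lora_weights("cooked_slider.safetensors", adapter_name="cooked")
pipe.load_lora_weights("fine_dining_slider.safetensors", adapter_name="fine_dining")
pipe.set_adapters(["cooked", "fine_dining"], adapter_weights=[1.5, -1.0])

image = pipe("thanksgiving dinner on a table", num_inference_steps=30).images[0]
image.save("thanksgiving_composed.png")
```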

Figure 17: Composing two text-based sliders results in complex control over Thanksgiving food options. We show the effect of applying both the “cooked” slider and the “fine dining” slider to a generated image of a Thanksgiving dinner. These sliders can be used in both positive and negative directions.
Figure 18: Concept Sliders can be composed for a more nuanced and complex control over attributes in an image. From stable diffusion XL image on the top left, we progressively compose a slider on top of the previously added stack of sliders. By the end, bottom right, we show the image by composing all 10 sliders.
Figure 19: Concept Sliders can be composed for a more nuanced and complex control over attributes in an image. From stable diffusion XL image on the top left, we progressively compose a slider on top of the previously added stack of sliders. By the end, bottom right, we show the image by composing all 10 sliders.

15 Editing Real Images

Concept Sliders can also be used to edit real images. Manually engineering a prompt to generate an image similar to a given real image is very difficult. We therefore use null-text inversion (https://null-text-inversion.github.io), which fine-tunes the unconditional text embedding used in classifier-free guidance during inference. This allows us to reconstruct the real image as if it were generated by the diffusion model. Figure 20 shows Concept Sliders applied to real images to precisely control their attributes.
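The condensed sketch below outlines the null-text inversion step we rely on: given the DDIM inversion trajectory of the real image, a per-timestep copy of the null (unconditional) embedding is optimized so that guided sampling retraces that trajectory; a slider can then be enabled during the subsequent edit. Helper names and the update loop follow the null-text inversion paper in spirit and are not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def null_text_inversion(pipe, latents_traj, cond_emb, num_inner=10, guidance=7.5, lr=1e-2):
    """latents_traj: DDIM inversion latents [z_T, ..., z_0] of the real image, aligned with
    pipe.scheduler.timesteps (assumed already set via set_timesteps).  Returns one optimized
    null embedding per timestep, later reused for editing with a Concept Slider enabled."""
    ids = pipe.tokenizer("", padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(pipe.device)
    null = pipe.text_encoder(ids)[0]
    null_embs, z = [], latents_traj[0]
    for i, t in enumerate(pipe.scheduler.timesteps):
        null_t = null.clone().requires_grad_(True)
        opt = torch.optim.Adam([null_t], lr=lr)
        target = latents_traj[i + 1]                       # inversion latent one step less noisy
        for _ in range(num_inner):
            eps_u = pipe.unet(z, t, encoder_hidden_states=null_t).sample
            eps_c = pipe.unet(z, t, encoder_hidden_states=cond_emb).sample
            eps = eps_u + guidance * (eps_c - eps_u)
            z_prev = pipe.scheduler.step(eps, t, z).prev_sample
            loss = F.mse_loss(z_prev, target)
            opt.zero_grad(); loss.backward(); opt.step()
        null_embs.append(null_t.detach())
        z = z_prev.detach()
    return null_embs
```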

Figure 20: Concept Sliders can be used to edit real images. We use the null-text inversion method to reconstruct the real image as a diffusion-model generation, then run our Concept Sliders on that generation to enable precise control of concepts.

16 Sliders to Improve Image Quality

We provide more qualitative examples for the "fix hands" slider in Figure 21. We also show additional examples for the "repair" slider in Figures 22-24.

Figure 21: Concept Sliders can be used to fix common distortions in diffusion model generated images. We demonstrate "Fix Hands" slider that can fix distorted hands.
Figure 22: Concept Sliders can be used to fix common distortions in diffusion model generated images. The repair slider enables the model to generate images that are more realistic and undistorted.
Figure 23: Concept Sliders can be used to fix common distortions in diffusion model generated images. The repair slider enables the model to generate images that are more realistic and undistorted.
Figure 24: Concept Sliders can be used to fix common distortions in diffusion model generated images. The repair slider enables the model to generate images that are more realistic and undistorted.

16.1 Details about User Studies

We conduct two human evaluations analyzing our “repair” and “fix hands” sliders. For “fix hands”, we generate 150 images each from SDXL and from our slider, using matched seeds and prompts. We randomly show each image to an odd number of users and have them select issues with the hands: 1) misplaced/distorted fingers, 2) incorrect number of fingers, or 3) no issues. As shown in Figure 25, 62% of the 150 SDXL images have hand issues as rated by a majority of users. In contrast, only 22% of our method’s images have hand issues, validating the effectiveness of our fine-grained control.

Figure 25: User study interface on Amazon Mechanical Turk. Users are shown images randomly sampled from either SDXL or our “fix hands” slider method, and asked to identify hand issues or mark the image as free of errors. Aggregate ratings validate localization capability of our finger control sliders. For the example shown above, users chose the option “Fingers in wrong place”

We conduct an A/B test to evaluate the efficacy of our proposed “repair” slider. The test set consists of 300 image pairs (Fig. 26), where each pair contains an original image alongside the output of our method applied to that image with the same random seed. The left/right placement of the two images is randomized. Through an online user study, we ask raters to select the image in each pair that exhibits fewer flaws or distortions, and to describe the reasoning behind their choice as a sanity check. For example, one rater selected the original image in Fig. 22.a, commenting that “The left side image is not realistic because the chair is distorted.” Similarly, a user commented “Giraffes heads are separate unlikely in other image” for Fig. 23.c. Across all 300 pairs, our “repair” slider output is preferred as having fewer artifacts by 80.39% of raters. This demonstrates that the slider effectively reduces defects relative to the original. We manually filter out responses with generic comments (e.g., “more realistic”), as the sanity check prompts raters for specific reasons. After this filtering, 250 pairs remain for analysis.

Figure 26: Interface for our "realistic" slider user study. Users are shown an original SDXL image and the corresponding output from our slider, with left/right placement randomized. Users select the image they find more photorealistic and describe their rationale as a sanity check. For example, one user selected the slider image as more realistic in the shown example, commenting “The black-haired boy’s face, right arm and left foot are distorted in right image.” Another user also chose the slider output, noting “The right side image has a floating head”. Asking raters to give reasons aims to reduce random selections.
Figure 27: We demonstrate the effects of modifying an image with different sliders like “curly hair”, “surprised”, “chubby”. Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure.
Figure 28: We demonstrate style sliders for "pixar", "realistic details", "clay", and "sculpture". Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure.
Figure 29: We demonstrate weather sliders for "delightful", "dark", "tropical", and "winter". For "delightful", we notice that the model sometimes makes the weather bright or adds festive decorations. For "tropical", it adds tropical plants and trees. Finally, for "winter", it adds snow.
Figure 30: We demonstrate sliders to add attributes to people like "glasses", "muscles", "beard", and "long hair". Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure.
Figure 31: We demonstrate sliders to control attributes of vehicles like “rusty”, “futuristic”, “damaged”. Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure.
Figure 32: Our sliders can also be used to control styles of furniture like “royal”, “Modern”. Our text-based sliders allow precise editing of desired attributes during image generation while maintaining the overall structure.