

License: CC BY 4.0
arXiv:2403.14468v4 [cs.CV] 03 Nov 2024

AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

♠†Max Ku, ♠†Cong Wei, ♠†Weiming Ren, Harry Yang, ♠†Wenhu Chen
University of Waterloo, Vector Institute, Harmony.AI
{m3ku, c58wei, w2ren, wenhuchen}@uwaterloo.ca
Abstract

In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks. The code is available at https://github.com/TIGER-AI-Lab/AnyV2V.

1 Introduction

The development of deep generative models (Ho et al., 2020) has led to significant advancements in content creation and manipulation, especially in digital images (Rombach et al., 2022; Nichol et al., 2022; Brooks et al., 2023; Ku et al., 2024; Chen et al., 2023c; Li et al., 2023). However, video generation and editing have not reached the same level of advancement as images (Wang et al., 2023; Chen et al., 2023a; 2024; Ho et al., 2022). In the context of video editing, training a large-scale video editing model presents considerable challenges due to the scarcity of paired data and the substantial computational resources required.

To overcome these challenges, researchers have proposed various approaches (Geyer et al., 2023; Cong et al., 2023; Wu et al., 2023a; b; Liu et al., 2023a; Liang et al., 2023; Gu et al., 2023b; Cong et al., 2023; Zhang et al., 2023d; Qi et al., 2023; Ceylan et al., 2023; Yang et al., 2023; Jeong & Ye, 2023; Guo et al., 2023), which can be categorized into two types: (1) zero-shot adaptation of pre-trained text-to-image (T2I) models or (2) fine-tuned motion modules on top of pre-trained T2I or text-to-video (T2V) models. The zero-shot methods (Geyer et al., 2023; Cong et al., 2023; Jeong & Ye, 2023; Zhang et al., 2023d) usually suffer from flickering issues due to a lack of temporal understanding. On the other hand, fine-tuning methods (Wu et al., 2023a; Chen et al., 2023b; Wu et al., 2023b; Gu et al., 2023b; Guo et al., 2023) require more time and computational overhead to edit videos. Moreover, all of these methods support only certain types of edits. For example, a user might want to perform edits with visuals that are out-of-distribution for the learned text encoder (e.g. changing a person into a character from their own artwork). Thus, a highly customizable solution is more desirable for video editing applications. Ideally, a method would allow a seamless combination of human artistic input and AI assistance, preserving creators' creativity while producing high-quality outputs.

Figure 1: AnyV2V is an evolving framework to handle all types of video-to-video editing tasks without any parameter tuning. AnyV2V disentangles video editing into two simpler problems: (1) Single image editing and (2) Image-to-video generation with video reference.

With the above motivations, we aim to develop a video editing framework that requires no fine-tuning and caters to user demand. In this work, we present AnyV2V, designed to enhance the controllability of zero-shot video editing by decomposing the task into two pivotal stages:

  1. Apply an image edit to the first frame with any off-the-shelf image editing model or human effort.

  2. Leverage the innate knowledge of the image-to-video (I2V) model to generate the edited video with the edited first frame, the source video latents, and the intermediate temporal features.

Our objective is to propagate the edited first frame across the entire video while ensuring alignment with the source video. To achieve this, we employ I2V models (Zhang et al., 2023c; Chen et al., 2023d; Ren et al., 2024) and perform DDIM inversion to enable first-frame conditioning. With the inverted latents as the initial noise and the modified first frame as the conditional signal, the I2V model can generate videos that are not only faithful to the edited first frame but also follow the appearance and motion of the source video. To further enforce consistency of appearance and motion with the source video, we perform feature injection in the I2V model. The two-stage editing process effectively offloads the editing operation to existing image editing tools. This design (detailed in Section 4) helps AnyV2V excel in:

  • Compatibility: It provides a highly customizable interface for users to perform video edits in any modality. It can seamlessly integrate any image editing method to perform diverse edits.

  • Simplicity: Unlike previous works (Gu et al., 2023b; Wu et al., 2023b; Ouyang et al., 2023), it requires neither fine-tuning nor additional video features to achieve high appearance and temporal consistency in video editing tasks.

From our findings, without any fine-tuning, AnyV2V can perform video editing tasks beyond the scope of current publicly available methods, such as reference-based style transfer, subject-driven editing, and identity manipulation. AnyV2V can also perform prompt-based editing and achieves superior results on common video editing evaluation metrics compared to the baseline models (Geyer et al., 2023; Cong et al., 2023; Wu et al., 2023b). We show both quantitatively and qualitatively that our method outperforms existing SOTA baselines in Section 5.3 and Appendix B. AnyV2V is favoured in 69.7% of samples for prompt alignment and receives a 46.2% overall preference in human evaluation, while the best baseline only achieves 31.7% and 20.7%, respectively. Our method also reaches the highest CLIP-Text score of 0.2932 for text alignment and a competitive CLIP-Image score of 0.9652 for temporal consistency.

All these achievements stem from AnyV2V's design, which harnesses the power of off-the-shelf image editing models from advanced image editing research. Through a comprehensive study and evaluation of the effectiveness of our design, our key observation is that the inverted noise latents and feature injection serve as critical components to guide the video motion, while the I2V model itself is already capable of generating plausible motion. We also found that by inverting long videos exceeding the I2V models' training frame lengths, the inverted latents enable the I2V model to produce longer videos, making long video editing possible. To summarize, the main contributions of our work are three-fold:

  • We propose AnyV2V as the first fundamentally different solution for video editing, treating video editing as a simpler image editing problem.

  • We show that AnyV2V can support long video editing by inverting videos that extend beyond the training frame lengths of I2V models.

  • Our extensive experiments showcase the superior performance of AnyV2V compared to existing SOTA methods, highlighting the potential of leveraging I2V models for video editing.

2 Related Works

Video generation has attracted considerable attention within the field (Chen et al., 2023a; 2024; OpenAI; Wang et al., 2023; Hong et al., 2022; Zhang et al., 2023a; Henschel et al., 2024; Wang et al., 2024c; Xing et al., 2023; Chen et al., 2023d; Bar-Tal et al., 2024; Ren et al., 2024; Zhang et al., 2023c). However, video manipulation also represents a significant and popular area of interest. Initial attempts, such as Tune-A-Video (Wu et al., 2023b) and VideoP2P (Liu et al., 2023b), involved fine-tuning a text-to-image model to achieve video editing by learning the continuous motion. Concurrent works at that time, such as Pix2Video (Ceylan et al., 2023) and Fate-Zero (Qi et al., 2023), adopted a zero-shot approach, which leverages the inverted latents from a text-to-image model to retain both structural and motion information and progressively propagates the edit to the other frames. Subsequent developments have enhanced the results but generally follow these two paradigms (Geyer et al., 2023; Wu et al., 2023a; Cong et al., 2023; Yang et al., 2023; Ceylan et al., 2023; Ouyang et al., 2023; Guo et al., 2023; Gu et al., 2023b; Esser et al., 2023; Chen et al., 2023b; Jeong & Ye, 2023; Zhang et al., 2023d; Cheng et al., 2023). Control-A-Video (Chen et al., 2023b) and ControlVideo (Zhang et al., 2023d) leveraged ControlNet (Zhang et al., 2023b) for extra spatial guidance. TokenFlow (Geyer et al., 2023) leveraged the nearest neighbor field and inverted latents to achieve temporally consistent edits. Fairy (Wu et al., 2023a) followed both paradigms: it fine-tuned a text-to-image model and also cached the attention maps to propagate the frame edits. VideoSwap (Gu et al., 2023b) requires additional parameter tuning and video feature extraction (e.g. tracking, point correspondence) to ensure appearance and temporal consistency. CoDeF (Ouyang et al., 2023) allows the first image edit to propagate to the other frames with one-shot tuning. UniEdit (Bai et al., 2024) leverages inverted latents and feature map injection to achieve a wide range of video editing with a pre-trained text-to-video model (Wang et al., 2023).

Table 1: Comparison with different video editing methods and the types of editing tasks they support (prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation), whether each method is tuning-free, and its backbone.

Method | Backbone
Tune-A-Video (Wu et al., 2023b) | Stable Diffusion
Pix2Video (Ceylan et al., 2023) | SD-Depth
Gen-1 (Esser et al., 2023) | Stable Diffusion
TokenFlow (Geyer et al., 2023) | Stable Diffusion
FLATTEN (Cong et al., 2023) | Stable Diffusion
Fairy (Wu et al., 2023a) | Stable Diffusion
ControlVideo (Zhang et al., 2023d) | ControlNet
CoDeF (Ouyang et al., 2023) | ControlNet
VideoSwap (Gu et al., 2023b) | AnimateDiff
UniEdit (Bai et al., 2024) | Any T2V Models
AnyV2V (Ours) | Any I2V Models

However, none of these methods offers precise control to users, as the edits may not align with the user's exact intentions or desired level of detail, often due to the ambiguity of natural language and the constraints of the model's capabilities. For example, VideoP2P (Liu et al., 2023a) is restricted to word-swapping prompts due to its reliance on cross-attention. There is a clear need for a more precise and comprehensive solution for video editing tasks. AnyV2V is the first work to support such a diverse array of video editing tasks. We compare AnyV2V with the existing methods in Table 1. As can be seen, our method excels in its applicability and compatibility.

3 Preliminary

3.1 Image-to-Video (I2V) Generation Models

In this work, we focus on leveraging latent diffusion-based (Rombach et al., 2022) I2V generation models for video editing. Given an input first frame $I_1$, a text prompt $\mathbf{s}$ and a noisy video latent $\mathbf{z}_t$ at time step $t$, I2V generation models recover a less noisy latent $\mathbf{z}_{t-1}$ using a denoising model $\epsilon_\theta(\mathbf{z}_t, I_1, \mathbf{s}, t)$ conditioned on both $I_1$ and $\mathbf{s}$. The denoising model $\epsilon_\theta$ contains a set of spatial and temporal self-attention layers, where the self-attention operation can be formulated as:

$Q = W^Q z, \quad K = W^K z, \quad V = W^V z,$    (1)
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$    (2)

where $z$ is the input hidden state to the self-attention layer and $W^Q$, $W^K$ and $W^V$ are learnable projection matrices that map $z$ onto query, key and value vectors, respectively. For spatial self-attention, $z$ represents a sequence of spatial tokens from each frame. For temporal self-attention, $z$ is composed of tokens located at the same spatial position across all frames.
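As a minimal illustration of Equations (1)–(2) and of how spatial and temporal self-attention differ only in how the video tokens are arranged, consider the following PyTorch sketch (the tensor shapes and names are illustrative assumptions, not code from any specific I2V model):

```python
import torch

def self_attention(z, W_q, W_k, W_v):
    # z: (batch, tokens, dim); implements Eq. (1)-(2)
    q, k, v = z @ W_q, z @ W_k, z @ W_v
    attn = torch.softmax(q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v

# A video hidden state with b videos, f frames, h*w spatial positions, c channels.
b, f, h, w, c = 1, 16, 32, 32, 320
x = torch.randn(b, f, h * w, c)
W_q, W_k, W_v = (torch.randn(c, c) for _ in range(3))

# Spatial self-attention: tokens are the h*w positions within each frame.
out_spatial = self_attention(x.reshape(b * f, h * w, c), W_q, W_k, W_v)

# Temporal self-attention: tokens are the f frames at each spatial position.
out_temporal = self_attention(
    x.permute(0, 2, 1, 3).reshape(b * h * w, f, c), W_q, W_k, W_v
)
```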

3.2 DDIM Inversion

The denoising process of I2V generation models from $\mathbf{z}_t$ to $\mathbf{z}_{t-1}$ can be achieved using the DDIM (Song et al., 2020) sampling algorithm. The reverse process of DDIM sampling, known as DDIM inversion (Mokady et al., 2023; Dhariwal & Nichol, 2021), allows obtaining $\mathbf{z}_{t+1}$ from $\mathbf{z}_t$ such that $\mathbf{z}_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_t}}\mathbf{z}_t + \left(\sqrt{\frac{1}{\alpha_{t+1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1}\right)\cdot\epsilon_\theta(\mathbf{z}_t, x_0, \mathbf{s}, t)$, where $\alpha_t$ is derived from the variance schedule of the diffusion process.
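A minimal sketch of this inversion update, assuming a precomputed tensor `alphas_cumprod` holding the cumulative schedule values $\alpha_t$ and a noise prediction `eps` already obtained from the denoiser:

```python
import torch

@torch.no_grad()
def ddim_inversion_step(z_t, t, t_next, eps, alphas_cumprod):
    """One deterministic DDIM inversion step: z_t -> z_{t+1}.

    eps is the noise prediction eps_theta(z_t, x_0, s, t); alphas_cumprod is a
    1-D tensor with the cumulative alpha_t values of the diffusion schedule.
    """
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    return (a_next / a_t).sqrt() * z_t + (
        (1.0 / a_next - 1.0).sqrt() - (1.0 / a_t - 1.0).sqrt()
    ) * eps
```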

3.3 Plug-and-Play (PnP) Diffusion Features

Tumanyan et al. (2023a) proposed PnP diffusion features for image editing, based on the observation that the intermediate convolution features $f$ and self-attention scores $A = \mathrm{Softmax}(\frac{QK^\top}{\sqrt{d}})$ in a text-to-image (T2I) denoising U-Net capture the semantic regions (e.g. legs or torso of a human body) during the image generation process.

Given an input source image $I^S$ and a target prompt $P$, PnP first performs DDIM inversion to obtain the image's corresponding noise $\{\mathbf{z}^S_t\}_{t=1}^{T}$ at each time step $t$. It then collects the convolution features $\{f^l_t\}$ and attention scores $\{A^l_t\}$ from some predefined layers $l$ at each time step $t$ of the backward diffusion process $\mathbf{z}^S_{t-1} = \epsilon_\theta(\mathbf{z}^S_t, \varnothing, t)$, where $\varnothing$ denotes the null text prompt during denoising.

To generate the edited image $I^*$, PnP starts from the initial noise of the source image (i.e. $\mathbf{z}^*_T = \mathbf{z}^S_T$) and performs feature injection during denoising: $\mathbf{z}^*_{t-1} = \epsilon_\theta(\mathbf{z}^*_t, P, t, \{f^l_t, A^l_t\})$, where $\epsilon_\theta(\cdot, \cdot, \cdot, \{f^l_t, A^l_t\})$ represents the operation of replacing the intermediate features and attention scores $\{f^{l*}_t, A^{l*}_t\}$ with $\{f^l_t, A^l_t\}$. This feature injection mechanism ensures that $I^*$ preserves the layout and structure of $I^S$ while reflecting the description in $P$. To control the feature injection strength, PnP also employs two thresholds $\tau_f$ and $\tau_A$ such that the features and attention scores are only injected in the first $\tau_f$ and $\tau_A$ denoising steps. Our method extends this feature injection mechanism to I2V generation models, where we inject features in the convolution, spatial attention, and temporal attention layers. We show the detailed design of AnyV2V in Section 4.
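In implementation terms, this kind of injection is commonly realized with forward hooks that cache a layer's output while denoising the source and override it while denoising the edit; the wrapper below is a generic sketch under that assumption (the layer path in the usage comment is hypothetical), not the exact PnP or AnyV2V code.

```python
import torch

class FeatureInjector:
    """Cache a module's output on the source branch; replace it on the edit branch."""

    def __init__(self, module):
        self.cache, self.mode = None, "off"   # "collect", "inject", or "off"
        module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        if self.mode == "collect":
            self.cache = output.detach()
        elif self.mode == "inject" and self.cache is not None:
            return self.cache                 # returning a value overrides the output
        return output

# Usage sketch (unet is a denoising U-Net; the chosen layer is an assumption):
# injector = FeatureInjector(unet.up_blocks[1].resnets[0])
# injector.mode = "collect"; unet(z_src, t, ...)    # source pass at step t
# injector.mode = "inject";  unet(z_edit, t, ...)   # edit pass, only while t is
#                                                   # within the first tau_f steps
```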

4 AnyV2V

Figure 2: AnyV2V takes a source video $V^S$ as input. In the first stage, we apply a black-box image editing method to the first frame $I_1$ according to the editing task. In the second stage, the source video is inverted to the initial noise $z_T^S$, which is then denoised using DDIM sampling. During the sampling process, we extract spatial convolution, spatial attention, and temporal attention features from the I2V model's decoder layers. To generate the edited video, we perform DDIM sampling with $z_T^*$ fixed to $z_T^S$ and use the edited first frame as the conditional signal. During sampling, we inject the features and attention into the corresponding layers of the model.

Our method presents a two-stage approach to video editing. Given a source video $V^S = \{I_1, I_2, I_3, \ldots, I_n\}$, where $I_i$ is the frame at time $i$ and $n$ denotes the video length, we first extract the initial frame $I_1$ and pass it into an image editing model $\phi_{\text{img}}$ to obtain an edited first frame $I^*_1 = \phi_{\text{img}}(I_1, C)$, where $C$ denotes the auxiliary conditions for image editing models such as text prompts, masks, styles, etc. In the second stage, we feed the edited first frame $I^*_1$ and a target prompt $\mathbf{s}^*$ into an I2V generation model $\epsilon_\theta$ and employ the inverted latents from the source video $V^S$ to guide the generation process such that the edited video $V^*$ follows the motion of the source video $V^S$, the semantic information represented in the edited first frame $I^*_1$, and the target prompt $\mathbf{s}^*$. An overall illustration of our video editing pipeline is shown in Figure 2. In this section, we explain each core component of our method.
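A high-level sketch of the two-stage pipeline is given below; every callable is passed in as a parameter because the concrete editors, inversion routine, and injection-aware sampler are described in Sections 4.1–4.5 and Section 5.1, so the names here are placeholders rather than a released API.

```python
def anyv2v_edit(source_frames, edit_condition, target_prompt,
                phi_img, ddim_invert, collect_features, sample_with_injection):
    """Two-stage AnyV2V sketch: (1) first-frame editing, (2) guided I2V sampling."""
    # Stage 1: edit the first frame I_1 with any off-the-shelf image editor.
    edited_first_frame = phi_img(source_frames[0], edit_condition)

    # Stage 2a: DDIM-invert the source video V^S (first-frame conditioned,
    # null text prompt) to obtain the per-step latents {z_t^S}.
    inverted_latents = ddim_invert(source_frames, first_frame=source_frames[0])

    # Stage 2b: re-denoise the source latents to cache convolution / spatial /
    # temporal attention features, then sample the edited video V^* from the
    # inverted noise with the edited first frame as the condition and the
    # cached features injected.
    source_features = collect_features(inverted_latents,
                                       first_frame=source_frames[0])
    return sample_with_injection(inverted_latents,
                                 first_frame=edited_first_frame,
                                 prompt=target_prompt,
                                 features=source_features)
```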

4.1 Flexible First Frame Editing

In visual manipulation, controllability is a key element in performing precise editing. AnyV2V enables more controllable video editing by utilizing image editing models to modify the video’s first frame. This strategic approach enables highly accurate modifications in the video and is compatible with a broad spectrum of image editing models, including other deep learning models that can perform image style transfer (Gatys et al., 2015; Ghiasi et al., 2017; Lötzsch et al., 2022; Wang et al., 2024a), mask-based image editing (Nichol et al., 2022; Avrahami et al., 2022), image inpainting (Suvorov et al., 2021; Ku et al., 2022), identity-preserving image editing (Wang et al., 2024b), and subject-driven image editing (Chen et al., 2023c; Li et al., 2023; Gu et al., 2023a).
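For instance, a prompt-based first-frame edit can be produced with the publicly released InstructPix2Pix model through the diffusers library; the sketch below is only an example of this stage (the file names, instruction, and guidance values are illustrative), and any other editing tool or manual edit could be substituted.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load an off-the-shelf instruction-based image editor.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

first_frame = Image.open("source_video_frame_0.png").convert("RGB")  # example path
edited_first_frame = pipe(
    prompt="turn the man into a robot",   # example editing instruction
    image=first_frame,
    num_inference_steps=50,
    image_guidance_scale=1.5,             # illustrative guidance settings
    guidance_scale=7.5,
).images[0]
edited_first_frame.save("edited_frame_0.png")
```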

4.2 Structural Guidance using DDIM Inversion

To ensure that the videos generated by the I2V model follow the general structure of the source video, we employ DDIM inversion to obtain the latent noise of the source video at each time step $t$. Specifically, we perform the inversion without the text prompt condition but with the first frame condition. Formally, given a source video $V^S = \{I_1, I_2, I_3, \ldots, I_n\}$, we obtain the inverted latent noise for time step $t$ as:

$\mathbf{z}^S_t = \mathrm{DDIM\_Inv}(\epsilon_\theta(\mathbf{z}_{t+1}, I_1, \varnothing, t)),$    (3)
where $\mathrm{DDIM\_Inv}(\cdot)$ denotes the DDIM inversion operation described in Section 3.2. In ideal cases, the latent noise $\mathbf{z}^S_T$ at the final time step $T$ (the initial noise of the source video) should be used as the initial noise for sampling the edited videos. In practice, we find that due to the limited capability of certain I2V models, the edited videos denoised from the last time step are sometimes distorted. Following Li et al. (2023), we observe that starting the sampling from an earlier time step $T' < T$ can be used as a simple workaround to fix this issue.
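A sketch of the corresponding inversion loop, reusing the `ddim_inversion_step` helper from Section 3.2; the denoiser interface `eps_model(z, first_frame, prompt, t)` is an assumption standing in for a concrete I2V model, and in practice sampling may later start from a stored latent at $T' < T$ rather than at $T$.

```python
import torch

@torch.no_grad()
def ddim_invert_video(eps_model, z0, first_frame, timesteps, alphas_cumprod):
    """Invert clean video latents z0 to noise, conditioning only on the first frame.

    timesteps is an ascending sequence (e.g. 0, ..., T); the text prompt is left
    empty (null condition) during inversion, matching Eq. (3).
    """
    z = z0
    latents = {int(timesteps[0]): z}
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = eps_model(z, first_frame, "", t)              # first-frame conditioned
        z = ddim_inversion_step(z, t, t_next, eps, alphas_cumprod)
        latents[int(t_next)] = z                            # keep z_t^S for every step
    return latents   # latents[T] approximates z_T^S; sampling may start at T' < T
```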

4.3 Appearance Guidance via Spatial Feature Injection

Our empirical observation (Section 5.4) suggests that I2V generation models already have some editing capabilities by only using the edited first frame and DDIM inverted noise as the model input. However, we find that this simple approach is often unable to correctly preserve the background in the edited first frame and the motion in the source video, as the conditional signal from the source video encoded in the inverted noise is limited.

To enforce consistency with the source video, we perform feature injection in both the convolution layers and the spatial attention layers of the denoising U-Net. During the video sampling process, we simultaneously denoise the source video using the previously collected DDIM inverted latents $\mathbf{z}^S_t$ at each time step $t$ such that $\mathbf{z}^S_{t-1} = \epsilon_\theta(\mathbf{z}^S_t, I_1, \varnothing, t)$. We preserve two types of features during source video denoising: the convolution features $f^{l_1}$ before the skip connection of the $l_1$-th residual block in the U-Net decoder, and the spatial self-attention scores $\{A_s^{l_2}\}$ from layers $l_2 = \{l_{low}, l_{low+1}, \ldots, l_{high}\}$. We collect the queries $\{Q_s^{l_2}\}$ and keys $\{K_s^{l_2}\}$ instead of directly collecting $A_s^{l_2}$, as the attention score matrices are parameterized by the query and key vectors. We then replace the corresponding features when denoising the edited video, in both the normal denoising branch and the negative prompt branch for classifier-free guidance (Ho & Salimans, 2022). We use two thresholds $\tau_{conv}$ and $\tau_{sa}$ to restrict the convolution and spatial attention injection to the first $\tau_{conv}$ and $\tau_{sa}$ steps of video sampling.
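Since the attention scores are determined by the queries and keys, injecting them amounts to computing the edit branch's attention with the source branch's $Q$ and $K$ while keeping the edit branch's values $V$; the function below is a schematic sketch of that substitution (not the actual U-Net attention module) and applies to the temporal attention layers of Section 4.4 in the same way.

```python
import torch

def attention_with_injection(z_edit, W_q, W_k, W_v, q_src=None, k_src=None):
    """Self-attention for the edit branch, optionally using source-branch Q/K.

    When q_src/k_src are provided (collected while denoising the source video),
    the attention map follows the source branch, while the values V computed
    from z_edit carry the edited appearance.
    """
    q = q_src if q_src is not None else z_edit @ W_q
    k = k_src if k_src is not None else z_edit @ W_k
    v = z_edit @ W_v
    attn = torch.softmax(q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v
```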

4.4 Motion Guidance through Temporal Feature Injection

The spatial feature injection mechanism described in Section 4.3 significantly enhances the background and overall structure consistency of the edited video. While it also helps maintain the source video motion to some degree, we observe that the edited videos will still have a high chance of containing incorrect motion compared to the source video. On the other hand, we notice that I2V generation models, or video diffusion models in general, are often initialized from pre-trained T2I models and continue to be trained on video data. During the training process, parameters in the spatial layers are often frozen or set to a lower learning rate such that the pre-trained weights from the T2I model are less affected, and the parameters in the temporal layers are more extensively updated during training. Therefore, it is likely that a large portion of the motion information is encoded in the temporal layers of the I2V generation models. Concurrent work (Bai et al., 2024) also observes that features in the temporal layers show similar characteristics with optical flow (Horn & Schunck, 1981), a pattern that is often used to describe the motion of the video.

To better reconstruct the source video motion in the edited video, we propose to also inject temporal attention features during the video generation process. Similar to the spatial attention injection, we collect the source video temporal self-attention queries $Q_t^{l_3}$ and keys $K_t^{l_3}$ from the U-Net decoder layers represented by $l_3$ and inject them into the edited video denoising branches. We also apply temporal attention injection only in the first $\tau_{ta}$ steps during sampling.
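Because each injection type has its own threshold, the per-step decision reduces to a few comparisons; a minimal sketch, with defaults mirroring the settings reported in Section 5.1 and `step_idx` counting denoising steps from 0:

```python
def injection_schedule(step_idx, T, tau_conv=0.2, tau_sa=0.2, tau_ta=0.5):
    """Decide which source features to inject at a given denoising step.

    The tau values are fractions of the total number of sampling steps T.
    """
    return {
        "conv": step_idx < tau_conv * T,      # convolution feature injection
        "spatial": step_idx < tau_sa * T,     # spatial attention Q/K injection
        "temporal": step_idx < tau_ta * T,    # temporal attention Q/K injection
    }
```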

4.5 Putting it Together

Overall, combining the spatial and temporal feature injection mechanisms, we replace the editing branch features $\{f^{*l_1}, Q_s^{*l_2}, K_s^{*l_2}, Q_t^{*l_3}, K_t^{*l_3}\}$ with the features from the source denoising branch:

$\mathbf{z}^*_{t-1} = \epsilon_\theta(\mathbf{z}^*_t, I^*, \mathbf{s}^*, t;\ \{f^{l_1}, Q_s^{l_2}, K_s^{l_2}, Q_t^{l_3}, K_t^{l_3}\}),$    (4)
where $\epsilon_{\theta}(\cdot\,;\{f^{l_{1}}, Q_{s}^{l_{2}}, K_{s}^{l_{2}}, Q_{t}^{l_{3}}, K_{t}^{l_{3}}\})$ denotes the feature replacement operation across the different layers $l_{1}, l_{2}, l_{3}$. Our proposed spatial and temporal feature injection scheme enables tuning-free adaptation of I2V generation models for video editing. Our experimental results demonstrate that each component in our design is crucial to the accurate editing of source videos. We showcase more qualitative results for the effectiveness of our model components in Section 5.
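In practice, this replacement can be realized without modifying the backbone's code by caching features during the source-branch forward pass and overwriting them during the editing-branch pass. The snippet below is a minimal PyTorch sketch of that idea using forward hooks; the module paths (`unet.decoder_blocks[l].conv`, `.spatial_attn`, `.temporal_attn`) are illustrative assumptions about the backbone's structure, and for brevity the hooks swap whole sub-module outputs rather than the individual queries and keys described above.

```python
# Minimal sketch of feature injection via PyTorch forward hooks (illustrative,
# not the authors' released implementation). Assumed layout: the U-Net exposes
# decoder blocks with `.conv`, `.spatial_attn` and `.temporal_attn` submodules.
import torch

class FeatureInjector:
    def __init__(self, unet, conv_layers=(4,), attn_layers=range(4, 12)):
        self.cache = {}        # features collected from the source denoising branch
        self.mode = "collect"  # "collect" during the source pass, "inject" during the edit pass
        self.handles = []
        for l in conv_layers:
            block = unet.decoder_blocks[l]
            self.handles.append(block.conv.register_forward_hook(self._hook(f"conv_{l}")))
        for l in attn_layers:
            block = unet.decoder_blocks[l]
            self.handles.append(block.spatial_attn.register_forward_hook(self._hook(f"sa_{l}")))
            self.handles.append(block.temporal_attn.register_forward_hook(self._hook(f"ta_{l}")))

    def _hook(self, key):
        def fn(module, inputs, output):
            if self.mode == "collect":
                self.cache[key] = output.detach()  # store the source-branch feature
                return output
            return self.cache[key]                 # replace the editing-branch feature
        return fn

    def remove(self):
        for h in self.handles:
            h.remove()

# Per denoising step t within the injection thresholds:
#   injector.mode = "collect"; eps_src  = unet(z_src_t,  first_frame,        prompt, t)
#   injector.mode = "inject";  eps_edit = unet(z_edit_t, edited_first_frame, prompt, t)
```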

5 Experiments

Figure 3: AnyV2V is robust in a wide range of prompt-based editing tasks while preserving the background. The results align the most with the text prompt and maintain high motion consistency.
Figure 4: With different image editing models, AnyV2V can achieve a wide range of editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation.

5.1 Implementation Details

We employ AnyV2V on three off-the-shelf I2V generation models: I2VGen-XL (Zhang et al., 2023c) (we use the version provided at https://huggingface.co/ali-vilab/i2vgen-xl), ConsistI2V (Ren et al., 2024) and SEINE (Chen et al., 2023d). For all I2V models, we use $\tau_{conv}=0.2T$, $\tau_{sa}=0.2T$ and $\tau_{ta}=0.5T$, where $T$ is the total number of sampling steps. We use the DDIM (Song et al., 2020) sampler and set $T$ to the default values of the selected I2V models. Following PnP (Tumanyan et al., 2023b), we set $l_{1}=4$ for convolution feature injection and $l_{2}=l_{3}=\{4,5,6,\dots,11\}$ for spatial and temporal attention injections. During sampling, we apply text classifier-free guidance (CFG) (Ho & Salimans, 2022) for all models with the same negative prompt "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" across all edits. To obtain the initial edited frames in our implementation, we use a set of image editing model candidates including the prompt-based image editing model InstructPix2Pix (Brooks et al., 2023), the style transfer model Neural Style Transfer (NST) (Gatys et al., 2015), the subject-driven image editing model AnyDoor (Chen et al., 2023c), and the identity-driven image editing model InstantID (Wang et al., 2024b). We experiment only with successfully edited frames, which is crucial for our method. We conducted all experiments on a single Nvidia A6000 GPU. Editing a 16-frame video requires around 15 GB of GPU memory and around 100 seconds for the whole inference process. We refer readers to Appendix A for more discussions on our implementation details and hyperparameter settings.
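As a concrete illustration of how these thresholds gate the injections during sampling, the sketch below maps a DDIM step index to the set of active injections; the ratios mirror the values above, while the class and method names are our own.

```python
# Illustrative injection schedule: given the DDIM step index `step` (0 = the
# first, noisiest step) and the total number of steps T, decide which features
# are injected, using tau_conv = tau_sa = 0.2T and tau_ta = 0.5T as above.
from dataclasses import dataclass

@dataclass
class InjectionConfig:
    total_steps: int
    conv_ratio: float = 0.2   # tau_conv / T
    sa_ratio: float = 0.2     # tau_sa / T
    ta_ratio: float = 0.5     # tau_ta / T

    def active(self, step: int) -> dict:
        """Return which injections are active at this denoising step."""
        return {
            "conv": step < self.conv_ratio * self.total_steps,
            "spatial_attn": step < self.sa_ratio * self.total_steps,
            "temporal_attn": step < self.ta_ratio * self.total_steps,
        }

cfg = InjectionConfig(total_steps=50)
print(cfg.active(5))    # all three injections active early in sampling
print(cfg.active(20))   # only temporal attention injection remains active
```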

Table 2: Quantitative comparisons for our AnyV2V with baselines on prompt-based video editing. Alignment and Overall are human evaluation results (prompt alignment and overall preference); CLIP-Text and CLIP-Image are automatic CLIP scores. Higher is better for all metrics.

Method              | Alignment ↑ | Overall ↑ | CLIP-Text ↑ | CLIP-Image ↑
Tune-A-Video        | 15.2%       | 2.1%      | 0.2902      | 0.9704
TokenFlow           | 31.7%       | 20.7%     | 0.2858      | 0.9783
FLATTEN             | 25.5%       | 16.6%     | 0.2742      | 0.9739
AnyV2V (SEINE)      | 28.9%       | 8.3%      | 0.2910      | 0.9631
AnyV2V (ConsistI2V) | 33.8%       | 11.7%     | 0.2896      | 0.9556
AnyV2V (I2VGen-XL)  | 69.7%       | 46.2%     | 0.2932      | 0.9652

5.2 Tasks Definition

1. Prompt-based Editing: allows users to manipulate video content using only natural language, given either as a descriptive prompt or as an instruction. With the prompt, users can perform a wide range of edits, such as incorporating accessories, spawning or swapping objects, adding effects, or altering the background.

2. Reference-based Style Transfer: In style transfer tasks, the artistic styles of Monet and Van Gogh are frequently explored, but in real-life use cases users may want a distinct style based on one particular artwork. In reference-based style transfer, we focus on using a style image as a reference to perform video editing. The edited video should capture the distinct style of the referenced artwork.

3. Subject-driven Editing: In subject-driven video editing, we aim to replace an object in the video with a target subject based on a given subject image, while maintaining the video motion and preserving the background.

4. Identity Manipulation: Identity manipulation allows the user to manipulate video content by replacing a person in the video with another person's identity, based on an input image of the target person.

5.3 Evaluation Results

As shown in Figures 3 and 4, AnyV2V can perform the following video editing tasks: (1) prompt-based editing, (2) reference-based style transfer, (3) subject-driven editing, and (4) identity manipulation. For (1) prompt-based editing, we compare AnyV2V against three baseline models: Tune-A-Video (Wu et al., 2023b), TokenFlow (Geyer et al., 2023) and FLATTEN (Cong et al., 2023). Since no publicly available baseline methods exist for tasks (2), (3) and (4), we evaluate the performance of three I2V generation models under AnyV2V. We include a more comprehensive evaluation in Appendix B.

Prompt-based Editing  基于提示的编辑

Unlike the baseline methods (Wu et al., 2023b; Geyer et al., 2023; Cong et al., 2023) that often introduce unwarranted changes not specified in the text commands, AnyV2V utilizes the precision of image editing models to ensure that only the targeted areas of the scene are altered, leaving the rest unchanged. Combined with the instruction-guided image editing model InstructPix2Pix (Brooks et al., 2023), AnyV2V accurately places a party hat on an elderly man's head and correctly paints the plane in blue. Additionally, it maintains the original video's background and fidelity, whereas the baseline methods often alter the color tone and shape of objects, as illustrated in Figure 3. For motion-related edits such as adding snowing weather, the I2V models used by AnyV2V inherently support animating the snowfall, while the baseline methods produce flickering. For quantitative evaluation, we conduct a human evaluation to measure the degree of prompt alignment and overall preference of the edited videos based on user voting, and we also compute CLIP-Text for text alignment and CLIP-Image for temporal consistency. In detail, CLIP-Text is the average cosine similarity between the CLIP (Radford et al., 2021) text embedding of the editing prompt and the CLIP image embedding of each frame, and CLIP-Image is the average cosine similarity between the CLIP image embeddings of every pair of consecutive frames. Table 2 shows that AnyV2V generally achieves high text alignment and temporal consistency, and AnyV2V with the I2VGen-XL backbone is the most preferred method because it edits the video precisely while leaving the rest of the scene intact.
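The two automatic metrics can be computed in a few lines of code. The following is a minimal sketch using the Hugging Face transformers CLIP implementation; the specific checkpoint ("openai/clip-vit-base-patch32") and the frame-loading convention are assumptions, since the paper only states that CLIP embeddings are used.

```python
# Sketch of the CLIP-Text and CLIP-Image metrics described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(frames: list, prompt: str):
    # Embed the editing prompt and every frame, then L2-normalize.
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    image_inputs = processor(images=frames, return_tensors="pt")
    text_emb = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)
    img_emb = torch.nn.functional.normalize(model.get_image_features(**image_inputs), dim=-1)

    # CLIP-Text: mean cosine similarity between the prompt and each frame.
    clip_text = (img_emb @ text_emb.T).mean().item()
    # CLIP-Image: mean cosine similarity between consecutive frame embeddings.
    clip_image = (img_emb[:-1] * img_emb[1:]).sum(dim=-1).mean().item()
    return clip_text, clip_image

# Usage: clip_scores([Image.open(p) for p in frame_paths], "a blue airplane flying")
```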

Style Transfer, Subject-Driven Editing and Identity Manipulation

For these novel tasks, we stress alignment with the reference image rather than the text prompt. As shown in Figure 4, for task (2) AnyV2V can capture a particular, tailor-made style, even if that style is not recognized by the text encoder. In the examples, AnyV2V accurately captures the style of Vassily Kandinsky's artwork "Composition VII" and Vincent Van Gogh's artwork "Chateau in Auvers at Sunset". For task (3), AnyV2V can replace the subject in the video with another subject, even if the new subject differs from the original one. In the examples, a cat is replaced with a dog according to the reference image while the motion and background remain highly aligned with the source video, and a car is replaced with the desired car while its wheels keep spinning in the edited video. For task (4), AnyV2V can swap a person's identity with anyone. We report human evaluation results in Table 5 in Appendix B and find that AnyV2V with the I2VGen-XL backbone is the most preferred method in terms of both reference alignment and overall preference.

I2V Backbones  I2V 主干网

We find that AnyV2V (I2VGen-XL) tends to be the most robust both qualitatively and quantitatively: it generalizes well and produces consistent motion with high visual quality. AnyV2V (ConsistI2V) can generate consistent motion, but watermarks sometimes appear due to its training data, harming visual quality. AnyV2V (SEINE) generalizes less well but still produces consistent, high-quality videos when the motion is simple enough, such as a person walking.

Figure 5: AnyV2V can edit videos longer than the I2V model's training length while maintaining motion consistency. The first row shows the source video frames and the second row shows the edited frames. The first frame was edited with the image model InstructPix2Pix (Brooks et al., 2023) using the prompt "turn woman into a robot".

Editing Videos beyond the Training Frames of the I2V Model. Current state-of-the-art I2V models (Chen et al., 2023d; Ren et al., 2024; Zhang et al., 2023c) are mostly trained on video data containing only 16 frames. To edit videos longer than the training length of the I2V model, an intuitive approach would be to generate videos in an auto-regressive manner as used in ConsistI2V (Ren et al., 2024) and SEINE (Chen et al., 2023d). However, we find that such a setup cannot maintain semantic consistency in our case. As many works on extending video generation exploit the initial latent to generate longer videos (Qiu et al., 2023; Wu et al., 2023c), we instead use the longer inverted latent of the full source video as the initial latent and have the I2V model generate the corresponding number of output frames. Our experiments show that the inverted latent contains enough temporal and semantic information for the generated video to maintain temporal and semantic consistency, as shown in Figure 5.
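For completeness, below is a schematic sketch of the DDIM inversion that produces such an inverted latent for a clip with an arbitrary frame count F. The `unet(latent, t, cond)` call (predicting noise), the `alphas_cumprod` table, and the timestep list are placeholders for the chosen I2V backbone and its scheduler, and the usual approximation of evaluating the noise prediction at the target timestep is assumed.

```python
# Schematic DDIM inversion for a source latent of shape (B, C, F, H, W), where
# F may exceed the I2V model's 16-frame training length.
import torch

@torch.no_grad()
def ddim_invert(latent, unet, cond, alphas_cumprod, timesteps):
    # `timesteps` is an increasing list of diffusion steps, e.g. [1, 21, ..., 981].
    prev_timesteps = [0] + list(timesteps[:-1])
    for prev_t, t in zip(prev_timesteps, timesteps):
        a_prev = alphas_cumprod[prev_t] if prev_t > 0 else torch.tensor(1.0)
        a_t = alphas_cumprod[t]
        # Common approximation: evaluate the noise prediction at the target step.
        eps = unet(latent, t, cond)
        x0 = (latent - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()  # estimated clean latent
        latent = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps          # re-noise to level t
    return latent  # inverted latent with the same frame count F as the source video
```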

5.4 Ablation Study

To verify the effectiveness of our design choices, we conduct an ablation study by iteratively disabling the three core components of our model: temporal feature injection, spatial feature injection, and the DDIM inverted latent as initial noise. We use AnyV2V (I2VGen-XL) and a subset of 20 samples in this ablation study, reporting frame-wise consistency (CLIP-Image score) in Table 3 and qualitative comparisons in Figure 6. We provide further ablation analysis of other design considerations in the Appendix.

Figure 6: Visual comparisons of AnyV2V's editing results after disabling temporal feature injection (T.I.), spatial feature injection (S.I.) and DDIM inverted initial noise (D.I.).

Effectiveness of Temporal Feature Injection. After disabling temporal feature injection in AnyV2V (I2VGen-XL), we observe a slight increase in the CLIP-Image score, but the edited videos often adhere less closely to the motion of the source video. For example, in the second frame of the "couple sitting" case (3rd row, 2nd column in the right panel of Figure 6), the motion of the woman raising her leg in the source video is not reflected in the edited video when temporal injection is not applied. On the other hand, even when the style of the video is completely changed, AnyV2V (I2VGen-XL) with temporal injection still captures this nuanced motion in the edited video.

Effectiveness of Spatial Feature Injection. As shown in Table 3, we observe a drop in the CLIP-Image score after removing the spatial feature injection mechanisms from our model, indicating that the edited videos do not progress smoothly across consecutive frames and contain more appearance and motion inconsistencies. As further illustrated in the third row of Figure 6, removing spatial feature injection often results in an incorrect subject appearance and pose (as shown in the "ballet dancing" case) and a degenerated background appearance (evident in the "couple sitting" case). These observations demonstrate that directly generating edited videos from the DDIM inverted noise is often not enough to fully preserve the source video structure, and that the spatial feature injection mechanisms are crucial for achieving better editing results.

DDIM Inverted Noise as Structural Guidance. Finally, we observe a further decrease in CLIP-Image scores and a significantly degraded visual appearance in both examples in Figure 6 after replacing the initial DDIM inverted noise with random noise during sampling. This indicates that the I2V generation models become less capable of animating the input image when the editing prompt is completely out-of-domain, and it highlights the importance of the DDIM inverted noise as the structural guidance of the edited videos.

Table 3: Ablation study results for AnyV2V (I2VGen-XL). T. Injection and S. Injection correspond to the temporal and spatial feature injection mechanisms, respectively.

Model                                                               | CLIP-Image ↑
AnyV2V (I2VGen-XL)                                                  | 0.9648
AnyV2V (I2VGen-XL) w/o T. Injection                                 | 0.9652
AnyV2V (I2VGen-XL) w/o T. Injection & S. Injection                  | 0.9637
AnyV2V (I2VGen-XL) w/o T. Injection & S. Injection & DDIM Inversion | 0.9607

6 Conclusion

In this paper, we presented AnyV2V, a novel unified framework for video editing. Our framework is training-free, highly cost-effective, and compatible with any image editing model and any I2V generation model. To perform video editing with high precision, we propose a two-stage approach that first edits the initial frame of the source video and then conditions an I2V model on the edited first frame, the source video features, and the inverted latent to produce an edited video of any length. Comprehensive experiments show that our method achieves outstanding results across a broad spectrum of applications beyond the scope of existing state-of-the-art methods, while achieving superior results on both common video metrics and human evaluation. For future work, we aim to find a tuning-free method to bring these I2V properties into T2V models so that we can leverage existing strong T2V models for video editing.

References

  • Avrahami et al. (2022) Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18208–18218, 2022.
  • Bai et al. (2024) Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, and Jiang Bian. Uniedit: A unified tuning-free framework for video motion and appearance editing. arXiv preprint arXiv:2402.13185, 2024.
  • Bar-Tal et al. (2024) Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
  • Brooks et al. (2023) Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
  • Ceylan et al. (2023) Duygu Ceylan, Chun-Hao Paul Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. 2023.
  • Chen et al. (2023a) Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  • Chen et al. (2024) Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
  • Chen et al. (2023b) Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models, 2023b.
  • Chen et al. (2023c) Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023c.
  • Chen et al. (2023d) Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:2310.20700, 2023d.
  • Cheng et al. (2023) Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset. arXiv preprint arXiv:2311.00213, 2023.
  • Cong et al. (2023) Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Esser et al. (2023) Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7346–7356, 2023.
  • Gatys et al. (2015) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015. URL http://arxiv.org/abs/1508.06576.
  • Geyer et al. (2023) Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arxiv:2307.10373, 2023.
  • Ghiasi et al. (2017) Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. CoRR, abs/1705.06830, 2017. URL http://arxiv.org/abs/1705.06830.
  • Gu et al. (2023a) Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, and Xin Eric Wang. Photoswap: Personalized subject swapping in images, 2023a.
  • Gu et al. (2023b) Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, and Kevin Tang. Videoswap: Customized video subject swapping with interactive semantic point correspondence. arXiv preprint, 2023b.
  • Guo et al. (2023) Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • Henschel et al. (2024) Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024.
  • Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ho et al. (2022) Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022.
  • Hong et al. (2022) Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  • Horn & Schunck (1981) Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • Jeong & Ye (2023) Hyeonho Jeong and Jong Chul Ye. Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv preprint arXiv:2310.01107, 2023.
  • Ku et al. (2024) Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. Imagenhub: Standardizing the evaluation of conditional image generation models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OuV9ZrkQlc.
  • Ku et al. (2022) Wing-Fung Ku, Wan-Chi Siu, Xi Cheng, and H. Anthony Chan. Intelligent painter: Picture composition with resampling diffusion model, 2022.
  • Li et al. (2023) Tianle Li, Max Ku, Cong Wei, and Wenhu Chen. Dreamedit: Subject-driven image editing, 2023.
  • Liang et al. (2023) Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, et al. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. arXiv preprint arXiv:2312.17681, 2023.
  • Liu et al. (2023a) Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023a.
  • Liu et al. (2023b) Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. arXiv:2303.04761, 2023b.
  • Lötzsch et al. (2022) Winfried Lötzsch, Max Reimann, Martin Büßemeyer, Amir Semmo, Jürgen Döllner, and Matthias Trapp. Wise: Whitebox image stylization by example-based learning. ECCV, 2022.
  • Mokady et al. (2023) Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6038–6047, 2023.
  • Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp.  16784–16804. PMLR, 2022.
  • (37) OpenAI. Video generation models as world simulators. URL https://openai.com/research/video-generation-models-as-world-simulators.
  • Ouyang et al. (2023) Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. arXiv preprint arXiv:2308.07926, 2023.
  • Qi et al. (2023) Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv:2303.09535, 2023.
  • Qiu et al. (2023) Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling, 2023.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Ren et al. (2024) Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Suvorov et al. (2021) Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
  • Tumanyan et al. (2023a) Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  1921–1930, June 2023a.
  • Tumanyan et al. (2023b) Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1921–1930, 2023b.
  • Wang et al. (2024a) Haofan Wang, Qixun Wang, Xu Bai, Zekui Qin, and Anthony Chen. Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733, 2024a.
  • Wang et al. (2024b) Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024b.
  • Wang et al. (2024c) Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024c.
  • Wang et al. (2023) Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
  • Wu et al. (2023a) Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, and Peter Vajda. Fairy: Fast parallelized instruction-guided video-to-video synthesis. arXiv preprint arXiv:2312.13834, 2023a.
  • Wu et al. (2023b) Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7623–7633, 2023b.
  • Wu et al. (2023c) Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, and Ziwei Liu. Freeinit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537, 2023c.
  • Xing et al. (2023) Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
  • Yang et al. (2023) Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In ACM SIGGRAPH Asia Conference Proceedings, 2023.
  • Zhang et al. (2023a) David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
  • Zhang et al. (2023b) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023b.
  • Zhang et al. (2023c) Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023c.
  • Zhang et al. (2023d) Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023d.

Appendix

Appendix A Discussion on Model Implementation Details

When adapting our AnyV2V to various I2V generation models, we identify two sets of hyperparameters that are crucial to the final video editing results: (1) the selection of U-Net decoder layers ($l_{1}$, $l_{2}$ and $l_{3}$) for convolution, spatial attention and temporal attention injection, and (2) the injection thresholds $\tau_{conv}$, $\tau_{sa}$ and $\tau_{ta}$ that control in which diffusion steps feature injection happens. In this section, we provide more discussion and analysis on the selection of these hyperparameters.

A.1 U-Net Layers for Feature Injection

Figure 7: Visualizations of the convolution, spatial attention and temporal attention features during video sampling for I2V generation models’ decoder layers. We feed in the DDIM inverted noise to the I2V models such that the generated videos (first row) are reconstructions of the source video.

To better understand how different layers in the I2V denoising U-Net produce features during video sampling, we perform a visualization of the convolution, spatial and temporal attention features for the three candidate I2V models I2VGen-XL (Zhang et al., 2023c), ConsistI2V (Ren et al., 2024) and SEINE (Chen et al., 2023d). Specifically, we visualize the average activation values across all channels in the output feature map from the convolution layers, and the average attention scores across all attention heads and all tokens (i.e. average attention weights for all other tokens attending to the current token). The results are shown in Figure 7.

According to the figure, we observe that the intermediate convolution features from different I2V models show similar characteristics during video generation: earlier layers in the U-Net decoder produce features that represent the overall layout of the video frames, and deeper layers capture high-frequency details such as edges and textures. We choose to set $l_{1}=4$ for convolution feature injection to inject background and layout guidance into the edited video without introducing too many high-frequency details. For spatial and temporal attention scores, we observe that the spatial attention maps tend to represent the semantic regions in the video frames while the temporal attention maps highlight the foreground moving subjects (e.g. the running woman in Figure 7). One interesting observation for I2VGen-XL is that its spatial attention operations in deeper layers almost become hard attention, as the spatial tokens only attend to a single or very few tokens in each frame. We propose to inject features in decoder layers 4 to 11 ($l_{2}=l_{3}=\{4,5,\dots,11\}$) to preserve the semantic and motion information from the source video.
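As a rough sketch of how the statistics in Figure 7 can be computed from intermediate activations, the helpers below average convolution features over channels and attention weights over heads and query tokens; the assumed tensor layouts are noted in the comments and may differ across backbones.

```python
# Helpers for the feature visualizations described above (assumed layouts:
# conv features as (B*F, C, H, W); attention probabilities as (heads, Q, K)).
import torch

def conv_feature_map(conv_out: torch.Tensor) -> torch.Tensor:
    # Average activation over channels -> one (H, W) map per frame.
    return conv_out.mean(dim=1)

def attention_saliency(attn_probs: torch.Tensor) -> torch.Tensor:
    # Average over heads, then over query tokens: for each key token, the mean
    # attention weight it receives from all other tokens.
    return attn_probs.mean(dim=0).mean(dim=0)
```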

A.2 Ablation Analysis on Feature Injection Thresholds

We perform additional ablation analysis using different feature injection thresholds to study how these hyperparameters affect the edited video.

Effect of Spatial Injection Thresholds $\tau_{conv}$, $\tau_{sa}$

We study the effect of disabling spatial feature injection or using different $\tau_{conv}$ and $\tau_{sa}$ values during video editing and show the qualitative results in Figure 8. We find that when spatial feature injection is disabled, the edited videos fail to fully adhere to the layout and motion of the source video. When the spatial feature injection thresholds are too high, the edited videos are corrupted by high-frequency details from the source video (e.g. textures from the woman's down jacket in Figure 8). Setting $\tau_{conv}=\tau_{sa}=0.2T$ achieves the desired editing outcome in our experiments.

Figure 8: Hyperparameter study on spatial feature injection. We find that $\tau_{sa}=0.2T$ is the best setting for maintaining the layout and structure in the edited video while not introducing unnecessary visual details from the source video. $\tau_{c,s}$ denotes $\tau_{conv}$ and $\tau_{sa}$. (Editing prompt: teddy bear running. The experiment was conducted with the I2VGen-XL backbone.)
Effect of Temporal Injection Threshold $\tau_{ta}$

We study the temporal feature injection threshold $\tau_{ta}$ under different settings and show the results in Figure 9. We observe that when $\tau_{ta}<0.5T$ ($T$ is the total number of denoising steps), the motion guidance is too weak and the motion is only partly aligned with the source video, even though the motion itself is plausible and smooth. When $\tau_{ta}>0.5T$, the generated video adheres more strongly to the source motion but distortion occurs. We employ $\tau_{ta}=0.5T$ in our experiments and find that this value strikes the right balance between motion alignment, motion consistency, and video fidelity.

Figure 9: Hyperparameter study on temporal feature injection. We find $\tau_{ta}=0.5T$ to be the optimal setting as it balances motion alignment, motion consistency, and video fidelity. (Editing prompt: darth vader walking. The experiment was conducted with the SEINE backbone.)

Appendix B Evaluation Detail

B.1 Quantitative Evaluations

Prompt-based Editing

For (1) prompt-based editing, we conduct a human evaluation to examine the degree of prompt alignment and overall preference of the edited videos based on user voting. We compare AnyV2V against three baseline models: Tune-A-Video (Wu et al., 2023b), TokenFlow (Geyer et al., 2023) and FLATTEN (Cong et al., 2023). The human evaluation results in Table 2 demonstrate that our model achieves the best overall preference and prompt alignment among all methods, with AnyV2V (I2VGen-XL) being the most preferred. We conjecture that the gain comes from our compatibility with state-of-the-art image editing models.

We also employ automatic evaluation metrics on our edited video of the human evaluation datasets. Following previous works (Ceylan et al., 2023; Bai et al., 2024), our automatic evaluation employs the CLIP (Radford et al., 2021) model to assess both text alignment and temporal consistency. For text alignment, we calculate the CLIP-Score, specifically by determining the average cosine similarity between the CLIP text embeddings derived from the editing prompt and the CLIP image embeddings across all frames. For temporal consistency, we evaluate the average cosine similarity between the CLIP image embeddings of every pair of consecutive frames. These two metrics are referred to as CLIP-Text and CLIP-Image, respectively. Our automatic evaluations in Table 2 demonstrate that our model is competitive in prompt-based editing compared to baseline methods.

Reference-based Style Transfer, Identity Manipulation and Subject-driven Editing

For the novel tasks (2), (3) and (4), we evaluate the performance of three I2V generation models using human evaluations and show the results in Table 5. As these tasks require reference images instead of text prompts, we focus on evaluating the reference alignment and overall preference of the edited videos. According to the results, AnyV2V (I2VGen-XL) is the best model across all tasks, underscoring its robustness and versatility in handling diverse video editing tasks. AnyV2V (SEINE) and AnyV2V (ConsistI2V) show varied performance across tasks: AnyV2V (SEINE) achieves good reference alignment in reference-based style transfer and identity manipulation but falls short in subject-driven editing with lower scores, whereas AnyV2V (ConsistI2V) shines in subject-driven editing, achieving second-best results in both reference alignment and overall preference. Since the latest image editing models have not yet reached a level of maturity that allows for consistent and precise editing (Ku et al., 2024), we also report the image editing success rate in Table 5 to clarify that our method relies on a good first-frame edit.

B.2 Qualitative Results

Prompt-based Editing

By leveraging the strength of image editing models, our AnyV2V framework provides precise control of the edits such that the irrelevant parts of the scene are untouched after editing. In our experiments, we used InstructPix2Pix (Brooks et al., 2023) for the first-frame edit. As shown in Figure 3, our method correctly places a party hat on an old man's head and successfully turns the color of an airplane to blue, while preserving the background and keeping fidelity to the source video. Compared with our work, the three baseline models TokenFlow (Geyer et al., 2023), FLATTEN (Cong et al., 2023), and Tune-A-Video (Wu et al., 2023b) display either excessive or insufficient changes in the edited video with respect to the editing text prompt, and their color tones and object shapes are also distorted. It is also worth mentioning that our approach is far more consistent on motion-related edits such as adding snowing weather, due to the I2V model's inherent support for animating still scenes. The baseline methods, on the other hand, can add snow to individual frames but cannot generate the effect of snow falling, as per-frame or one-shot editing methods lack temporal modeling.

Reference-based Style Transfer

Our approach diverges from relying solely on textual descriptors for conducting style edits, using the style transfer model NST (Gatys et al., 2015) to obtain the edited frame. This level of controllability offers artists the unprecedented opportunity to use their art as a reference for video editing, opening new avenues for creative expression. As demonstrated in Figure 4, our method captures the distinctive style of Vassily Kandinsky’s artwork “Composition VII” and Vincent Van Gogh’s artwork “Chateau in Auvers at Sunset” accurately, while such an edit is often hard to perform using existing text-guided video editing methods.

Subject-driven Editing

In our experiment, we employed a subject-driven image editing model AnyDoor (Chen et al., 2023c) for the first frame editing. AnyDoor allows replacing any object in the target image with the subject from only one reference image. We observe from Figure 4 that AnyV2V produces highly motion-consistent videos when performing subject-driven object swapping. In the first example, AnyV2V successfully replaces the cat with a dog according to the reference image and maintains highly aligned motion and background as reflected in the source video. In the second example, the car is replaced by our desired car while maintaining the rotation angle in the edited video.

Identity Manipulation

By integrating the identity-preserving image personalization model InstantID (Wang et al., 2024b) with ControlNet (Zhang et al., 2023b), this approach enables replacing an individual's identity to create an initial frame. Our AnyV2V framework then processes this initial frame to produce an edited video that swaps the person's identity, as showcased in Figure 4. To the best of our knowledge, our work is the first to provide such flexibility in video editing models. Note that the InstantID-with-ControlNet method alters the background due to the properties of the model; it is possible to apply other identity-preserving image personalization models within AnyV2V to preserve the background.

B.3 Human Evaluation

B.3.1 Dataset

Our human evaluation dataset contains a total of 89 samples collected from https://www.pexels.com. For prompt-based editing, we employed InstructPix2Pix (Brooks et al., 2023) to compose the examples; topics include swapping objects, adding objects, and removing objects. For subject-driven editing, we employed AnyDoor (Chen et al., 2023c) to replace objects with reference subjects. For neural style transfer, we employed NST (Gatys et al., 2015) to compose the examples. For identity manipulation, we employed InstantID (Wang et al., 2024b) to compose the examples. See Table 4 for the statistics.

Table 4: Number of entries in the video editing evaluation dataset.

Category                       | Number of Entries | Image Editing Model Used
Prompt-based Editing           | 45                | InstructPix2Pix
Reference-based Style Transfer | 20                | NST
Identity Manipulation          | 13                | InstantID
Subject-driven Editing         | 11                | AnyDoor
Total                          | 89                |

B.3.2 Interface

In the evaluation setup, we provide the evaluator with edited videos generated by both the baseline models and the AnyV2V models. Evaluators are tasked with selecting the videos that best align with the provided prompt or the reference image. Additionally, they are asked to express their overall preference for the edited videos. For a detailed view of the interface used in this process, please see Figure 10.

Figure 10: The interface of individual evaluation.
Table 5: Comparisons of three I2V models under the AnyV2V framework on novel video editing tasks. Align: reference alignment; Overall: overall preference.

Task                       | Reference-based Style Transfer | Subject-driven Editing | Identity Manipulation
Image Editing Method       | NST                            | AnyDoor                | InstantID
Image Editing Success Rate | ≈90%                           | ≈10%                   | ≈80%
Human Evaluation           | Align ↑ / Overall ↑            | Align ↑ / Overall ↑    | Align ↑ / Overall ↑
AnyV2V (SEINE)             | 92.3% / 30.8%                  | 48.4% / 15.2%          | 72.7% / 18.2%
AnyV2V (ConsistI2V)        | 38.4% / 10.3%                  | 63.6% / 42.4%          | 72.7% / 27.3%
AnyV2V (I2VGen-XL)         | 100.0% / 76.9%                 | 93.9% / 84.8%          | 90.1% / 45.4%

Appendix C Discussion

C.1 Limitations

Inaccurate Edits from Image Editing Models. Because our method relies on an initial frame edit, it depends on image editing models, and the current state-of-the-art models are not yet mature enough to perform accurate edits consistently (Ku et al., 2024). For example, in the subject-driven video editing task, we found that AnyDoor (Chen et al., 2023c) requires several tries to obtain a good editing result, and manual effort is needed to pick a good edited frame. We expect that better image editing models will minimize this effort in the future.

Limited Ability of I2V Models. We found that our method cannot follow the source video motion when the motion is fast (e.g. billiard balls hitting each other at full speed) or complex (e.g. a person clipping her hair). One possible reason is that the current popular I2V models are generally trained on slow-motion videos and therefore lack the ability to regenerate fast or complex motion even with motion guidance. We anticipate that a more robust I2V model would address this issue.

C.2 License of Assets

For image editing models, InstructPix2Pix (Brooks et al., 2023) inherits Creative ML OpenRAIL-M License as it is built upon Stable Diffusion. Neural Style Transfer (Gatys et al., 2015) is under Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. InstantID (Wang et al., 2024b) is under Apache License 2.0. AnyDoor (Chen et al., 2023c) is under the MIT License.

For baselines, Tune-A-Video (Wu et al., 2023b) is under Apache License 2.0, TokenFlow (Geyer et al., 2023) is under the MIT License, and FLATTEN (Cong et al., 2023) is under Apache License 2.0.

For the human evaluation dataset, the dataset has been collected from https://www.pexels.com, with all data governed by the terms outlined at https://www.pexels.com/license/.

We decide to release AnyV2V code under the Creative Commons Attribution 4.0 License for easy access in the research community.

C.3 Societal Impacts

Positive Social Impact. AnyV2V has the potential to significantly enhance the capabilities of video editing systems, making it easier for a wider range of users to manipulate videos. This could have numerous positive social impacts, as users would be able to achieve their editing goals without needing professional editing knowledge, such as expertise in Photoshop or painting.

Misinformation spread and Privacy violations. As our technique allows for object manipulation, it can produce highly realistic yet completely fabricated videos of one individual or subject. There is a risk that harmful actors could exploit our system to generate counterfeit videos to disseminate false information. Moreover, the ability to create convincing counterfeit content featuring individuals without their permission undermines privacy protections, possibly leading to the illicit use of a person’s likeness for harmful purposes and damaging their reputation. These issues are similarly present in DeepFake technologies. To mitigate the risk of misuse, one proposed solution is the adoption of unseen watermarking, a method commonly used to tackle such concerns in image generation.

C.4 Safeguards

It is crucial to implement proper safeguards and responsible AI frameworks when developing user-friendly video editing systems. For the human evaluation dataset, we manually collect a diverse range of images to ensure a balanced representation of objects from various domains. We only collect images that are considered safe.

C.5 Ethical Concerns for Human Evaluation

We believe our proposed human evaluation does not incur ethical concerns due to the following reasons: (1) the study does not involve any form of intervention or interaction that could affect the participants’ well-being. (2) Rating video content does not involve any physical or psychological risk, nor does it expose participants to sensitive or distressing material. (3) The data collected from participants will be entirely anonymous and will not contain any identifiable private information. (4) Participation in the study is entirely voluntary, and participants can withdraw at any time without any consequences.

C.6 New Assets

Our paper introduces several new assets including a human evaluation dataset and demo videos generated by AnyV2V. Each asset is thoroughly documented, detailing its creation, usage, and any relevant methodologies. The human evaluation dataset documentation includes details on how participant consent was obtained. The demo videos are provided as an anonymized zip file to comply with submission guidelines, with detailed instructions for use. All assets are shared under an open license to facilitate reuse and further research.

Appendix D More AnyV2V Showcases

Figure 11: AnyV2V becomes an instruction-based video editing tool when plugged into instruction-guided image editing models such as InstructPix2Pix (Brooks et al., 2023). Prompts used: "Turn the couple into robots", "Turn horse into zebra", and "Turn the sand into snow".
Figure 12: The more recent model InstantStyle (Wang et al., 2024a) can be seamlessly plugged into AnyV2V to perform reference-based style transfer video editing. The reference art image in the bottom left corner is used to obtain the first-frame edit.
Figure 13: AnyV2V can perform subject-driven video editing with a single image reference, using a subject-driven image editing model such as AnyDoor (Chen et al., 2023c). The subject image in the bottom left corner is used to obtain the first-frame edit.