License: arXiv.org perpetual non-exclusive license
arXiv:2402.17177v1 [cs.CV] 27 Feb 2024

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu¹*, Kai Zhang¹*, Yuan Li¹*, Zhiling Yan¹*, Chujie Gao¹*, Ruoxi Chen¹*, Zhengqing Yuan¹*, Yue Huang¹*, Hanchi Sun¹*, Jianfeng Gao², Lifang He¹, Lichao Sun¹†

¹Lehigh University   ²Microsoft Research

*Equal contributions. The order was determined by rolling dice. Chujie, Ruoxi, Yuan, Yue, and Zhengqing are visiting students in the LAIR lab at Lehigh University. The GitHub link is https://github.com/lichao-sun/SoraReview
†Lichao Sun is co-corresponding author: lis221@lehigh.edu

Abstract

Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model’s background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora’s development and investigate the underlying technologies used to build this “world simulator”. Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting the productivity and creativity of video generation.
Figure 1: Sora: A Breakthrough in AI-Powered Vision Generation.

1 Introduction

Since the release of ChatGPT in November 2022, the advent of AI technologies has marked a significant transformation, reshaping interactions and integrating deeply into various facets of daily life and industry [1, 2]. Building on this momentum, OpenAI released, in February 2024, Sora, a text-to-video generative AI model that can generate videos of realistic or imaginative scenes from text prompts. Compared to previous video generation models, Sora is distinguished by its ability to produce up to 1-minute long videos with high quality while maintaining adherence to the user’s text instructions [3]. This progression of Sora is the embodiment of the long-standing AI research mission of equipping AI systems (or AI Agents) with the capability of understanding and interacting with the physical world in motion. This involves developing AI models that are capable of not only interpreting complex user instructions but also applying this understanding to solve real-world problems through dynamic and contextually rich simulations.

Figure 2: Examples of Sora in text-to-video generation. Text instructions are given to the OpenAI Sora model, and it generates three videos according to the instructions.

Sora demonstrates a remarkable ability to accurately interpret and execute complex human instructions, as illustrated in Figure 2. The model can generate detailed scenes that include multiple characters performing specific actions against intricate backgrounds. Researchers attribute Sora’s proficiency to its ability not only to process user-generated textual prompts but also to discern the complicated interplay of elements within a scenario. One of the most striking aspects of Sora is its capacity to generate up to minute-long videos while maintaining high visual quality and compelling visual coherency. Unlike earlier models that can only generate short video clips, Sora’s minute-long video creation possesses a sense of progression and a visually consistent journey from its first frame to the last. In addition, Sora’s advancements are evident in its ability to produce extended video sequences with nuanced depictions of motion and interaction, overcoming the constraints of shorter clips and simpler visual renderings that characterized earlier video generation models. This capability represents a leap forward in AI-driven creative tools, allowing users to convert text narratives to rich visual stories. Overall, these advances show the potential of Sora as a world simulator to provide nuanced insights into the physical and contextual dynamics of the depicted scenes [3].

Technology. At the heart of Sora is a pre-trained diffusion transformer [4]. Transformer models have proven scalable and effective for many natural language tasks. Similar to powerful large language models (LLMs) such as GPT-4, Sora can parse text and comprehend complex user instructions. To make video generation computationally efficient, Sora employs spacetime latent patches as its building blocks. Specifically, Sora compresses a raw input video into a latent spacetime representation. Then, a sequence of latent spacetime patches is extracted from the compressed video to encapsulate both the visual appearance and motion dynamics over brief intervals. These patches, analogous to word tokens in language models, provide Sora with detailed visual phrases to be used to construct videos. Sora’s text-to-video generation is performed by a diffusion transformer model. Starting with a frame filled with visual noise, the model iteratively denoises the image and introduces specific details according to the provided text prompt. In essence, the generated video emerges through a multi-step refinement process, with each step refining the video to be more aligned with the desired content and quality.
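To make the multi-step refinement concrete, the following is a minimal, self-contained sketch of a generic DDPM-style reverse (denoising) loop over a sequence of spacetime latent patches conditioned on a text embedding. The toy denoiser, noise schedule, and tensor shapes are illustrative assumptions, not Sora’s actual implementation or sampler.

```python
# Illustrative sketch: text-conditioned iterative denoising over spacetime
# latent patches (DDPM-style ancestral sampling). All modules/shapes are toy
# assumptions for exposition, not Sora's actual design.
import torch
import torch.nn as nn

T = 50                      # number of denoising steps (assumed)
B, N, D = 1, 64, 32         # batch, number of spacetime patches, latent dim (assumed)

class ToyDenoiser(nn.Module):
    """Stand-in for the diffusion transformer: predicts the noise in x_t."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * dim, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t, text_emb):
        # x_t: [B, N, D] noisy patch tokens, t: [B], text_emb: [B, D]
        t_feat = (t.float() / T).view(-1, 1, 1).expand_as(x_t)
        c_feat = text_emb.unsqueeze(1).expand_as(x_t)
        return self.net(torch.cat([x_t, t_feat, c_feat], dim=-1))

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

model = ToyDenoiser(D)
text_emb = torch.randn(B, D)          # would come from the text encoder
x = torch.randn(B, N, D)              # start from pure noise in latent space
for step in reversed(range(T)):
    t = torch.full((B,), step)
    eps_hat = model(x, t, text_emb)
    mean = (x - betas[step] / (1 - alphas_bar[step]).sqrt() * eps_hat) / alphas[step].sqrt()
    noise = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
    x = mean + betas[step].sqrt() * noise
# x now holds denoised latent patches, to be decoded back to pixel-space frames.
```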

Highlights of Sora. Sora’s capabilities have profound implications in various aspects:

  • Improving simulation abilities: Sora’s remarkable ability to simulate various aspects of the physical world is attributed to training at scale. Despite lacking explicit 3D modeling, Sora exhibits 3D consistency with dynamic camera motion and long-range coherence that includes object persistence and simulates simple interactions with the world. Moreover, Sora intriguingly simulates digital environments like Minecraft, controlled by a basic policy while maintaining visual fidelity. These emergent abilities suggest that scaling video models is effective in creating AI models to simulate the complexity of physical and digital worlds.

  • Boosting creativity: Imagine outlining a concept through text, whether a simple object or a full scene, and seeing a realistic or highly stylized video rendered within seconds. Sora allows an accelerated design process for faster exploration and refinement of ideas, thus significantly boosting the creativity of artists, filmmakers, and designers.

  • Driving educational innovations: Visual aids have long been integral to understanding important concepts in education. With Sora, educators can easily turn a class plan from text to videos to captivate students’ attention and improve learning efficiency. From scientific simulations to historical dramatizations, the possibilities are boundless.

  • Enhancing accessibility: Enhancing accessibility in the visual domain is paramount. Sora offers an innovative solution by converting textual descriptions to visual content. This capability empowers all individuals, including those with visual impairments, to actively engage in content creation and interact with others in more effective ways. Consequently, it allows for a more inclusive environment where everyone has the opportunity to express his or her ideas through videos.

  • Fostering emerging applications: The applications of Sora are vast. For example, marketers might use it to create dynamic advertisements tailored to specific audience descriptions. Game developers might use it to generate customized visuals or even character actions from player narratives.

Limitations and Opportunities. While Sora’s achievements highlight significant advancements in AI, challenges remain. Depicting complex actions or capturing subtle facial expressions are among the areas where the model could be enhanced. In addition, ethical considerations such as mitigating biases in generated content and preventing harmful visual outputs underscore the importance of responsible usage by developers, researchers, and the broader community. Ensuring that Sora’s outputs are consistently safe and unbiased is a principal challenge. The field of video generation is advancing swiftly, with academic and industry research teams making relentless strides. The advent of competing text-to-video models suggests that Sora may soon be part of a dynamic ecosystem. This collaborative and competitive environment fosters innovation, leading to improved video quality and new applications that help improve the productivity of workers and make people’s lives more entertaining.

Our Contributions. Based on published technical reports and our reverse engineering, this paper presents the first comprehensive review of Sora’s background, related technologies, emerging applications, current limitations, and future opportunities.

2 Background

2.1 History

In the realm of computer vision (CV), prior to the deep learning revolution, traditional image generation techniques relied on methods like texture synthesis [5] and texture mapping [6], based on hand-crafted features. However, these methods were limited in their capacity to produce complex and vivid images. The introduction of Generative Adversarial Networks (GANs) [7] and Variational Autoencoders (VAEs) [8] marked a significant turning point due to their remarkable capabilities across various applications. Subsequent developments, such as flow models [9] and diffusion models [10], further enhanced image generation with greater detail and quality. The recent progress in Artificial Intelligence Generated Content (AIGC) technologies has democratized content creation, enabling users to generate desired content through simple textual instructions [11].

Over the past decade, the development of generative CV models has taken various routes, as shown in Figure 3. This landscape began to shift notably following the successful application of the transformer architecture [12] in NLP, as demonstrated by BERT [13] and GPT [14]. In CV, researchers take this concept even further by combining the transformer architecture with visual components, allowing it to be applied to downstream CV tasks, such as Vision Transformer (ViT) [15] and Swin Transformer [16]. Parallel to the transformer’s success, diffusion models have also made significant strides in the fields of image and video generation [10]. Diffusion models offer a mathematically sound framework for converting noise into images with U-Nets [17], where U-Nets facilitate this process by learning to predict and mitigate noise at each step. Since 2021, a paramount focus in AI has been on generative language and vision models that are capable of interpreting human instructions, known as multimodal models. For example, CLIP [18] is a pioneering vision-language model that combines transformer architecture with visual elements, facilitating its training on vast datasets of text and images. By integrating visual and linguistic knowledge from the outset, CLIP can function as an image encoder within multimodal generation frameworks. Another notable example is Stable Diffusion [19], a versatile text-to-image AI model celebrated for its adaptability and ease of use. It employs transformer architecture and latent diffusion techniques to decode textual inputs and produce images of a wide array of styles, further illustrating the advancements in multimodal AI.

Figure 3: History of Generative AI in Vision Domain.

Following the release of ChatGPT in November 2022, we have witnessed the emergence of commercial text-to-image products in 2023, such as Stable Diffusion [19], Midjourney [20], and DALL-E 3 [21]. These tools enable users to generate new images of high resolution and quality with simple text prompts, showcasing the potential of AI in creative image generation. However, transitioning from text-to-image to text-to-video is challenging due to the temporal complexity of videos. Despite numerous efforts in industry and academia, most existing video generation tools, such as Pika [22] and Gen-2 [23], are limited to producing only short video clips of a few seconds. In this context, Sora represents a significant breakthrough, akin to ChatGPT’s impact in the NLP domain. Sora is the first model that is capable of generating videos up to one minute long based on human instructions, marking a milestone that profoundly influences research and development in generative AI. To facilitate easy access to the latest advancements in vision generation models, the most recent works have been compiled and provided in the Appendix and our GitHub.

2.2 Advanced Concepts

Scaling Laws for Vision Models.   With scaling laws for LLMs, it is natural to ask whether the development of vision models follows similar scaling laws. Recently, Zhai et al. [24] have demonstrated that the performance-compute frontier for ViT models with enough training data roughly follows a (saturating) power law. Following them, Google Research [25] presented a recipe for highly efficient and stable training of a 22B-parameter ViT. Results show that great performance can be achieved using the frozen model to produce embeddings, and then training thin layers on top. Sora, as a large vision model (LVM), aligns with these scaling principles, uncovering several emergent abilities in text-to-video generation. This significant progression underscores the potential for LVMs to achieve advancements like those seen in LLMs.

Emergent Abilities.  Emergent abilities in LLMs are sophisticated behaviors or functions that manifest at certain scales—often linked to the size of the model’s parameters—that were not explicitly programmed or anticipated by their developers. These abilities are termed "emergent" because they emerge from the model’s comprehensive training across varied datasets, coupled with its extensive parameter count. This combination enables the model to form connections and draw inferences that surpass mere pattern recognition or rote memorization. Typically, the emergence of these abilities cannot be straightforwardly predicted by extrapolating from the performance of smaller-scale models. While numerous LLMs, such as ChatGPT and GPT-4, exhibit emergent abilities, vision models demonstrating comparable capabilities have been scarce until the advent of Sora. According to Sora’s technical report, it is the first vision model to exhibit confirmed emergent abilities, marking a significant milestone in the field of computer vision.

In addition to its emergent abilities, Sora exhibits other notable capabilities, including instruction following, visual prompt engineering, and video understanding. These aspects of Sora’s functionality represent significant advancements in the vision domain and will be explored and discussed in the remaining sections.

3 Technology

3.1 Overview of Sora

Figure 4: Reverse Engineering: Overview of Sora framework

At its core, Sora is a diffusion transformer [4] with flexible sampling dimensions, as shown in Figure 4. It has three parts: (1) A time-space compressor first maps the original video into latent space. (2) A ViT then processes the tokenized latent representation and outputs the denoised latent representation. (3) A CLIP-like [26] conditioning mechanism receives LLM-augmented user instructions and potentially visual prompts to guide the diffusion model to generate styled or themed videos. After many denoising steps, the latent representation of the generated video is obtained and then mapped back to pixel space with the corresponding decoder. In this section, we aim to reverse engineer the technology used by Sora and discuss a wide range of related works.

3.2 Data Pre-processing

3.2.1 Variable Durations, Resolutions, Aspect Ratios

One distinguishing feature of Sora is its ability to train on, understand, and generate videos and images at their native sizes [3] as illustrated in Figure 5. Traditional methods often resize, crop, or adjust the aspect ratios of videos to fit a uniform standard—typically short clips with square frames at fixed low resolutions [27][28][29]. Those samples are often generated at a wider temporal stride and rely on separately trained frame-insertion and resolution-rendering models as the final step, creating inconsistency across the video. Utilizing the diffusion transformer architecture [4] (see Section 3.2.4), Sora is the first model to embrace the diversity of visual data and can sample in a wide array of video and image formats, ranging from widescreen 1920x1080p videos to vertical 1080x1920p videos and everything in between without compromising their original dimensions.

Figure 5: Sora can generate images in flexible sizes or resolutions ranging from 1920x1080p to 1080x1920p and anything in between.
Refer to caption
Figure 6: A comparison between Sora (right) and a modified version of the model (left), which crops videos to square shapes—a common practice in model training—highlights the advantages of training on data at native aspect ratios.

Training on data in their native sizes significantly improves composition and framing in the generated videos. Empirical findings suggest that by maintaining the original aspect ratios, Sora achieves a more natural and coherent visual narrative. The comparison between Sora and a model trained on uniformly cropped square videos demonstrates a clear advantage as shown in Figure 6. Videos produced by Sora exhibit better framing, ensuring subjects are fully captured in the scene, as opposed to the sometimes truncated views resulting from square cropping.

This nuanced understanding and preservation of original video and image characteristics mark a significant advancement in the field of generative models. Sora’s approach not only showcases the potential for more authentic and engaging video generation but also highlights the importance of diversity in training data for achieving high-quality results in generative AI. The training approach of Sora aligns with the core tenet of Richard Sutton’s The Bitter Lesson[30], which states that leveraging computation over human-designed features leads to more effective and flexible AI systems. Just as the original design of diffusion transformers seeks simplicity and scalability [31], Sora’s strategy of training on data at their native sizes eschews traditional AI reliance on human-derived abstractions, favoring instead a generalist method that scales with computational power. In the rest of this section, we try to reverse engineer the architecture design of Sora and discuss related technologies to achieve this amazing feature.

3.2.2 Unified Visual Representation

To effectively process diverse visual inputs including images and videos with varying durations, resolutions, and aspect ratios, a crucial approach involves transforming all forms of visual data into a unified representation, which facilitates the large-scale training of generative models. Specifically, Sora patchifies videos by initially compressing videos into a lower-dimensional latent space, followed by decomposing the representation into spacetime patches. However, Sora’s technical report [3] merely presents a high-level idea, making reproduction challenging for the research community. In this section, we try to reverse-engineer the potential ingredients and technical pathways. Additionally, we will discuss viable alternatives that could replicate Sora’s functionalities, drawing upon insights from existing literature.

Figure 7: At a high level, Sora turns videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches. Source: Sora’s technical report [3].

3.2.3 Video Compression Network

Figure 8: ViT splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard Transformer encoder.

Sora’s video compression network (or visual encoder) aims to reduce the dimensionality of input data, especially a raw video, and output a latent representation that is compressed both temporally and spatially as shown in Figure 7. According to the references in the technical report, the compression network is built upon VAE or Vector Quantised-VAE (VQ-VAE) [32]. However, it is challenging for VAE to map visual data of any size to a unified and fixed-sized latent space if resizing and cropping are not used as mentioned in the technical report. We summarize two distinct implementations to address this issue:

Spatial-patch Compression.  This involves transforming video frames into fixed-size patches, akin to the methodologies employed in ViT [15] and MAE [33] (see Figure 8), before encoding them into a latent space. This approach is particularly effective for accommodating videos of varying resolutions and aspect ratios, as it encodes entire frames through the processing of individual patches. Subsequently, these spatial tokens are organized in a temporal sequence to create a spatial-temporal latent representation. This technique highlights several critical considerations: Temporal dimension variability – given the varying durations of training videos, the temporal dimension of the latent space representation cannot be fixed. To address this, one can either sample a specific number of frames (padding or temporal interpolation [34] may be needed for much shorter videos) or define a universally extended (super long) input length for subsequent processing (more details are described in Section 3.2.4); Utilization of pre-trained visual encoders – for processing videos of high resolution, leveraging existing pre-trained visual encoders, such as the VAE encoder from Stable Diffusion [19], is advisable for most researchers while Sora’s team is expected to train their own compression network with a decoder (the video generator) from scratch via the manner employed in training latent diffusion models [19, 35, 36]. These encoders can efficiently compress large-size patches (e.g., 256×256), facilitating the management of large-scale data; Temporal information aggregation – since this method primarily focuses on spatial patch compression, it necessitates an additional mechanism for aggregating temporal information within the model. This aspect is crucial for capturing dynamic changes over time and is further elaborated in subsequent sections (see details in Section 3.2.6 and Figure 14).
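A minimal sketch of this spatial-patch route is given below: each frame is split into fixed-size patches and linearly embedded (a strided Conv2d is equivalent to patch splitting plus a linear projection), and the per-frame tokens are then arranged in temporal order. All shapes and hyper-parameters are illustrative assumptions; positional embeddings and the temporal aggregation mechanism are omitted.

```python
# Illustrative sketch of spatial patchification: embed each frame independently
# (ViT-style) and arrange the per-frame tokens in temporal order. Shapes and
# hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 128, 128     # batch, frames, channels, height, width (assumed)
patch, dim = 16, 256                   # patch size and token dimension (assumed)

# A Conv2d with kernel = stride = patch implements "split into patches + linear embed".
to_tokens = nn.Conv2d(C, dim, kernel_size=patch, stride=patch)

video = torch.randn(B, T, C, H, W)
frames = video.flatten(0, 1)                       # [B*T, C, H, W]
tokens = to_tokens(frames)                         # [B*T, dim, H/patch, W/patch]
tokens = tokens.flatten(2).transpose(1, 2)         # [B*T, N_spatial, dim]
tokens = tokens.view(B, T, -1, dim)                # [B, T, N_spatial, dim]
seq = tokens.flatten(1, 2)                         # [B, T*N_spatial, dim] spatio-temporal sequence
print(seq.shape)                                   # torch.Size([2, 512, 256])
```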

Spatial-temporal-patch Compression.  This technique is designed to encapsulate both spatial and temporal dimensions of video data, offering a comprehensive representation. This technique extends beyond merely analyzing static frames by considering the movement and changes across frames, thereby capturing the video’s dynamic aspects. The utilization of 3D convolution emerges as a straightforward and potent method for achieving this integration [37]. The graphical illustration and the comparison against pure spatial patchification are depicted in Figure 9. Similar to spatial-patch compression, employing spatial-temporal-patch compression with predetermined convolution kernel parameters – such as fixed kernel sizes, strides, and output channels – results in variations in the dimensions of the latent space due to the differing characteristics of video inputs. This variability is primarily driven by the diverse durations and resolutions of the videos being processed. To mitigate this challenge, the approaches adopted for spatial patchification are equally applicable and effective in this context.

Figure 9: Comparison between different patchification approaches for video compression. Source: ViViT [38]. (Left) Spatial patchification simply samples n_t frames and embeds each 2D frame independently following ViT. (Right) Spatial-temporal patchification extracts and linearly embeds non-overlapping or overlapping tubelets that span the spatiotemporal input volume.
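Analogously, the spatial-temporal-patch route can be sketched with a strided 3D convolution that extracts and embeds non-overlapping tubelets; again, the sizes below are illustrative assumptions rather than Sora’s actual configuration.

```python
# Illustrative sketch of spatial-temporal ("tubelet") patchification with a 3D
# convolution, in the spirit of ViViT-style embeddings. All shapes are assumed.
import torch
import torch.nn as nn

B, C, T, H, W = 2, 3, 16, 128, 128          # video laid out as [B, C, T, H, W] (assumed)
pt, ps, dim = 2, 16, 256                    # temporal/spatial patch sizes, token dim (assumed)

# kernel = stride = (pt, ps, ps) extracts non-overlapping tubelets and embeds them linearly.
to_tubelets = nn.Conv3d(C, dim, kernel_size=(pt, ps, ps), stride=(pt, ps, ps))

video = torch.randn(B, C, T, H, W)
x = to_tubelets(video)                      # [B, dim, T/pt, H/ps, W/ps]
tokens = x.flatten(2).transpose(1, 2)       # [B, (T/pt)*(H/ps)*(W/ps), dim]
print(tokens.shape)                         # torch.Size([2, 512, 256])
```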

In summary, we reverse engineer the two patch-level compression approaches based on VAE or its variants like VQ-VAE, because operations on patches are more flexible for processing different types of videos. Since Sora aims to generate high-fidelity videos, a large patch size or kernel size is used for efficient compression. Here, we expect that fixed-size patches are used for simplicity, scalability, and training stability. But varying-size patches could also be used [39] to make the dimension of the whole frames or videos in latent space consistent. However, it may result in invalid positional encoding, and cause challenges for the decoder to generate videos with varying-size latent patches.

3.2.4 Spacetime Latent Patches

There is a pivotal concern remaining in the compression network part: How to handle the variability in latent space dimensions (i.e., the number of latent feature chunks or patches from different video types) before feeding patches into the input layers of the diffusion transformer. Here, we discuss several solutions.

Based on Sora’s technical report and the corresponding references, patch n’ pack (PNP) [40] is likely the solution. PNP packs multiple patches from different images in a single sequence as shown in Figure 10. This method is inspired by example packing used in natural language processing [41], which accommodates efficient training on variable length inputs by dropping tokens. Here the patchification and token embedding steps need to be completed in the compression network, but Sora may further patchify the latent into transformer tokens as Diffusion Transformer does [4]. Regardless of whether there is a second round of patchification, we need to address two concerns: how to pack those tokens in a compact manner and how to control which tokens should be dropped. For the first concern, a simple greedy approach is used which adds examples to the first sequence with enough remaining space. Once no more examples can fit, sequences are filled with padding tokens, yielding the fixed sequence lengths needed for batched operations. Such a simple packing algorithm can lead to significant padding, depending on the distribution of the length of inputs. On the other hand, we can control the resolutions and frames we sample to ensure efficient packing by tuning the sequence length and limiting padding. For the second concern, an intuitive approach is to drop similar tokens [42, 43, 33, 44] or, like PNP, apply dropping rate schedulers. However, it is worth noting that 3D consistency is one of the nice properties of Sora. Dropping tokens may ignore fine-grained details during training. Thus, we believe that OpenAI is likely to use a super long context window and pack all tokens from videos, although doing so is computationally expensive, e.g., the multi-head attention [45, 46] operator exhibits quadratic cost in sequence length. Specifically, spacetime latent patches from a long-duration video can be packed in one sequence while the ones from several short-duration videos are concatenated in another sequence.

Figure 10: Patch packing enables variable resolution images or videos with preserved aspect ratio. Token dropping could be treated as a form of data augmentation. Source: NaViT [40].
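The greedy packing logic described above can be illustrated with the following toy sketch, which packs variable-length patch-token sequences into fixed-length sequences and pads the remainder. The maximum length and padding placeholder are assumptions, and real implementations (e.g., NaViT) additionally track example identities and attention masks.

```python
# Toy greedy packing of variable-length patch-token sequences into fixed-length
# sequences (patch n' pack). This sketch only shows the packing logic itself.
from typing import List

MAX_LEN = 256          # fixed sequence length for batched operation (assumed)
PAD = -1               # padding placeholder (assumed)

def greedy_pack(examples: List[List[int]], max_len: int = MAX_LEN) -> List[List[int]]:
    sequences: List[List[int]] = []
    for ex in examples:
        ex = ex[:max_len]                          # tokens beyond max_len can never fit
        placed = False
        for seq in sequences:
            if len(seq) + len(ex) <= max_len:      # first sequence with enough remaining space
                seq.extend(ex)
                placed = True
                break
        if not placed:
            sequences.append(list(ex))
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

# Example: three videos/images yielding 200, 100, and 60 patch tokens each.
packed = greedy_pack([[1] * 200, [2] * 100, [3] * 60])
print(len(packed), [row.count(PAD) for row in packed])   # 2 sequences, padding counts [56, 96]
```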

3.2.5 Discussion

We discuss two technical solutions to data pre-processing that Sora may use. Both solutions are performed at the patch level due to the characteristics of flexibility and scalability for modeling. Different from previous approaches where videos are resized, cropped, or trimmed to a standard size, Sora trains on data at its native size. Although there are several benefits (see detailed analysis in Section 3.2.1), it brings some technical challenges, among which one of the most significant is that neural networks cannot inherently process visual data of variable durations, resolutions, and aspect ratios. Through reverse engineering, we believe that Sora firstly compresses visual patches into low-dimensional latent representations, and arranges such latent patches or further patchified latent patches in a sequence, then injects noise into these latent patches before feeding them to the input layer of diffusion transformer. Spatial-temporal patchification is adopted by Sora because it is simple to implement, and it can effectively reduce the context length with high-information-density tokens and decrease the complexity of subsequent modeling of temporal information.

To the research community, we recommend using cost-efficient alternative solutions for video compression and representation, including utilizing pre-trained checkpoints (e.g., compression network) [47], shortening the context window, using light-weight modeling mechanisms such as (grouped) multi-query attention [48, 49] or efficient architectures (e.g. Mamba [50]), downsampling data and dropping tokens if necessary. The trade-off between effectiveness and efficiency for video modeling is an important research topic to be explored.
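As one concrete example of the light-weight mechanisms mentioned above, the following is a minimal grouped-query attention sketch in which several query heads share a single key/value head, shrinking the key/value projections and cache; all dimensions are illustrative assumptions.

```python
# Minimal grouped-query attention sketch: n_q query heads share n_kv (< n_q)
# key/value heads. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=256, n_q=8, n_kv=2):
        super().__init__()
        assert n_q % n_kv == 0
        self.n_q, self.n_kv, self.hd = n_q, n_kv, dim // n_q
        self.q = nn.Linear(dim, n_q * self.hd)
        self.kv = nn.Linear(dim, 2 * n_kv * self.hd)
        self.out = nn.Linear(n_q * self.hd, dim)

    def forward(self, x):                                                  # x: [B, N, dim]
        B, N, _ = x.shape
        q = self.q(x).view(B, N, self.n_q, self.hd).transpose(1, 2)       # [B, n_q, N, hd]
        k, v = self.kv(x).view(B, N, 2, self.n_kv, self.hd).unbind(dim=2)
        k = k.transpose(1, 2).repeat_interleave(self.n_q // self.n_kv, dim=1)  # share KV heads
        v = v.transpose(1, 2).repeat_interleave(self.n_q // self.n_kv, dim=1)
        o = F.scaled_dot_product_attention(q, k, v)                        # [B, n_q, N, hd]
        return self.out(o.transpose(1, 2).reshape(B, N, -1))

tokens = torch.randn(2, 128, 256)
print(GroupedQueryAttention()(tokens).shape)                               # torch.Size([2, 128, 256])
```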

3.2.6 Diffusion Transformer

Figure 11: The overall framework of DiT (left) and U-ViT (right).

3.3 Modeling

Image Diffusion Transformer. Traditional diffusion models [51, 52, 53] mainly leverage convolutional U-Nets that include downsampling and upsampling blocks for the denoising network backbone. However, recent studies show that the U-Net architecture is not crucial to the good performance of the diffusion model. By incorporating a more flexible transformer architecture, transformer-based diffusion models can use more training data and larger model parameters. Along this line, DiT [4] and U-ViT [54] are among the first works to employ vision transformers for latent diffusion models. As in ViT, DiT employs a multi-head self-attention layer and a pointwise feed-forward network interlaced with some layer norm and scaling layers. Moreover, as shown in Figure 11, DiT incorporates conditioning via adaptive layer norm (AdaLN) with an additional MLP layer for zero-initialization, which initializes each residual block as an identity function and thus greatly stabilizes the training process. The scalability and flexibility of DiT are empirically validated, and DiT has become a new backbone for diffusion models. In U-ViT, as shown in Figure 11, all inputs, including time, condition, and noisy image patches, are treated as tokens, and long skip connections are added between the shallow and deep transformer layers. The results suggest that the downsampling and upsampling operators in CNN-based U-Net are not always necessary, and U-ViT achieves record-breaking FID scores in image and text-to-image generation.
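A compact sketch of a DiT-style block with AdaLN-zero conditioning is given below: a small MLP maps the conditioning vector to shift, scale, and gate parameters, and zero-initializing that MLP makes every residual block start as the identity. The layer sizes are illustrative assumptions, not the reference DiT implementation.

```python
# Compact sketch of a DiT-style transformer block with adaptive layer norm
# (AdaLN-zero) conditioning. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DiTBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Conditioning MLP producing 6 modulation vectors; zero-init => identity blocks at start.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x, c):                 # x: [B, N, dim] tokens, c: [B, dim] condition
        s1, b1, g1, s2, b2, g2 = self.ada(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), b1, s1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = modulate(self.norm2(x), b2, s2)
        return x + g2.unsqueeze(1) * self.mlp(h)

block = DiTBlock()
out = block(torch.randn(2, 64, 256), torch.randn(2, 256))
print(out.shape)                              # torch.Size([2, 64, 256])
```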

Like Masked AutoEncoder (MAE) [33], Masked Diffusion Transformer (MDT) [55] incorporates mask latent modeling into the diffusion process to explicitly enhance contextual relation learning among object semantic parts in image synthesis. Specifically, as shown in Figure 12, MDT uses a side-interpolater for an additional masked-token reconstruction task during training to boost the training efficiency and learn powerful context-aware positional embeddings for inference. Compared to DiT [4], MDT achieves better performance and faster learning speed. Instead of using AdaLN (i.e., shifting and scaling) for time-conditioning modeling, Hatamizadeh et al. [56] introduce Diffusion Vision Transformers (DiffiT), which uses a time-dependent self-attention (TMSA) module to model dynamic denoising behavior over sampling time steps. Besides, DiffiT uses two hybrid hierarchical architectures for efficient denoising in the pixel space and the latent space, respectively, and achieves new state-of-the-art results across various generation tasks. Overall, these studies show promising results in employing vision transformers for image latent diffusion, paving the way for future studies on other modalities.

Figure 12: The overall framework of Masked Diffusion Transformer (MDT). A solid/dotted line indicates the training/inference process for each time step. Masking and side-interpolater are only used during training and are removed during inference.

Video Diffusion Transformer. Building upon the foundational works in text-to-image (T2I) diffusion models, recent research has been focused on realizing the potential of diffusion transformers for text-to-video (T2V) generation tasks. Due to the temporal nature of videos, key challenges for applying DiTs in the video domain are: i) how to compress the video spatially and temporally to a latent space for efficient denoising; ii) how to convert the compressed latent to patches and feed them to the transformer; and iii) how to handle long-range temporal and spatial dependencies and ensure content consistency. Please refer to Section 3.2.3 for the first challenge. In this Section, we focus our discussion on transformer-based denoising network architectures designed to operate in the spatially and temporally compressed latent space. We give a detailed review of the two important works (Imagen Video [29] and Video LDM [36]) described in the reference list of the OpenAI Sora technique report.

Imagen Video [29], a text-to-video generation system developed by Google Research, utilizes a cascade of diffusion models, which consists of 7 sub-models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution, to transform textual prompts into high-definition videos. As shown in Figure 13, firstly, a frozen T5 text encoder generates contextual embeddings from the input text prompt. These embeddings are critical for aligning the generated video with the text prompt and are injected into all models in the cascade, in addition to the base model. Subsequently, the embedding is fed to the base model for low-resolution video generation, which is then refined by cascaded diffusion models to increase the resolution. The base video and super-resolution models use a 3D U-Net architecture in a space-time separable fashion. This architecture weaves temporal attention and convolution layers with spatial counterparts to efficiently capture inter-frame dependencies. It employs v-prediction parameterization for numerical stability and conditioning augmentation to facilitate parallel training across models. The process involves joint training on both images and videos, treating each image as a frame to leverage larger datasets, and using classifier-free guidance [57] to enhance prompt fidelity. Progressive distillation [58] is applied to streamline the sampling process, significantly reducing the computational load while maintaining perceptual quality. Combining these methods and techniques allows Imagen Video to generate videos with not only high fidelity but also remarkable controllability, as demonstrated by its ability to produce diverse videos, text animations, and content in various artistic styles.

Figure 13: The overall framework of Imagen Video. Source: Imagen Video [29].
Refer to caption
(a) Additional temporal layer. A pre-trained LDM is turned into a video generator by inserting temporal layers that learn to align frames into temporally consistent sequences. During optimization, the image backbone θ remains fixed and only the parameters ϕ of the temporal layers l_ϕ^i are trained.
Refer to caption
(b) Video LDM stack. Video LDM first generates sparse key frames and then temporally interpolates twice with the same latent diffusion models to achieve a high frame rate. Finally, the latent video is decoded to pixel space, and optionally, a video upsampler diffusion model is applied.
Figure 14: The overall framework of Video LDM. Source: Video LDM [36].

Blattmann et al. [36] propose to turn a 2D Latent Diffusion Model into a Video Latent Diffusion Model (Video LDM). They achieve this by inserting post-hoc temporal layers, which learn to align individual frames, between the existing spatial layers of both the U-Net backbone and the VAE decoder. These temporal layers are trained on encoded video data, while the spatial layers remain fixed, allowing the model to leverage large image datasets for pre-training. The LDM’s decoder is fine-tuned for temporal consistency in pixel space, and the diffusion model upsamplers are temporally aligned for enhanced spatial resolution. To generate very long videos, models are trained to predict a future frame given a number of context frames, allowing for classifier-free guidance during sampling. To achieve high temporal resolution, the video synthesis process is divided into key frame generation and interpolation between these key frames. Following cascaded LDMs, a DM is used to further scale up the Video LDM outputs by 4 times, ensuring high spatial resolution while maintaining temporal consistency. This approach enables the generation of globally coherent long videos in a computationally efficient manner. Additionally, the authors demonstrate the ability to transform pre-trained image LDMs (e.g., Stable Diffusion) into text-to-video models by training only the temporal alignment layers, achieving video synthesis with resolutions up to 1280 × 2048.
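The core idea of interleaving trainable temporal layers with frozen, pre-trained spatial layers can be sketched as follows; the module, shapes, and attention configuration are illustrative assumptions rather than Video LDM’s actual code.

```python
# Minimal sketch of the Video LDM idea: keep a pre-trained spatial (per-frame)
# layer frozen and interleave a trainable temporal layer that attends across
# frames at each spatial location. Module shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)   # pre-trained, frozen
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)  # new, trainable
        for p in self.spatial.parameters():
            p.requires_grad = False

    def forward(self, x):                      # x: [B, T, N, dim] (frames x spatial tokens)
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)             # attend within each frame
        s = s + self.spatial(s, s, s, need_weights=False)[0]
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)   # attend across frames
        t = t + self.temporal(t, t, t, need_weights=False)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)

block = SpatialTemporalBlock()
trainable = [n for n, p in block.named_parameters() if p.requires_grad]
print(trainable)                               # only the temporal layer's parameters
print(block(torch.randn(2, 8, 64, 128)).shape) # torch.Size([2, 8, 64, 128])
```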

3.3.1 Discussion

Cascade diffusion models for spatial and temporal up-sampling. Sora can generate high-resolution videos. By reviewing existing works and our reverse engineering, we speculate that Sora also leverages a cascade diffusion model architecture [59] which is composed of a base model and many space-time refiner models. The attention modules are unlikely to be heavily used in the base diffusion model and low-resolution diffusion model, considering the high computation cost and limited performance gain of using attention mechanisms in high-resolution cases. For spatial and temporal scene consistency, as previous works show that temporal consistency is more important than spatial consistency for video/scene generation, Sora is likely to leverage an efficient training strategy by using longer videos (for temporal consistency) with lower resolution. Moreover, Sora is likely to use a v-parameterization diffusion model [58], considering its superior performance compared to other variants that predict the original latent x or the noise ε.
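For reference, the v-parameterization of [58] can be stated as follows, assuming a variance-preserving forward process with α_t² + σ_t² = 1, where x is the clean latent and ε the injected noise; this is a restatement of the cited formulation rather than anything Sora-specific.

```latex
% v-parameterization (notation of progressive distillation [58]); assumes \alpha_t^2 + \sigma_t^2 = 1
z_t = \alpha_t x + \sigma_t \epsilon, \qquad
v_t \triangleq \alpha_t \epsilon - \sigma_t x,
\qquad \Longrightarrow \qquad
\hat{x} = \alpha_t z_t - \sigma_t \hat{v}_\theta(z_t), \quad
\hat{\epsilon} = \sigma_t z_t + \alpha_t \hat{v}_\theta(z_t).
```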

On the latent encoder. For training efficiency, most of the existing works leverage the pre-trained VAE encoder of Stable Diffusion [60, 61], a pre-trained 2D diffusion model, as an initialized model checkpoint. However, the encoder lacks the temporal compression ability. Even though some works propose to only fine-tune the decoder for handling temporal information, the decoder’s performance in dealing with video temporal data in the compressed latent space remains sub-optimal. Based on the technical report, our reverse engineering shows that, instead of using an existing pre-trained VAE encoder, it is likely that Sora uses a space-time VAE encoder, trained from scratch on video data, which performs better than existing ones with a video-oriented compressed latent space.

3.4 Language Instruction Following

Users primarily engage with generative AI models through natural language instructions, known as text prompts [62, 63]. Model instruction tuning aims to enhance AI models’ capability to follow prompts accurately. This improved capability in prompt following enables models to generate output that more closely resembles human responses to natural language queries. We start our discussion with a review of instruction following techniques for large language models (LLMs) and text-to-image models such as DALL·E 3. To enhance the text-to-video model’s ability to follow text instructions, Sora utilizes an approach similar to that of DALL·E 3. The approach involves training a descriptive captioner and utilizing the captioner’s generated data for fine-tuning. As a result of instruction tuning, Sora is able to accommodate a wide range of user requests, ensuring meticulous attention to the details in the instructions and generating videos that precisely meet users’ needs.

3.4.1 Large Language Models

The capability of LLMs to follow instructions has been extensively explored [64, 65, 66]. This ability allows LLMs to read, understand, and respond appropriately to instructions describing an unseen task without examples. Prompt-following ability is obtained and enhanced by fine-tuning LLMs on a mixture of tasks formatted as instructions [64, 66], a process known as instruction tuning. Wei et al. [65] showed that instruction-tuned LLMs significantly outperform untuned ones on unseen tasks. The instruction-following ability transforms LLMs into general-purpose task solvers, marking a paradigm shift in the history of AI development.
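For concreteness, an instruction-formatted training example used in instruction tuning [64, 66] might look like the following; the field names and content here are illustrative assumptions, not the schema of any particular dataset.

```python
# Illustrative instruction-tuning sample (field names are assumed, not standard).
instruction_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large vision models such as Sora generate videos directly from text...",
    "output": "Sora-style large vision models turn textual descriptions into videos.",
}
```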

3.4.2 Text-to-Image

Instruction following in DALL·E 3 is addressed by a caption-improvement method, built on the hypothesis that the quality of the text-image pairs a model is trained on determines the performance of the resulting text-to-image model [67]. Poor data quality, particularly the prevalence of noisy samples and short captions that omit a large amount of visual information, leads to issues such as neglected keywords and word order and misunderstood user intentions [21]. The caption-improvement approach addresses these issues by re-captioning existing images with detailed, descriptive captions. The approach first trains an image captioner, a vision-language model, to generate precise and descriptive image captions. The descriptive captions produced by this captioner are then used to fine-tune text-to-image models. Specifically, DALL·E 3 follows contrastive captioners (CoCa) [68] to jointly train an image captioner with a CLIP [26] architecture and a language model objective. This image captioner incorporates an image encoder, a unimodal text encoder for extracting language information, and a multimodal text decoder. It first employs a contrastive loss between unimodal image and text embeddings, followed by a captioning loss on the multimodal decoder's outputs. The resulting image captioner is further fine-tuned on highly detailed descriptions of images covering main objects, surroundings, backgrounds, texts, styles, and colorations. With this step, the image captioner is able to generate detailed descriptive captions for images. The training dataset for the text-to-image model is a mixture of the re-captioned dataset generated by the image captioner and ground-truth human-written data, ensuring that the model still captures user inputs. This caption-improvement method introduces a potential issue: a mismatch between actual user prompts and the descriptive image captions in the training data. DALL·E 3 addresses this by upsampling, where LLMs are used to re-write short user prompts into detailed and lengthy instructions. This ensures that the text inputs the model receives at inference time are consistent with those seen during training.
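A simplified sketch of the CoCa-style joint objective described above (an image-text contrastive loss on the unimodal embeddings plus a captioning loss on the multimodal decoder) is shown below; the module internals, tensor shapes, and temperature value are assumptions for illustration rather than DALL·E 3's actual training code.

```python
# Simplified CoCa-style joint loss [68]: contrastive + captioning terms.
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_tokens, temperature=0.07):
    # img_emb, txt_emb: (B, D) unimodal embeddings from the image/text towers
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2  # symmetric InfoNCE
    # caption_logits: (B, L, V) from the multimodal decoder; caption_tokens: (B, L)
    captioning = F.cross_entropy(caption_logits.transpose(1, 2), caption_tokens)
    return contrastive + captioning
```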

3.4.3 Text-to-Video

To enhance the ability of instruction following, Sora adopts a similar caption improvement approach. This method is achieved by first training a video captioner capable of producing detailed descriptions for videos. Then, this video captioner is applied to all videos in the training data to generate high-quality (video, descriptive caption) pairs, which are used to fine-tune Sora to improve its instruction following ability.
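At a high level, the re-captioning pipeline described above can be sketched as follows; `video_captioner` and `fine_tune` are placeholders, since how these components are implemented for Sora is not disclosed.

```python
# High-level sketch of the re-captioning pipeline (placeholders, not Sora's code).
def build_recaptioned_dataset(videos, video_captioner):
    """Pair every training video with a detailed, machine-generated caption."""
    return [(video, video_captioner(video)) for video in videos]

# Hypothetical usage:
# pairs = build_recaptioned_dataset(training_videos, video_captioner)
# fine_tune(text_to_video_model, pairs)   # fine-tune on (video, caption) pairs
```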

Sora's technical report [3] does not reveal details about how the video captioner is trained. Given that the video captioner is a video-to-text model, there are many approaches to building it. A straightforward approach is to utilize the CoCa architecture for video captioning by taking multiple frames of a video and feeding each frame into the image encoder [68], known as VideoCoCa [69]. VideoCoCa builds upon CoCa, re-using the pre-trained image encoder weights and applying the encoder independently to sampled video frames. The resulting frame token embeddings are flattened and concatenated into a long sequence of video representations. These flattened frame tokens are then processed by a generative pooler and a contrastive pooler, which are jointly trained with the contrastive loss and the captioning loss. Other alternatives for building video captioners include mPLUG-2 [70], GIT [71], FrozenBiLM [72], and more. Finally, to ensure that user prompts align with the format of the descriptive captions in the training data, Sora performs an additional prompt extension step, where GPT-4V is used to expand user inputs into detailed descriptive prompts.
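To illustrate the VideoCoCa-style frame handling described above, the following sketch applies a pre-trained image encoder to sampled frames independently and flattens the per-frame tokens into a single video token sequence before pooling; tensor shapes and function names are assumptions, not VideoCoCa's released code.

```python
# Schematic VideoCoCa-style frame tokenization [69]: encode frames independently,
# then flatten and concatenate their tokens into one long video representation.
import torch

def flatten_frame_tokens(image_encoder, video_frames):
    # video_frames: (B, T, 3, H, W) frames sampled from each video
    B, T = video_frames.shape[:2]
    frames = video_frames.flatten(0, 1)        # (B*T, 3, H, W)
    tokens = image_encoder(frames)             # (B*T, N, D) per-frame token embeddings
    N, D = tokens.shape[1:]
    return tokens.reshape(B, T * N, D)         # (B, T*N, D) fed to the poolers
```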

3.4.4 Discussion

The instruction-following ability is critical for Sora to generate one-minute-long videos with intricate scenes that are faithful to user intents. According to Sora's technical report [3], this ability is obtained by developing a captioner that can generate long and detailed captions, which are then used to train the model. However, the process of collecting data for training such a captioner is unknown and likely labor-intensive, as it may require detailed descriptions of videos. Moreover, the descriptive video captioner might hallucinate important details of the videos. We believe that how to improve the video captioner warrants further investigation and is critical for enhancing the instruction-following ability of text-to-video models.

3.5 Prompt Engineering

Prompt engineering refers to the process of designing and refining the input given to an AI system, particularly in the context of generative models, to achieve specific or optimized outputs [73, 74, 75]. The art and science of prompt engineering involve crafting these inputs in a way that guides the model to produce the most accurate, relevant, and coherent responses possible.

3.5.1 Text Prompt

Text prompt engineering is vital in directing text-to-video models (e.g., Sora [3]) to produce videos that are visually striking while precisely meeting user specifications. This involves crafting detailed descriptions that instruct the model to effectively bridge the gap between human creativity and the AI's execution capabilities [76]. The prompts for Sora cover a wide range of scenarios. Recent works (e.g., VoP [77], Make-A-Video [28], and Tune-A-Video [78]) have shown how prompt engineering leverages a model's natural language understanding ability to decode complex instructions and render them into cohesive, lively, and high-quality video narratives. As shown in Figure 15, “a stylish woman walking down a neon-lit Tokyo street…” is a meticulously crafted text prompt that ensures Sora generates a video aligned well with the expected vision. The quality of prompt engineering depends on the careful selection of words, the specificity of the details provided, and comprehension of their impact on the model's output. For example, the prompt in Figure 15 specifies in detail the actions, settings, character appearances, and even the desired mood and atmosphere of the scene.
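As a hypothetical illustration of the prompt-expansion idea discussed in this section, a short user prompt can be rewritten into a detailed, descriptive prompt before being passed to the text-to-video model; the template wording and the `call_llm` placeholder below are assumptions, not Sora's actual pipeline.

```python
# Hypothetical prompt-expansion step; `call_llm` stands in for an external LLM call.
EXPANSION_TEMPLATE = (
    "Rewrite the following short video idea into a detailed prompt that specifies "
    "the subject, actions, setting, camera movement, lighting, and overall mood:\n\n"
    "{user_prompt}"
)

def expand_prompt(user_prompt, call_llm):
    return call_llm(EXPANSION_TEMPLATE.format(user_prompt=user_prompt))

# Example (the expanded text itself would come from the LLM):
# detailed = expand_prompt("a stylish woman walking down a neon-lit Tokyo street", call_llm)
```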

Figure 15: A case study on prompt engineering for text-to-video generation, employing color coding to delineate the creative process. The text highlighted in blue describes the elements generated by Sora, such as the depiction of a stylish woman. In contrast, the text in yellow accentuates the model’s interpretation of actions, settings, and character appearances, demonstrating how a meticulously crafted prompt is transformed into a vivid and dynamic video narrative.

3.5.2 Image Prompt

An image prompt serves as a visual anchor for the to-be-generated video's content and other elements such as characters, setting, and mood [79]. In addition, a text prompt can instruct the model to animate these elements, e.g., by adding layers of movement, interaction, and narrative progression that bring the static image to life [27, 80, 81]. The use of image prompts allows Sora to convert static images into dynamic, narrative-driven videos by leveraging both visual and textual information. In Figure 16, we show AI-generated videos of “a Shiba Inu wearing a beret and turtleneck”, “a unique monster family”, “a cloud forming the word ‘SORA’”, and “surfers navigating a tidal wave inside a historic hall”. These examples demonstrate what can be achieved by prompting Sora with DALL·E-generated images.

Figure 16: This example illustrates the use of image prompts to guide Sora's text-to-video generation. The red boxes visually anchor the key elements of each scene: monsters of varied designs, a cloud formation spelling “SORA”, and surfers in an ornate hall facing a massive tidal wave.

3.5.3 Video Prompt

Video prompts can also be used for video generation, as demonstrated in [82, 83]. Recent works (e.g., Moonshot [84] and Fast-Vid2Vid [85]) show that good video prompts need to be specific and flexible. This ensures that the model receives clear direction on specific objectives, like the portrayal of particular objects and visual themes, and also allows for imaginative variations in the final output. For example, in video extension tasks, a prompt could specify the direction (forward or backward in time) and the context or theme of the extension. In Figure 17(a), the video prompt instructs Sora to extend a video backward in time to explore the events leading up to the original starting point. When performing video-to-video editing through video prompts, as shown in Figure 17(b), the model needs to clearly understand the desired transformation, such as changing the video's style, setting, or atmosphere, or altering subtle aspects like lighting or mood. In Figure 17(c), the prompt instructs Sora to connect videos while ensuring smooth transitions between objects in different scenes across videos.

Figure 17: These examples illustrate the video prompt techniques for Sora models: (a) Video Extension, where the model extrapolates the sequence backward from the original footage, (b) Video Editing, where specific elements like the setting are transformed as per the text prompt, and (c) Video Connection, where two distinct video prompts are seamlessly blended to create a coherent narrative. Each process is guided by a visual anchor, marked by a red box, ensuring continuity and precision in the generated video content.

3.5.4 Discussion

Prompt engineering allows users to guide AI models to generate content that aligns with their intent. As an example, the combined use of text, image, and video prompts enables Sora to create content that is not only visually compelling but also aligned well with users’ expectations and intent. While previous studies on prompt engineering have been focused on text and image prompts for LLMs and LVMs [86, 87, 88], we expect that there will be a growing interest in video prompts for video generation models.

3.6 Trustworthiness

With the rapid advancement of sophisticated models such as ChatGPT [89], GPT-4V [90], and Sora [3], the capabilities of these models have seen remarkable enhancements. These developments have made significant contributions to improving work efficiency and propelling technological progress. However, these advancements also raise concerns about the potential for misuse of these technologies, including the generation of fake news [91, 92], privacy breaches [93], and ethical dilemmas [94, 95]. Consequently, the issue of trustworthiness in large models has garnered extensive attention from both the academic and industrial spheres, emerging as a focal point of contemporary research discussions.

3.6.1 Safety Concern

One primary area of focus is the model's safety, specifically its resilience against misuse and so-called “jailbreak” attacks, where users attempt to exploit vulnerabilities to generate prohibited or harmful content [96, 97, 98, 99, 100, 101, 102, 103, 104, 105]. For instance, AutoDAN [103], a novel and interpretable adversarial attack method based on gradient techniques, was introduced to bypass system safeguards. In a recent study, researchers explore two reasons why LLMs struggle to resist jailbreak attacks: competing objectives and mismatched generalization [106]. Besides textual attacks, visual jailbreaks also threaten the safety of multimodal models (e.g., GPT-4V [90] and Sora [3]). A recent study [107] found that large multimodal models are more vulnerable because the continuous and high-dimensional nature of the additional visual input makes them weaker against adversarial attacks, representing an expanded attack surface.

3.6.2 Other Exploitation

Due to the large scale of the training datasets and the training methodology of large foundation models (e.g., ChatGPT [89] and Sora [3]), the truthfulness of these models needs to be enhanced, as related issues such as hallucination have been widely discussed [108]. Hallucination in this context refers to the models' tendency to generate responses that may appear convincing but are unfounded or false [96]. This phenomenon raises critical questions about the reliability and trustworthiness of model outputs, necessitating a comprehensive approach to both evaluate and address the issue. A number of studies have been dedicated to dissecting the problem of hallucination from various angles. These include efforts to evaluate the extent and nature of hallucination across different models and scenarios [109, 96, 110, 111]. Such evaluations provide invaluable insights into how and why hallucinations occur, laying the groundwork for developing strategies to mitigate their incidence. Concurrently, a significant body of research is focused on devising and implementing methods to reduce hallucinations in these large models [112, 113, 114].

Another vital aspect of trustworthiness is fairness and bias. Developing models that do not perpetuate or exacerbate societal biases is a paramount concern. This priority stems from the recognition that biases encoded within these models can reinforce existing social inequities, leading to discriminatory outcomes. Studies in this area, as evidenced by the work of Gallegos et al. [115], Zhang et al. [116], Liang et al. [117], and Friedrich et al. [118], are dedicated to the meticulous identification and rectification of these inherent biases. The goal is to cultivate models that operate fairly, treating all individuals equitably without bias towards race, gender, or other sensitive attributes. This involves not only detecting and mitigating bias in datasets but also designing algorithms that can actively counteract the propagation of such biases [119, 120].

Privacy preservation emerges as another foundational pillar when these models are deployed. In an era where data privacy concerns are escalating, the emphasis on protecting user data has never been more critical. Increasing public awareness of and concern over how personal data is handled have prompted more rigorous evaluations of large models. These evaluations focus on the models' capacity to protect user data, ensuring that personal information remains confidential and is not inadvertently disclosed. Research by Mireshghallah et al. [121], Plant et al. [122], and Li et al. [123] exemplifies efforts to advance methodologies and technologies that safeguard privacy.

3.6.3 Alignment

In addressing these challenges, ensuring the trustworthiness of large models has become one of the primary concerns for researchers [124, 96, 99, 125]. Among the most important technologies is model alignment [125, 126], which refers to the process and goal of ensuring that the behavior and outputs of models are consistent with the intentions and ethical standards of human designers. This concerns the development of technology, its moral responsibilities, and social values. In the domain of LLMs, Reinforcement Learning from Human Feedback (RLHF) [127, 128] has been widely applied for model alignment. This method combines reinforcement learning (RL) with direct human feedback, allowing models to better align with human expectations and standards in understanding and performing tasks.
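As a point of reference (following the commonly used formulation in RLHF pipelines such as [127, 128], rather than any Sora-specific detail), the alignment step typically optimizes a KL-regularized objective of the form

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x)\big),

where r_\phi is a reward model trained on human preference comparisons, \pi_{\mathrm{ref}} is the supervised fine-tuned reference policy, and \beta controls how far the aligned policy \pi_\theta may drift from the reference.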

3.6.4 Discussion

From Sora (specifically its technical report), we summarize some insightful findings that potentially offer an informative guideline for future work:

(1) Integrated Protection of Model and External Security: As models become more powerful, especially in generating content, ensuring that they are not misused to produce harmful content (such as hate speech [129] and false information [92, 91]) has become a serious challenge. In addition to aligning the model itself, external security protections are equally important. These include content filtering and review mechanisms, usage permissions and access control, data privacy protection, as well as enhancements in transparency and explainability. For instance, OpenAI now uses a detection classifier to tell whether a given video is generated by Sora [130]. Moreover, a text classifier is deployed to detect potentially harmful textual inputs [130].

(2) Security Challenges of Multimodal Models: Multimodal models, such as text-to-video models like Sora, bring additional complexity to security due to their ability to understand and generate various types of content (text, images, videos, etc.). Multimodal models can produce content in various forms, increasing the ways and scope of misuse and copyright issues. Because the content generated by multimodal models is more complex and diverse, traditional methods of content verification and authentication may no longer be effective. This requires the development of new technologies and methods to identify and filter harmful content generated by these models, increasing the difficulty of regulation and management.

(3) The Need for Interdisciplinary Collaboration: Ensuring the safety of models is not just a technical issue but also requires cross-disciplinary cooperation. To address these challenges, experts from various fields such as law [131] and psychology [132] need to work together to develop appropriate norms (e.g., what is safe and what is unsafe?), policies, and technological solutions. The need for interdisciplinary collaboration significantly increases the complexity of solving these issues.

4 Applications

As video diffusion models, exemplified by Sora, emerge as a forefront technology, their adoption across diverse research fields and industries is rapidly accelerating. The implications of this technology extend far beyond mere video creation, offering transformative potential for tasks ranging from automated content generation to complex decision-making processes. In this section, we delve into a comprehensive examination of the current applications of video diffusion models, highlighting key areas where Sora has not only demonstrated its capabilities but also revolutionized the approach to solving complex problems. We aim to offer a broad perspective for the practical deployment scenarios (see Figure 18).

Figure 18: Applications of Sora.

4.1 Movie

Traditionally, creating cinematic masterpieces has been an arduous and expensive process, often requiring decades of effort, cutting-edge equipment, and substantial financial investments. However, the advent of advanced video generation technologies heralds a new era in film-making, one where the dream of autonomously producing movies from simple text inputs is becoming a reality. Researchers have ventured into the realm of movie generation by extending video generation models into creating movies. MovieFactory [133] applies diffusion models to generate film-style videos from elaborate scripts produced by ChatGPT [89], representing a significant leap forward. In the follow-up, MobileVidFactory [134] can automatically generate vertical mobile videos with only simple texts provided by users. Vlogger [135] makes it feasible for users to compose a minute-long vlog. These developments, epitomized by Sora’s ability to generate captivating movie content effortlessly, mark a pivotal moment in the democratization of movie production. They offer a glimpse into a future where anyone can be a filmmaker, significantly lowering the barriers to entry in the film industry and introducing a novel dimension to movie production that blends traditional storytelling with AI-driven creativity. The implications of these technologies extend beyond simplification. They promise to reshape the landscape of film production, making it more accessible and versatile in the face of evolving viewer preferences and distribution channels.

4.2 Education

The landscape of educational content has long been dominated by static resources, which, despite their value, often fall short of catering to the diverse needs and learning styles of today’s students. Video diffusion models stand at the forefront of an educational revolution, offering unprecedented opportunities to customize and animate educational materials in ways that significantly enhance learner engagement and understanding. These advanced technologies enable educators to transform text descriptions or curriculum outlines into dynamic, engaging video content tailored to the specific style, and interests of individual learners [136, 137, 138, 139]. Moreover, image-to-video editing techniques [140, 141, 142] present innovative avenues for converting static educational assets into interactive videos, thereby supporting a range of learning preferences and potentially increasing student engagement. By integrating these models into educational content creation, educators can produce videos on a myriad of subjects, making complex concepts more accessible and captivating for students. The use of Sora in revolutionizing the educational domain exemplifies the transformative potential of these technologies. This shift towards personalized, dynamic educational content heralds a new era in education.

4.3 Gaming

The gaming industry constantly seeks ways to push the boundaries of realism and immersion, yet traditional game development often grapples with the limitations of pre-rendered environments and scripted events. The real-time generation of dynamic, high-fidelity video content and realistic sound effects by diffusion models promises to overcome existing constraints, offering developers the tools to create evolving game environments that respond organically to player actions and game events [143, 144]. This could include generating changing weather conditions, transforming landscapes, or even creating entirely new settings on the fly, making game worlds more immersive and responsive. Some methods [145, 146] also synthesize realistic impact sounds from video inputs, enhancing game audio experiences. With the integration of Sora within the gaming domain, unparalleled immersive experiences that captivate and engage players can be created. How games are developed, played, and experienced will be transformed, opening new possibilities for storytelling, interaction, and immersion.

4.4 Healthcare

Beyond their generative capabilities, video diffusion models excel in understanding and generating complex video sequences, making them particularly suited for identifying dynamic anomalies within the body, such as early cellular apoptosis [147], skin lesion progression [148], and irregular human movements [149], which are crucial for early disease detection and intervention strategies. Additionally, models like MedSegDiff-V2 [150] and the method of [151] leverage the power of transformers to segment medical images with unprecedented precision, enabling clinicians to pinpoint areas of interest across various imaging modalities with enhanced accuracy. The integration of Sora into clinical practice promises not only to refine diagnostic processes but also to personalize patient care, offering tailored treatment plans based on precise medical imaging analysis. However, this technological integration comes with its own set of challenges, including the need for robust data privacy measures and addressing ethical considerations in healthcare.

4.5 Robotics

Video diffusion models now play important roles in robotics, ushering in a new era where robots can generate and interpret complex video sequences for enhanced perception [152, 153] and decision-making [154, 155, 156]. These models unlock new capabilities in robots, enabling them to interact with their environment and execute tasks with unprecedented complexity and precision. The introduction of web-scale diffusion models to robotics [152] showcases the potential for leveraging large-scale models to enhance robotic vision and understanding. Latent diffusion models are employed for language-instructed video prediction [157], allowing robots to understand and execute tasks by predicting the outcome of actions in video format. Furthermore, the reliance on simulated environments for robotics research has been innovatively addressed by video diffusion models capable of creating highly realistic video sequences [158, 159]. This enables the generation of diverse training scenarios for robots, mitigating the limitations imposed by the scarcity of real-world data. We believe that the integration of technologies like Sora into the robotics field holds the promise of groundbreaking developments. By harnessing the power of Sora, the future of robotics is poised for unprecedented advancements, where robots can seamlessly navigate and interact with their environments.

5 Discussion

Sora shows a remarkable talent for precisely understanding and implementing complex instructions from humans. This model excels at creating detailed videos with various characters, all set within elaborately crafted settings. A particularly impressive attribute of Sora is its ability to produce videos up to one minute in length while ensuring consistent and engaging storytelling. This marks a significant improvement over previous attempts that focused on shorter video pieces, as Sora’s extended sequences exhibit a clear narrative flow and maintain visual consistency from start to finish. Furthermore, Sora distinguishes itself by generating longer video sequences that capture complex movements and interactions, advancing past the restrictions of earlier models that could only handle short clips and basic images. This advancement signifies a major step forward in AI-powered creative tools, enabling users to transform written stories into vivid videos with a level of detail and sophistication that was previously unattainable.

5.1 Limitations

Challenges in Physical Realism. Sora, as a simulation platform, exhibits a range of limitations that undermine its effectiveness in accurately depicting complex scenarios. Most important is its inconsistent handling of physical principles within complex scenes, which leads to failures in accurately reproducing specific instances of cause and effect. For instance, consuming a portion of a cookie might not result in a corresponding bite mark, illustrating the system's occasional departure from physical plausibility. This issue extends to the simulation of motion, where Sora generates movements that defy realistic physical modeling, such as unnatural transformations of objects or the incorrect simulation of rigid structures like chairs, leading to unrealistic physical interactions. The challenge further increases when simulating complex interactions among objects and characters, occasionally producing outcomes that lean towards the humorous.

Spatial and Temporal Complexities. Sora occasionally misunderstands instructions related to the placement or arrangement of objects and characters within a given prompt, leading to confusion about directions (e.g., confusing left for right). Additionally, it faces challenges in maintaining the temporal accuracy of events, particularly when it comes to adhering to designated camera movements or sequences. This can result in deviating from the intended temporal flow of scenes. In complex scenarios that involve a multitude of characters or elements, Sora has a tendency to insert irrelevant animals or people. Such additions can significantly change the originally envisioned composition and atmosphere of the scene, moving away from the planned narrative or visual layout. This issue not only affects the model’s ability to accurately recreate specific scenes or narratives but also impacts its reliability in generating content that closely aligns with the user’s expectations and the coherence of the generated output.

Limitations in Human-computer Interaction (HCI). Sora, while showing potential in the video generation domain, faces significant limitations in HCI. These limitations are primarily evident in the coherence and efficiency of user-system interactions, especially when making detailed modifications or optimizations to generated content. For instance, users might find it difficult to precisely specify or adjust the presentation of specific elements within a video, such as action details and scene transitions. Additionally, Sora’s limitations in understanding complex language instructions or capturing subtle semantic differences could result in video content that does not fully meet user expectations or needs. These shortcomings restrict Sora’s potential in video editing and enhancement, also impacting the overall satisfaction of the user experience.

Usage Limitation. Regarding usage limitations, OpenAI has not yet set a specific release date for public access to Sora, emphasizing a cautious approach towards safety and readiness before broad deployment. This indicates that further improvements and testing in areas such as security, privacy protection, and content review may still be necessary for Sora. Moreover, at present, Sora can only generate videos up to one minute in length, and according to published cases, most generated videos are only a few dozen seconds long. This restricts its use in applications requiring longer content display, such as detailed instructional videos or in-depth storytelling, and reduces Sora's flexibility in content creation.

5.2 Opportunities

Academy. (1) The introduction of Sora by OpenAI marks a strategic shift towards encouraging the broader AI community to delve deeper into the exploration of text-to-video models, leveraging both diffusion and transformer technologies. This initiative aims to redirect the focus toward the potential of creating highly sophisticated and nuanced video content directly from textual descriptions, a frontier that promises to revolutionize content creation, storytelling, and information sharing. (2) The innovative approach of training Sora on data at its native size, as opposed to the traditional methods of resizing or cropping, serves as a groundbreaking inspiration for the academic community. It opens up new pathways by highlighting the benefits of utilizing unmodified datasets, which leads to the creation of more advanced generative models.

Industry. (1) The current capabilities of Sora signal a promising path for the advancement of video simulation technologies, highlighting the potential to significantly enhance realism within both physical and digital areas. The prospect of Sora enabling the creation of highly realistic environments through textual descriptions presents a promising future for content creation. This potential extends to revolutionizing game development, offering a glimpse into a future where immersive-generated worlds can be crafted with unprecedented ease and accuracy. (2) Companies may leverage Sora to produce advertising videos that swiftly adapt to market changes and create customized marketing content. This not only reduces production costs but also enhances the appeal and effectiveness of advertisements. The ability of Sora to generate highly realistic video content from textual descriptions alone could revolutionize how brands engage with their audience, allowing for the creation of immersive and compelling videos that capture the essence of their products or services in unprecedented ways.

Society. (1) While the prospect of utilizing text-to-video technology to replace traditional filmmaking remains distant, Sora and similar platforms hold transformative potential for content creation on social media. The constraints of current video lengths do not diminish the impact these tools can have in making high-quality video production accessible to everyone, enabling individuals to produce compelling content without the need for expensive equipment. It represents a significant shift towards empowering content creators across platforms like TikTok and Reels, bringing in a new age of creativity and engagement. (2) Screenwriters and creative professionals can use Sora to transform written scripts into videos, assisting them in better showing and sharing their creative concepts, and even in producing short films and animations. The ability to create detailed, vivid videos from scripts can fundamentally change the pre-production process of filmmaking and animation, offering a glimpse into how future storytellers might pitch, develop, and refine their narratives. This technology opens up possibilities for a more dynamic and interactive form of script development, where ideas can be visualized and assessed in real time, providing a powerful tool for creativity and collaboration. (3) Journalists and news organizations can also utilize Sora to quickly generate news reports or explanatory videos, making the news content more vivid and engaging. This can significantly increase the coverage and audience engagement of news reports. By providing a tool that can simulate realistic environments and scenarios, Sora offers a powerful solution for visual storytelling, enabling journalists to convey complex stories through engaging videos that were previously difficult or expensive to produce. In summary, Sora’s potential to revolutionize content creation across marketing, journalism, and entertainment is immense.

6 Conclusion

We present a comprehensive review of Sora to help developers and researchers study the capabilities and related works of Sora. The review is based on our survey of published technical reports and reverse engineering based on existing literature. We will continue to update the paper when Sora’s API is available and further details about Sora are revealed. We hope that this review paper will prove a valuable resource for the open-source research community and lay a foundation for the community to jointly develop an open-source version of Sora in the near future to democratize video auto-creation in the era of AIGC. To achieve this goal, we invite discussions, suggestions, and collaborations on all fronts.

References

  • [1] OpenAI, “Chatgpt: Get instant answers, find creative inspiration, learn something new..” https://openai.com/chatgpt, 2022.
  • [2] OpenAI, “Gpt-4 technical report,” 2023.
  • [3] OpenAI, “Sora: Creating video from text.” https://openai.com/sora, 2024.
  • [4] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
  • [5] A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proceedings of the seventh IEEE international conference on computer vision, vol. 2, pp. 1033–1038, IEEE, 1999.
  • [6] P. S. Heckbert, “Survey of texture mapping,” IEEE computer graphics and applications, vol. 6, no. 11, pp. 56–67, 1986.
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv, 2014.
  • [8] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [9] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
  • [10] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [11] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
  • [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
  • [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [14] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018.
  • [15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
  • [17] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015.
  • [18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021.
  • [19] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
  • [20] M. AI, “Midjourney: Text to image with ai art generator.” https://www.midjourneyai.ai/en, 2023.
  • [21] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., “Improving image generation with better captions,” Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, p. 3, 2023.
  • [22] P. AI, “Pika is the idea-to-video platform that sets your creativity in motion..” https://pika.art/home, 2023.
  • [23] R. AI, “Gen-2: Gen-2: The next step forward for generative ai.” https://research.runwayml.com/gen2, 2023.
  • [24] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022.
  • [25] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al., “Scaling vision transformers to 22 billion parameters,” in International Conference on Machine Learning, pp. 7480–7512, PMLR, 2023.
  • [26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • [27] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
  • [28] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman, “Make-a-video: Text-to-video generation without text-video data,” 2022.
  • [29] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022.
  • [30] R. Sutton, “The bitter lesson.” http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019.
  • [31] S. Xie, “Take on sora technical report.” https://twitter.com/sainingxie/status/1758433676105310543, 2024.
  • [32] A. Van Den Oord, O. Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
  • [33] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.
  • [34] S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22930–22941, 2023.
  • [35] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” arXiv preprint arXiv:2311.17042, 2023.
  • [36] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023.
  • [37] M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: Adaptive space-time tokenization for videos,” Advances in Neural Information Processing Systems, vol. 34, pp. 12786–12797, 2021.
  • [38] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” arXiv preprint arXiv:2103.15691, 2021.
  • [39] L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic, “Flexivit: One model for all patch sizes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14496–14506, 2023.
  • [40] M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al., “Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [41] M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgibbon, “Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance,” arXiv preprint arXiv:2107.02027, 2021.
  • [42] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, “A-vit: Adaptive tokens for efficient vision transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10809–10818, 2022.
  • [43] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” in The Eleventh International Conference on Learning Representations, 2022.
  • [44] M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, and J. Gall, “Adaptive token sampling for efficient vision transformers,” in European Conference on Computer Vision, pp. 396–414, Springer, 2022.
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [46] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in ICML, vol. 2, p. 4, 2021.
  • [47] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al., “Language model beats diffusion–tokenizer is key to visual generation,” arXiv preprint arXiv:2310.05737, 2023.
  • [48] N. Shazeer, “Fast transformer decoding: One write-head is all you need,” 2019.
  • [49] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023.
  • [50] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [51] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” arXiv preprint arXiv:1503.03585, 2015.
  • [52] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [53] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
  • [54] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All are worth words: A vit backbone for diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [55] S. Gao, P. Zhou, M.-M. Cheng, and S. Yan, “Masked diffusion transformer is a strong image synthesizer,” arXiv preprint arXiv:2303.14389, 2023.
  • [56] A. Hatamizadeh, J. Song, G. Liu, J. Kautz, and A. Vahdat, “Diffit: Diffusion vision transformers for image generation,” arXiv preprint arXiv:2312.02139, 2023.
  • [57] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [58] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022.
  • [59] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022.
  • [60] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” 2021.
  • [61] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023.
  • [62] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv, 2020.
  • [63] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022.
  • [64] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., “Multitask prompted training enables zero-shot task generalization,” arXiv preprint arXiv:2110.08207, 2021.
  • [65] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
  • [66] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
  • [67] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning, pp. 4904–4916, PMLR, 2021.
  • [68] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022.
  • [69] S. Yan, T. Zhu, Z. Wang, Y. Cao, M. Zhang, S. Ghosh, Y. Wu, and J. Yu, “Video-text modeling with zero-shot transfer from contrastive captioners,” arXiv preprint arXiv:2212.04979, 2022.
  • [70] H. Xu, Q. Ye, M. Yan, Y. Shi, J. Ye, Y. Xu, C. Li, B. Bi, Q. Qian, W. Wang, et al., “mplug-2: A modularized multi-modal foundation model across text, image and video,” arXiv preprint arXiv:2302.00402, 2023.
  • [71] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang, “Git: A generative image-to-text transformer for vision and language,” arXiv preprint arXiv:2205.14100, 2022.
  • [72] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Zero-shot video question answering via frozen bidirectional language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 124–141, 2022.
  • [73] Y. Li, “A practical survey on zero-shot prompt design for in-context learning,” in Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processing, RANLP, INCOMA Ltd., Shoumen, BULGARIA, 2023.
  • [74] B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering in large language models: a comprehensive review,” arXiv preprint arXiv:2310.14735, 2023.
  • [75] S. Pitis, M. R. Zhang, A. Wang, and J. Ba, “Boosted prompt ensembles for large language models,” 2023.
  • [76] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” 2023.
  • [77] S. Huang, B. Gong, Y. Pan, J. Jiang, Y. Lv, Y. Li, and D. Wang, “Vop: Text-video co-operative prompt tuning for cross-modal retrieval,” 2023.
  • [78] J. Z. Wu, Y. Ge, X. Wang, W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” 2023.
  • [79] T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7086–7096, June 2022.
  • [80] X. Chen, Y. Wang, L. Zhang, S. Zhuang, X. Ma, J. Yu, Y. Wang, D. Lin, Y. Qiao, and Z. Liu, “Seine: Short-to-long video diffusion model for generative transition and prediction,” 2023.
  • [81] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” 2024.
  • [82] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, “Video-to-video synthesis,” 2018.
  • [83] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, “Few-shot video-to-video synthesis,” 2019.
  • [84] D. J. Zhang, D. Li, H. Le, M. Z. Shou, C. Xiong, and D. Sahoo, “Moonshot: Towards controllable video generation and editing with multimodal conditions,” 2024.
  • [85] L. Zhuo, G. Wang, S. Li, W. Wu, and Z. Liu, “Fast-vid2vid: Spatial-temporal compression for video-to-video synthesis,” 2022.
  • [86] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” 2021.
  • [87] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
  • [88] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in European Conference on Computer Vision, pp. 709–727, Springer, 2022.
  • [89] OpenAI, “Introducing chatgpt,” 2023.
  • [90] OpenAI, “Gpt-4v(ision) system card,” 2023.
  • [91] Y. Huang and L. Sun, “Harnessing the power of chatgpt in fake news: An in-depth exploration in generation, detection and explanation,” 2023.
  • [92] C. Chen and K. Shu, “Can llm-generated misinformation be detected?,” 2023.
  • [93] Z. Liu, Y. Huang, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, Y. Li, P. Shu, F. Zeng, L. Sun, W. Liu, D. Shen, Q. Li, T. Liu, D. Zhu, and X. Li, “Deid-gpt: Zero-shot medical text de-identification by gpt-4,” 2023.
  • [94] J. Yao, X. Yi, X. Wang, Y. Gong, and X. Xie, “Value fulcra: Mapping large language models to the multidimensional spectrum of basic human values,” 2023.
  • [95] Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt: A benchmark for trustworthy and responsible large language models,” 2023.
  • [96] L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li, Z. Liu, Y. Liu, Y. Wang, Z. Zhang, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang, J. Wang, J. Mitchell, K. Shu, K. Xu, K.-W. Chang, L. He, L. Huang, M. Backes, N. Z. Gong, P. S. Yu, P.-Y. Chen, Q. Gu, R. Xu, R. Ying, S. Ji, S. Jana, T. Chen, T. Liu, T. Zhou, W. Wang, X. Li, X. Zhang, X. Wang, X. Xie, X. Chen, X. Wang, Y. Liu, Y. Ye, Y. Cao, Y. Chen, and Y. Zhao, “Trustllm: Trustworthiness in large language models,” 2024.
  • [97] M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks, “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,” 2024.
  • [98] Y. Wang, H. Li, X. Han, P. Nakov, and T. Baldwin, “Do-not-answer: A dataset for evaluating safeguards in llms,” 2023.
  • [99] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, et al., “Decodingtrust: A comprehensive assessment of trustworthiness in gpt models,” arXiv preprint arXiv:2306.11698, 2023.
  • [100] Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang, “Safetybench: Evaluating the safety of large language models with multiple choice questions,” 2023.
  • [101] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “‘Do anything now’: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2023.
  • [102] X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating stealthy jailbreak prompts on aligned large language models,” arXiv preprint arXiv:2310.04451, 2023.
  • [103] S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang, A. Nenkova, and T. Sun, “Autodan: Interpretable gradient-based adversarial attacks on large language models,” 2023.
  • [104] A. Zhou, B. Li, and H. Wang, “Robust prompt optimization for defending language models against jailbreaking attacks,” arXiv preprint arXiv:2401.17263, 2024.
  • [105] X. Guo, F. Yu, H. Zhang, L. Qin, and B. Hu, “Cold-attack: Jailbreaking llms with stealthiness and controllability,” 2024.
  • [106] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?,” arXiv preprint arXiv:2307.02483, 2023.
  • [107] Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking attack against multimodal large language model,” 2024.
  • [108] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng, “A survey on hallucination in large vision-language models,” 2024.
  • [109] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou, “Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models,” 2023.
  • [110] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” 2023.
  • [111] Y. Huang, J. Shi, Y. Li, C. Fan, S. Wu, Q. Zhang, Y. Liu, P. Zhou, Y. Wan, N. Z. Gong, et al., “Metatool benchmark for large language models: Deciding whether to use tools and which to use,” arXiv preprint arXiv:2310.03128, 2023.
  • [112] F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang, “Mitigating hallucination in large multi-modal models via robust instruction tuning,” 2023.
  • [113] L. Wang, J. He, S. Li, N. Liu, and E.-P. Lim, “Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites,” in International Conference on Multimedia Modeling, pp. 32–45, Springer, 2024.
  • [114] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao, “Analyzing and mitigating object hallucination in large vision-language models,” arXiv preprint arXiv:2310.00754, 2023.
  • [115] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,” arXiv preprint arXiv:2309.00770, 2023.
  • [116] J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, and X. He, “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation,” arXiv preprint arXiv:2305.07609, 2023.
  • [117] Y. Liang, L. Cheng, A. Payani, and K. Shu, “Beyond detection: Unveiling fairness vulnerabilities in abusive language models,” 2023.
  • [118] F. Friedrich, P. Schramowski, M. Brack, L. Struppek, D. Hintersdorf, S. Luccioni, and K. Kersting, “Fair diffusion: Instructing text-to-image generation models on fairness,” arXiv preprint arXiv:2302.10893, 2023.
  • [119] R. Liu, C. Jia, J. Wei, G. Xu, L. Wang, and S. Vosoughi, “Mitigating political bias in language models through reinforced calibration,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 14857–14866, May 2021.
  • [120] R. K. Mahabadi, Y. Belinkov, and J. Henderson, “End-to-end bias mitigation by modelling biases in corpora,” 2020.
  • [121] N. Mireshghallah, H. Kim, X. Zhou, Y. Tsvetkov, M. Sap, R. Shokri, and Y. Choi, “Can llms keep a secret? testing privacy implications of language models via contextual integrity theory,” arXiv preprint arXiv:2310.17884, 2023.
  • [122] R. Plant, V. Giuffrida, and D. Gkatzia, “You are what you write: Preserving privacy in the era of large language models,” arXiv preprint arXiv:2204.09391, 2022.
  • [123] H. Li, Y. Chen, J. Luo, Y. Kang, X. Zhang, Q. Hu, C. Chan, and Y. Song, “Privacy in large language models: Attacks, defenses and future directions,” 2023.
  • [124] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang, “On the opportunities and risks of foundation models,” 2022.
  • [125] T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong, “Large language model alignment: A survey,” arXiv preprint arXiv:2309.15025, 2023.
  • [126] X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang, “Alignbench: Benchmarking chinese alignment of large language models,” 2023.
  • [127] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” 2023.
  • [128] T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, M. Sun, and T.-S. Chua, “Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,” 2023.
  • [129] M. S. Jahan and M. Oussalah, “A systematic review of hate speech automatic detection using natural language processing,” Neurocomputing, p. 126232, 2023.
  • [130] OpenAI, “Sora safety.” https://openai.com/sora#safety, 2024.
  • [131] Z. Fei, X. Shen, D. Zhu, F. Zhou, Z. Han, S. Zhang, K. Chen, Z. Shen, and J. Ge, “Lawbench: Benchmarking legal knowledge of large language models,” arXiv preprint arXiv:2309.16289, 2023.
  • [132] Y. Li, Y. Huang, Y. Lin, S. Wu, Y. Wan, and L. Sun, “I think, therefore I am: Benchmarking awareness of large language models using awarebench,” 2024.
  • [133] J. Zhu, H. Yang, H. He, W. Wang, Z. Tuo, W.-H. Cheng, L. Gao, J. Song, and J. Fu, “Moviefactory: Automatic movie creation from text using large generative models for language and images,” arXiv preprint arXiv:2306.07257, 2023.
  • [134] J. Zhu, H. Yang, W. Wang, H. He, Z. Tuo, Y. Yu, W.-H. Cheng, L. Gao, J. Song, J. Fu, et al., “Mobilevidfactory: Automatic diffusion-based social media video generation for mobile devices from text,” in Proceedings of the 31st ACM International Conference on Multimedia, pp. 9371–9373, 2023.
  • [135] S. Zhuang, K. Li, X. Chen, Y. Wang, Z. Liu, Y. Qiao, and Y. Wang, “Vlogger: Make your dream a vlog,” arXiv preprint arXiv:2401.09414, 2024.
  • [136] R. Feng, W. Weng, Y. Wang, Y. Yuan, J. Bao, C. Luo, Z. Chen, and B. Guo, “Ccedit: Creative and controllable video editing via diffusion models,” arXiv preprint arXiv:2309.16496, 2023.
  • [137] J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang, et al., “Make-your-video: Customized video generation using textual and structural guidance,” arXiv preprint arXiv:2306.00943, 2023.
  • [138] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023.
  • [139] Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan, et al., “Animate-a-story: Storytelling with retrieval-augmented video generation,” arXiv preprint arXiv:2307.06940, 2023.
  • [140] H. Ni, C. Shi, K. Li, S. X. Huang, and M. R. Min, “Conditional image-to-video generation with latent flow diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444–18455, 2023.
  • [141] L. Hu, X. Gao, P. Zhang, K. Sun, B. Zhang, and L. Bo, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” arXiv preprint arXiv:2311.17117, 2023.
  • [142] Y. Hu, C. Luo, and Z. Chen, “Make it move: controllable image-to-video generation with text descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18219–18228, 2022.
  • [143] K. Mei and V. Patel, “Vidm: Video implicit diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 9117–9125, 2023.
  • [144] S. Yu, K. Sohn, S. Kim, and J. Shin, “Video probabilistic diffusion models in projected latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456–18466, 2023.
  • [145] K. Su, K. Qian, E. Shlizerman, A. Torralba, and C. Gan, “Physics-driven diffusion models for impact sound synthesis from videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9749–9759, 2023.
  • [146] S. Li, W. Dong, Y. Zhang, F. Tang, C. Ma, O. Deussen, T.-Y. Lee, and C. Xu, “Dance-to-music generation with encoder-based textual inversion of diffusion models,” arXiv preprint arXiv:2401.17800, 2024.
  • [147] A. Awasthi, J. Nizam, S. Zare, S. Ahmad, M. J. Montalvo, N. Varadarajan, B. Roysam, and H. V. Nguyen, “Video diffusion models for the apoptosis forecasting,” bioRxiv, pp. 2023–11, 2023.
  • [148] A. Bozorgpour, Y. Sadegheih, A. Kazerouni, R. Azad, and D. Merhof, “Dermosegdiff: A boundary-aware segmentation diffusion model for skin lesion delineation,” in International Workshop on PRedictive Intelligence In MEdicine, pp. 146–158, Springer, 2023.
  • [149] A. Flaborea, L. Collorone, G. M. D. di Melendugno, S. D’Arrigo, B. Prenkaj, and F. Galasso, “Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10318–10329, 2023.
  • [150] J. Wu, R. Fu, H. Fang, Y. Zhang, and Y. Xu, “Medsegdiff-v2: Diffusion based medical image segmentation with transformer,” arXiv preprint arXiv:2301.11798, 2023.
  • [151] G. J. Chowdary and Z. Yin, “Diffusion transformer u-net for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 622–631, Springer, 2023.
  • [152] I. Kapelyukh, V. Vosylius, and E. Johns, “Dall-e-bot: Introducing web-scale diffusion models to robotics,” IEEE Robotics and Automation Letters, 2023.
  • [153] W. Liu, T. Hermans, S. Chernova, and C. Paxton, “Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects,” in Workshop on Language and Robotics at CoRL 2022, 2022.
  • [154] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” arXiv preprint arXiv:2205.09991, 2022.
  • [155] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, “Is conditional generative modeling all you need for decision-making?,” arXiv preprint arXiv:2211.15657, 2022.
  • [156] J. Carvalho, A. T. Le, M. Baierl, D. Koert, and J. Peters, “Motion planning diffusion: Learning and planning of robot motions with diffusion models,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1916–1923, IEEE, 2023.
  • [157] X. Gu, C. Wen, J. Song, and Y. Gao, “Seer: Language instructed video prediction with latent diffusion models,” arXiv preprint arXiv:2303.14897, 2023.
  • [158] Z. Chen, S. Kiami, A. Gupta, and V. Kumar, “Genaug: Retargeting behaviors to unseen situations via generative augmentation,” arXiv preprint arXiv:2302.06671, 2023.
  • [159] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar, “Cacti: A framework for scalable multi-task multi-scene visual imitation learning,” arXiv preprint arXiv:2212.05711, 2022.
  • [160] T. Chen, L. Li, S. Saxena, G. Hinton, and D. J. Fleet, “A generalist framework for panoptic segmentation of images and videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 909–919, 2023.
  • [161] W. Harvey, S. Naderiparizi, V. Masrani, C. Weilbach, and F. Wood, “Flexible diffusion modeling of long videos,” Advances in Neural Information Processing Systems, vol. 35, pp. 27953–27965, 2022.
  • [162] A. Gupta, S. Tian, Y. Zhang, J. Wu, R. Martín-Martín, and L. Fei-Fei, “Maskvit: Masked visual pre-training for video prediction,” arXiv preprint arXiv:2206.11894, 2022.
  • [163] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “Cogvideo: Large-scale pretraining for text-to-video generation via transformers,” arXiv preprint arXiv:2205.15868, 2022.
  • [164] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022.
  • [165] D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2022.
  • [166] S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh, “Long video generation with time-agnostic vqgan and time-sensitive transformer,” in European Conference on Computer Vision, pp. 102–118, Springer, 2022.
  • [167] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan, “Phenaki: Variable length video generation from open domain textual description,” arXiv preprint arXiv:2210.02399, 2022.
  • [168] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356, 2023.
  • [169] L. Khachatryan, A. Movsisyan, V. Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2video-zero: Text-to-image diffusion models are zero-shot video generators,” arXiv preprint arXiv:2303.13439, 2023.
  • [170] Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan, “Videofusion: Decomposed diffusion models for high-quality video generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10209–10218, 2023.
  • [171] A. Jabri, D. Fleet, and T. Chen, “Scalable adaptive computation for iterative generation,” arXiv preprint arXiv:2212.11972, 2022.
  • [172] L. Lian, B. Shi, A. Yala, T. Darrell, and B. Li, “Llm-grounded video diffusion models,” arXiv preprint arXiv:2309.17444, 2023.
  • [173] E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, and Y. Hoshen, “Dreamix: Video diffusion models are general video editors,” arXiv preprint arXiv:2302.01329, 2023.
  • [174] J. H. Liew, H. Yan, J. Zhang, Z. Xu, and J. Feng, “Magicedit: High-fidelity and temporally coherent video editing,” arXiv preprint arXiv:2308.14749, 2023.
  • [175] W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin, “Control-a-video: Controllable text-to-video generation with diffusion models,” arXiv preprint arXiv:2305.13840, 2023.
  • [176] W. Chai, X. Guo, G. Wang, and Y. Lu, “Stablevideo: Text-driven consistency-aware diffusion video editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23040–23050, 2023.
  • [177] S. Yang, Y. Zhou, Z. Liu, and C. C. Loy, “Rerender a video: Zero-shot text-guided video-to-video translation,” arXiv preprint arXiv:2306.07954, 2023.
  • [178] D. Ceylan, C.-H. P. Huang, and N. J. Mitra, “Pix2video: Video editing using image diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23206–23217, 2023.
  • [179] B. Qin, J. Li, S. Tang, T.-S. Chua, and Y. Zhuang, “Instructvid2vid: Controllable video editing with natural language instructions,” arXiv preprint arXiv:2305.12328, 2023.
  • [180] D. Liu, Q. Li, A.-D. Dinh, T. Jiang, M. Shah, and C. Xu, “Diffusion action segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10139–10149, 2023.
  • [181] R. Feng, Y. Gao, T. H. E. Tse, X. Ma, and H. J. Chang, “Diffpose: Spatiotemporal diffusion model for video-based human pose estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14861–14872, 2023.
  • [182] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, et al., “Magvit: Masked generative video transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10459–10469, 2023.
  • [183] Z. Li, R. Tucker, N. Snavely, and A. Holynski, “Generative image dynamics,” arXiv preprint arXiv:2309.07906, 2023.
  • [184] EasyWithAI, “Zeroscope - ai text-to-video model.” https://easywithai.com/ai-video-generators/zeroscope/, 2023.
  • [185] R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, “Emu video: Factorizing text-to-video generation by explicit image conditioning,” arXiv preprint arXiv:2311.10709, 2023.
  • [186] Y. Zeng, G. Wei, J. Zheng, J. Zou, Y. Wei, Y. Zhang, and H. Li, “Make pixels dance: High-dynamic video generation,” arXiv preprint arXiv:2311.10982, 2023.
  • [187] A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, L. Fei-Fei, I. Essa, L. Jiang, and J. Lezama, “Photorealistic video generation with diffusion models,” arXiv preprint arXiv:2312.06662, 2023.
  • [188] B. Wu, C.-Y. Chuang, X. Wang, Y. Jia, K. Krishnakumar, T. Xiao, F. Liang, L. Yu, and P. Vajda, “Fairy: Fast parallelized instruction-guided video-to-video synthesis,” arXiv preprint arXiv:2312.13834, 2023.
  • [189] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar, et al., “Videopoet: A large language model for zero-shot video generation,” arXiv preprint arXiv:2312.14125, 2023.
  • [190] J. Wu, X. Li, C. Si, S. Zhou, J. Yang, J. Zhang, Y. Li, K. Chen, Y. Tong, Z. Liu, et al., “Towards language-driven video inpainting via multimodal large language models,” arXiv preprint arXiv:2401.10226, 2024.
  • [191] O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, Y. Li, T. Michaeli, et al., “Lumiere: A space-time diffusion model for video generation,” arXiv preprint arXiv:2401.12945, 2024.

Appendix A Related Works

We summarize related works on video generation tasks in Table 1.

Table 1: Summary of Video Generation.
Model name | Year | Backbone | Task | Group
Imagen Video[29] | 2022 | Diffusion | Generation | Google
Pix2Seq-D[160] | 2022 | Diffusion | Segmentation | Google Deepmind
FDM[161] | 2022 | Diffusion | Prediction | UBC
MaskViT[162] | 2022 | Masked Vision Models | Prediction | Stanford, Salesforce
CogVideo[163] | 2022 | Auto-regressive | Generation | THU
Make-a-video[164] | 2022 | Diffusion | Generation | Meta
MagicVideo[165] | 2022 | Diffusion | Generation | ByteDance
TATS[166] | 2022 | Auto-regressive | Generation | University of Maryland, Meta
Phenaki[167] | 2022 | Masked Vision Models | Generation | Google Brain
Gen-1[168] | 2023 | Diffusion | Generation, Editing | RunwayML
LFDM[140] | 2023 | Diffusion | Generation | PSU, UCSD
Text2video-Zero[169] | 2023 | Diffusion | Generation | Picsart
VideoFusion[170] | 2023 | Diffusion | Generation | UCAS, Alibaba
PYoCo[34] | 2023 | Diffusion | Generation | Nvidia
Video LDM[36] | 2023 | Diffusion | Generation | University of Maryland, Nvidia
RIN[171] | 2023 | Diffusion | Generation | Google Brain
LVD[172] | 2023 | Diffusion | Generation | UCB
Dreamix[173] | 2023 | Diffusion | Editing | Google
MagicEdit[174] | 2023 | Diffusion | Editing | ByteDance
Control-A-Video[175] | 2023 | Diffusion | Editing | Sun Yat-Sen University
StableVideo[176] | 2023 | Diffusion | Editing | ZJU, MSRA
Tune-A-Video[78] | 2023 | Diffusion | Editing | NUS
Rerender-A-Video[177] | 2023 | Diffusion | Editing | NTU
Pix2Video[178] | 2023 | Diffusion | Editing | Adobe, UCL
InstructVid2Vid[179] | 2023 | Diffusion | Editing | ZJU
DiffAct[180] | 2023 | Diffusion | Action Detection | University of Sydney
DiffPose[181] | 2023 | Diffusion | Pose Estimation | Jilin University
MAGVIT[182] | 2023 | Masked Vision Models | Generation | Google
AnimateDiff[138] | 2023 | Diffusion | Generation | CUHK
MAGVIT V2[47] | 2023 | Masked Vision Models | Generation | Google
Generative Dynamics[183] | 2023 | Diffusion | Generation | Google
VideoCrafter[81] | 2023 | Diffusion | Generation | Tencent
Zeroscope[184] | 2023 | - | Generation | EasyWithAI
ModelScope | 2023 | - | Generation | Damo
Gen-2[23] | 2023 | - | Generation | RunwayML
Pika[22] | 2023 | - | Generation | Pika Labs
Emu Video[185] | 2023 | Diffusion | Generation | Meta
PixelDance[186] | 2023 | Diffusion | Generation | ByteDance
Stable Video Diffusion[27] | 2023 | Diffusion | Generation | Stability AI
W.A.L.T[187] | 2023 | Diffusion | Generation | Stanford, Google
Fairy[188] | 2023 | Diffusion | Generation, Editing | Meta
VideoPoet[189] | 2023 | Auto-regressive | Generation, Editing | Google
LGVI[190] | 2024 | Diffusion | Editing | PKU, NTU
Lumiere[191] | 2024 | Diffusion | Generation | Google
Sora[3] | 2024 | Diffusion | Generation, Editing | OpenAI