
The Annotated Diffusion Model

Published June 7, 2022

In this blog post, we'll take a deeper look into Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models or simply autoencoders) as researchers have been able to achieve remarkable results with them for (un)conditional image/audio/video generation. Popular examples (at the time of writing) include GLIDE and DALL-E 2 by OpenAI, Latent Diffusion by the University of Heidelberg and Imagen by Google Brain.

We'll go over the original DDPM paper by (Ho et al., 2020), implementing it step-by-step in PyTorch, based on Phil Wang's implementation, which is itself based on the original TensorFlow implementation. Note that the idea of diffusion for generative modeling was actually already introduced in (Sohl-Dickstein et al., 2015). However, it took (Song et al., 2019) (at Stanford University), and then (Ho et al., 2020) (at Google Brain), to independently improve the approach.

Note that there are several perspectives on diffusion models. Here, we employ the discrete-time (latent variable model) perspective, but be sure to check out the other perspectives as well.

Alright, let's dive in!

from IPython.display import Image
Image(filename='assets/78_annotated-diffusion/ddpm_paper.png')

We'll install and import the required libraries first (assuming you have PyTorch installed).

!pip install -q -U einops datasets matplotlib tqdm

import math
from inspect import isfunction
from functools import partial

%matplotlib inline
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from einops import rearrange, reduce
from einops.layers.torch import Rearrange

import torch
from torch import nn, einsum
import torch.nn.functional as F

What is a diffusion model?

A (denoising) diffusion model isn't that complex if you compare it to other generative models such as Normalizing Flows, GANs or VAEs: they all convert noise from some simple distribution to a data sample. This is also the case here where a neural network learns to gradually denoise data starting from pure noise.

In a bit more detail for images, the set-up consists of 2 processes:

  • a fixed (or predefined) forward diffusion process $q$ of our choosing, that gradually adds Gaussian noise to an image, until you end up with pure noise
  • a learned reverse denoising diffusion process $p_\theta$, where a neural network is trained to gradually denoise an image starting from pure noise, until you end up with an actual image.

Both the forward and reverse process indexed by $t$ happen for some number of finite time steps $T$ (the DDPM authors use $T=1000$). You start with $t=0$ where you sample a real image $\mathbf{x}_0$ from your data distribution (let's say an image of a cat from ImageNet), and the forward process samples some noise from a Gaussian distribution at each time step $t$, which is added to the image of the previous time step. Given a sufficiently large $T$ and a well-behaved schedule for adding noise at each time step, you end up with what is called an isotropic Gaussian distribution at $t=T$ via a gradual process.
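To make this gradual corruption concrete, here is a minimal sketch of a full forward trajectory, using the per-step Gaussian update that is derived formally in the next section. The linear schedule endpoints ($10^{-4}$ to $0.02$) are assumed here for illustration:

```python
import torch

torch.manual_seed(0)

T = 1000                                  # number of time steps, as in the DDPM paper
betas = torch.linspace(1e-4, 0.02, T)     # assumed linear variance schedule

x = 2.0 * torch.ones(10_000)              # stand-in for a batch of pixel values x_0
for t in range(T):
    eps = torch.randn_like(x)             # eps ~ N(0, I)
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * eps

# after T steps, x is (approximately) distributed as a standard Gaussian
print(x.mean().item(), x.std().item())
```

The signal $\mathbf{x}_0$ is shrunk a little at every step while fresh noise is mixed in, so after $T$ steps essentially nothing of the original values survives.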

In more mathematical form

Let's write this down more formally, as ultimately we need a tractable loss function which our neural network needs to optimize.

Let $q(\mathbf{x}_0)$ be the real data distribution, say of "real images". We can sample from this distribution to get an image, $\mathbf{x}_0 \sim q(\mathbf{x}_0)$. We define the forward diffusion process $q(\mathbf{x}_t | \mathbf{x}_{t-1})$ which adds Gaussian noise at each time step $t$, according to a known variance schedule $0 < \beta_1 < \beta_2 < ... < \beta_T < 1$ as

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}).$$

Recall that a normal distribution (also called Gaussian distribution) is defined by 2 parameters: a mean $\mu$ and a variance $\sigma^2 \geq 0$. Basically, each new (slightly noisier) image at time step $t$ is drawn from a conditional Gaussian distribution with $\mathbf{\mu}_t = \sqrt{1 - \beta_t} \mathbf{x}_{t-1}$ and $\sigma^2_t = \beta_t$, which we can do by sampling $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and then setting $\mathbf{x}_t = \sqrt{1 - \beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \mathbf{\epsilon}$.
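This reparameterization translates directly into code. A minimal sketch of a single forward step (the function name `q_sample_step` is just for illustration, not from the paper's codebase):

```python
import math
import torch

def q_sample_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)        # eps ~ N(0, I)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * eps

# starting from a "clean" all-zeros tensor, one step adds noise with variance beta_t
x_t = q_sample_step(torch.zeros(10_000), beta_t=0.02)
```

Note that only $\mathbf{\epsilon}$ is random; the mean and variance of the step are fully determined by $\mathbf{x}_{t-1}$ and the schedule value $\beta_t$.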

Note that the $\beta_t$ aren't constant at each time step $t$ (hence the subscript) --- in fact one defines a so-called "variance schedule", which can be linear, quadratic, cosine, etc. as we will see further (a bit like a learning rate schedule).
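As a preview, two such schedules can be sketched as follows. The linear endpoints $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$ are the values used in the DDPM paper; the quadratic variant shown is one common alternative, assumed here for illustration:

```python
import torch

def linear_beta_schedule(timesteps: int) -> torch.Tensor:
    # linearly increasing variances, from beta_1 = 1e-4 to beta_T = 0.02
    return torch.linspace(1e-4, 0.02, timesteps)

def quadratic_beta_schedule(timesteps: int) -> torch.Tensor:
    # interpolate linearly in sqrt(beta) space, then square,
    # so early time steps get smaller betas (less noise added at first)
    return torch.linspace(1e-4 ** 0.5, 0.02 ** 0.5, timesteps) ** 2

betas = linear_beta_schedule(1000)        # one beta per time step t = 1, ..., T
```

Either way, the schedule is fixed before training; only the reverse process $p_\theta$ has learned parameters.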