Figure 1: Generated samples on CelebA-HQ $256\times 256$ (left) and unconditional CIFAR10 (right)

Denoising Diffusion Probabilistic Models

Jonathan Ho
UC Berkeley
jonathanho@berkeley.edu

Ajay Jain
UC Berkeley
ajayj@berkeley.edu

Pieter Abbeel
UC Berkeley
pabbeel@cs.berkeley.edu

Abstract

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.

1 Introduction

Deep generative models of all kinds have recently exhibited high quality samples in a wide variety of data modalities. Generative adversarial networks (GANs), autoregressive models, flows, and variational autoencoders (VAEs) have synthesized striking image and audio samples [14, 27, 3, 58, 38, 25, 10, 32, 44, 57, 26, 33, 45], and there have been remarkable advances in energy-based modeling and score matching that have produced images comparable to those of GANs [11, 55].

This paper presents progress in diffusion probabilistic models [53]. A diffusion probabilistic model (which we will call a “diffusion model” for brevity) is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time. Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until signal is destroyed. When the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization.

Diffusion models are straightforward to define and efficient to train, but to the best of our knowledge, there has been no demonstration that they are capable of generating high quality samples. We show that diffusion models actually are capable of generating high quality samples, sometimes better than the published results on other types of generative models (Section 4). In addition, we show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling (Section 3.2) [55, 61]. We obtained our best sample quality results using this parameterization (Section 4.2), so we consider this equivalence to be one of our primary contributions.

Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models (our models do, however, have log likelihoods better than the large estimates annealed importance sampling has been reported to produce for energy based models and score matching [11, 55]). We find that the majority of our models’ lossless codelengths are consumed to describe imperceptible image details (Section 4.3). We present a more refined analysis of this phenomenon in the language of lossy compression, and we show that the sampling procedure of diffusion models is a type of progressive decoding that resembles autoregressive decoding along a bit ordering that vastly generalizes what is normally possible with autoregressive models.

2 Background

Diffusion models [53] are latent variable models of the form $p_{\theta}(\mathbf{x}_{0})\coloneqq\int p_{\theta}(\mathbf{x}_{0:T})\,d\mathbf{x}_{1:T}$ , where $\mathbf{x}_{1},\dotsc,\mathbf{x}_{T}$ are latents of the same dimensionality as the data $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$ . The joint distribution $p_{\theta}(\mathbf{x}_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_{T})=\mathcal{N}(\mathbf{x}_{T};\mathbf{0},\mathbf{I})$ :

$$p_{\theta}(\mathbf{x}_{0:T})\coloneqq p(\mathbf{x}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}),\qquad p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\coloneqq\mathcal{N}(\mathbf{x}_{t-1};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t),{\boldsymbol{\Sigma}}_{\theta}(\mathbf{x}_{t},t)) \tag{1}$$

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$ , called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_{1},\dotsc,\beta_{T}$ :

$$q(\mathbf{x}_{1:T}|\mathbf{x}_{0})\coloneqq\prod_{t=1}^{T}q(\mathbf{x}_{t}|\mathbf{x}_{t-1}),\qquad q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\coloneqq\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\mathbf{x}_{t-1},\beta_{t}\mathbf{I}) \tag{2}$$

Training is performed by optimizing the usual variational bound on negative log likelihood:

$$\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{0})\right]\leq\mathbb{E}_{q}\!\left[-\log\frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right]=\mathbb{E}_{q}\bigg[-\log p(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\bigg]\eqqcolon L \tag{3}$$

The forward process variances $\beta_{t}$ can be learned by reparameterization [33] or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ , because both processes have the same functional form when $\beta_{t}$ are small [53]. A notable property of the forward process is that it admits sampling $\mathbf{x}_{t}$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_{t}\coloneqq 1-\beta_{t}$ and $\bar{\alpha}_{t}\coloneqq\prod_{s=1}^{t}\alpha_{s}$ , we have

$$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}) \tag{4}$$
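The closed form in Eq. 4 can be illustrated with a minimal sketch in plain Python. The helper names `alpha_bars` and `q_sample`, the three-step schedule, and the four-dimensional input are hypothetical choices made for illustration only; they are not taken from the paper or its released code.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step via Eq. 4; t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

# Hypothetical toy setup: a 3-step schedule and a 4-dimensional data point.
betas = [0.1, 0.2, 0.3]
abar = alpha_bars(betas)
x0 = [1.0, -1.0, 0.5, 0.0]
x_t, eps = q_sample(x0, t=3, abar=abar, rng=random.Random(0))
```

Because `q_sample` returns the noise it used, $\mathbf{x}_{0}$ can be recovered from $(\mathbf{x}_{t},{\boldsymbol{\epsilon}})$ by inverting the same affine map, which is the reparameterization used later in Section 3.2.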

Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ (Eq. 3) as:

$$\mathbb{E}_{q}\bigg[\underbrace{D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T}|\mathbf{x}_{0})~\|~p(\mathbf{x}_{T})\right)}_{L_{T}}+\sum_{t>1}\underbrace{D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})~\|~p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\right)}_{L_{t-1}}\underbrace{-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}_{L_{0}}\bigg] \tag{5}$$

(See Appendix A for details. The labels on the terms are used in Section 3.) Equation 5 uses KL divergence to directly compare $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ against forward process posteriors, which are tractable when conditioned on $\mathbf{x}_{0}$ :

$$q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t-1};\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0}),\tilde{\beta}_{t}\mathbf{I}), \tag{6}$$
$$\text{where}\quad\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0})\coloneqq\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}\mathbf{x}_{0}+\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}\mathbf{x}_{t}\quad\text{and}\quad\tilde{\beta}_{t}\coloneqq\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t} \tag{7}$$

Consequently, all KL divergences in Eq. 5 are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
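As a rough illustration of why these terms are cheap to evaluate, the sketch below computes the forward process posterior of Eqs. 6 and 7 and the standard closed-form KL divergence between two isotropic Gaussians. The function names and argument conventions are assumptions made here for illustration, and the KL formula is the textbook expression rather than anything specific to the paper.

```python
import math

def posterior_params(x_t, x0, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from Eqs. 6-7.

    abar_prev is alpha_bar_{t-1}; pass 1.0 when t = 1 (empty product).
    """
    beta_t = 1.0 - alpha_t
    c0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)          # weight on x_0
    ct = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)  # weight on x_t
    mean = [c0 * a + ct * b for a, b in zip(x0, x_t)]
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t             # beta_tilde_t
    return mean, var

def kl_isotropic(mu1, var1, mu2, var2):
    """KL( N(mu1, var1*I) || N(mu2, var2*I) ) in nats; requires var1, var2 > 0."""
    d = len(mu1)
    sq = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    return 0.5 * (d * var1 / var2 + sq / var2 - d + d * math.log(var2 / var1))
```

Note that the posterior variance is zero at $t=1$, so `kl_isotropic` does not apply there; that case is the $L_{0}$ term, which the paper treats separately.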

3 Diffusion models and denoising autoencoders

Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances $\beta_{t}$ of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is categorized by the terms of Eq. 5.

3.1 Forward process and $L_{T}$

We ignore the fact that the forward process variances $\beta_{t}$ are learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_{T}$ is a constant during training and can be ignored.

3.2 Reverse process and $L_{1:T-1}$

Now we discuss our choices in $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t),{\boldsymbol{\Sigma}}_{\theta}(\mathbf{x}_{t},t))$ for ${1<t\leq T}$ . First, we set ${\boldsymbol{\Sigma}}_{\theta}(\mathbf{x}_{t},t)=\sigma_{t}^{2}\mathbf{I}$ to untrained time dependent constants. Experimentally, both $\sigma_{t}^{2}=\beta_{t}$ and $\sigma_{t}^{2}=\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ had similar results. The first choice is optimal for $\mathbf{x}_{0}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ , and the second is optimal for $\mathbf{x}_{0}$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance [53].

Second, to represent the mean ${\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)$ , we propose a specific parameterization motivated by the following analysis of $L_{t}$ . With $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t),\sigma_{t}^{2}\mathbf{I})$ , we can write:

$$L_{t-1}=\mathbb{E}_{q}\!\left[\frac{1}{2\sigma_{t}^{2}}\|\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0})-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)\|^{2}\right]+C \tag{8}$$

where $C$ is a constant that does not depend on $\theta$ . So, we see that the most straightforward parameterization of ${\boldsymbol{\mu}}_{\theta}$ is a model that predicts $\tilde{\boldsymbol{\mu}}_{t}$ , the forward process posterior mean. However, we can expand Eq. 8 further by reparameterizing Eq. 4 as $\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}$ for ${\boldsymbol{\epsilon}}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and applying the forward process posterior formula (Eq. 7):

$$L_{t-1}-C=\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\!\left[\frac{1}{2\sigma_{t}^{2}}\left\|\tilde{\boldsymbol{\mu}}_{t}\!\left(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),\frac{1}{\sqrt{\bar{\alpha}_{t}}}(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}})\right)-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),t)\right\|^{2}\right] \tag{9}$$
$$=\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\!\left[\frac{1}{2\sigma_{t}^{2}}\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}\right)-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),t)\right\|^{2}\right] \tag{10}$$
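The step from Eq. 9 to Eq. 10 is stated without intermediate algebra. One way to fill it in, using only $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\alpha_{t}\bar{\alpha}_{t-1}$ from Section 2, is sketched below; it substitutes $\mathbf{x}_{0}=(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}})/\sqrt{\bar{\alpha}_{t}}$ into Eq. 7.

```latex
\begin{aligned}
\tilde{\boldsymbol{\mu}}_{t}
 &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_{t}}{1-\bar{\alpha}_{t}}
    \cdot\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_{t}}}
    +\frac{\sqrt{\alpha_{t}}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}\,\mathbf{x}_{t} \\
 &= \frac{\beta_{t}+\alpha_{t}(1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_{t}}\,(1-\bar{\alpha}_{t})}\,\mathbf{x}_{t}
    -\frac{\beta_{t}}{\sqrt{\alpha_{t}}\sqrt{1-\bar{\alpha}_{t}}}\,\boldsymbol{\epsilon}
    \qquad\text{since } \sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_{t}} = 1/\sqrt{\alpha_{t}} \\
 &= \frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}
    -\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\boldsymbol{\epsilon}\right)
    \qquad\text{since } \beta_{t}+\alpha_{t}-\alpha_{t}\bar{\alpha}_{t-1} = 1-\bar{\alpha}_{t}.
\end{aligned}
```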

Equation 10 reveals that ${\boldsymbol{\mu}}_{\theta}$ must predict $\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}\right)$ given $\mathbf{x}_{t}$ . Since $\mathbf{x}_{t}$ is available as input to the model, we may choose the parameterization

$${\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)=\tilde{\boldsymbol{\mu}}_{t}\!\left(\mathbf{x}_{t},\frac{1}{\sqrt{\bar{\alpha}_{t}}}(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t))\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t)\right) \tag{11}$$

where ${\boldsymbol{\epsilon}}_{\theta}$ is a function approximator intended to predict ${\boldsymbol{\epsilon}}$ from $\mathbf{x}_{t}$ . To sample $\mathbf{x}_{t-1}\sim p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ is to compute $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t)\right)+\sigma_{t}\mathbf{z}$ , where $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ . The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with ${\boldsymbol{\epsilon}}_{\theta}$ as a learned gradient of the data density. Furthermore, with the parameterization in Eq. 11, Eq. 10 simplifies to:

$$\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\!\left[\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t})}\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}},t)\right\|^{2}\right] \tag{12}$$

which resembles denoising score matching over multiple noise scales indexed by $t$ [55]. As Eq. 12 is equal to (one term of) the variational bound for the Langevin-like reverse process (Eq. 11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.

To summarize, we can train the reverse process mean function approximator ${\boldsymbol{\mu}}_{\theta}$ to predict $\tilde{\boldsymbol{\mu}}_{t}$ , or by modifying its parameterization, we can train it to predict ${\boldsymbol{\epsilon}}$ . (There is also the possibility of predicting $\mathbf{x}_{0}$ , but we found this to lead to worse sample quality early in our experiments.) We have shown that the ${\boldsymbol{\epsilon}}$ -prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model’s variational bound to an objective that resembles denoising score matching. Nonetheless, it is just another parameterization of $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ , so we verify its effectiveness in Section 4 in an ablation where we compare predicting ${\boldsymbol{\epsilon}}$ against predicting $\tilde{\boldsymbol{\mu}}_{t}$ .

Algorithm 1 Training

1: repeat
2:   $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$
3:   $t\sim\mathrm{Uniform}(\{1,\dotsc,T\})$
4:   ${\boldsymbol{\epsilon}}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
5:   Take gradient descent step on
       $\nabla_{\theta}\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}},t)\right\|^{2}$
6: until converged
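The inner part of Algorithm 1 (drawing $t$ and ${\boldsymbol{\epsilon}}$, forming $\mathbf{x}_{t}$, and evaluating the squared error) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `l_simple_term` and `oracle` are hypothetical names, `eps_model` stands in for the network ${\boldsymbol{\epsilon}}_{\theta}$, and the gradient step itself is omitted because it depends on the autodiff framework in use.

```python
import math
import random

def l_simple_term(x0, eps_model, abar, rng):
    """One Monte Carlo sample of the Algorithm 1 loss (no gradient step)."""
    t = rng.randint(1, len(abar))                 # t ~ Uniform({1, ..., T})
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # eps ~ N(0, I)
    a = abar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)                      # stand-in for eps_theta(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

# Linear variance schedule with the constants quoted in Section 4.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
abar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    abar.append(running)

# Toy sanity check: if every data point is x0 = 0, then x_t = sqrt(1 - abar_t) * eps,
# so the noise is recoverable exactly and the loss of this oracle vanishes.
def oracle(x_t, t):
    return [v / math.sqrt(1.0 - abar[t - 1]) for v in x_t]

loss = l_simple_term([0.0, 0.0], oracle, abar, random.Random(0))
```

In practice `eps_model` would be a trained network and the returned value would be averaged over a minibatch before differentiation.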

Algorithm 2 Sampling

1: $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
2: for $t=T,\dotsc,1$ do
3:   $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ if $t>1$ , else $\mathbf{z}=\mathbf{0}$
4:   $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t)\right)+\sigma_{t}\mathbf{z}$
5: end for
6: return $\mathbf{x}_{0}$
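A direct transcription of Algorithm 2 into plain Python might look like the sketch below. The function `sample`, its `variance` switch between the two choices of $\sigma_{t}^{2}$ discussed above, and the toy `exact_eps` predictor are all assumptions introduced for illustration; a real model would replace `exact_eps` with a trained network.

```python
import math
import random

def sample(eps_model, betas, dim, rng, variance="beta_tilde"):
    """Ancestral sampling following Algorithm 2 with a caller-supplied eps_model(x_t, t)."""
    alphas = [1.0 - b for b in betas]
    abar, running = [], 1.0
    for a in alphas:
        running *= a
        abar.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]              # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        beta, alpha, ab = betas[t - 1], alphas[t - 1], abar[t - 1]
        ab_prev = abar[t - 2] if t > 1 else 1.0
        # sigma_t^2 is either beta_t or beta_tilde_t (Section 3.2).
        var = beta if variance == "beta" else (1.0 - ab_prev) / (1.0 - ab) * beta
        eps = eps_model(x, t)
        z = [rng.gauss(0.0, 1.0) if t > 1 else 0.0 for _ in range(dim)]
        x = [(xi - beta / math.sqrt(1.0 - ab) * ei) / math.sqrt(alpha) + math.sqrt(var) * zi
             for xi, ei, zi in zip(x, eps, z)]
    return x

# Toy check: when the data distribution is a single point c, the noise in x_t
# is known exactly, so the sampler should return c.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
c = [0.3, -0.7]

def exact_eps(x_t, t):
    ab = math.prod(1.0 - b for b in betas[:t])
    return [(xi - math.sqrt(ab) * ci) / math.sqrt(1.0 - ab) for xi, ci in zip(x_t, c)]

x0 = sample(exact_eps, betas, dim=2, rng=random.Random(0))
```

The point-mass case is useful as a unit test because the final step has $\mathbf{z}=\mathbf{0}$ and $\beta_{1}/(1-\bar{\alpha}_{1})=1$, so the output does not depend on the random draws.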

3.3 Data scaling, reverse process decoder, and $L_{0}$

We assume that image data consists of integers in $\{0,1,\dotsc,255\}$ scaled linearly to $[-1,1]$ . This ensures that the neural network reverse process operates on consistently scaled inputs starting from the standard normal prior $p(\mathbf{x}_{T})$ . To obtain discrete log likelihoods, we set the last term of the reverse process to an independent discrete decoder derived from the Gaussian $\mathcal{N}(\mathbf{x}_{0};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{1},1),\sigma_{1}^{2}\mathbf{I})$ :

$$\begin{split}p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})&=\prod_{i=1}^{D}\int_{\delta_{-}(x_{0}^{i})}^{\delta_{+}(x_{0}^{i})}\mathcal{N}(x;\mu_{\theta}^{i}(\mathbf{x}_{1},1),\sigma_{1}^{2})\,dx\\ \delta_{+}(x)&=\begin{cases}\infty&\text{if}\ x=1\\ x+\frac{1}{255}&\text{if}\ x<1\end{cases}\qquad\delta_{-}(x)=\begin{cases}-\infty&\text{if}\ x=-1\\ x-\frac{1}{255}&\text{if}\ x>-1\end{cases}\end{split} \tag{13}$$

where $D$ is the data dimensionality and the $i$ superscript indicates extraction of one coordinate. (It would be straightforward to instead incorporate a more powerful decoder like a conditional autoregressive model, but we leave that to future work.) Similar to the discretized continuous distributions used in VAE decoders and autoregressive models [34, 52], our choice here ensures that the variational bound is a lossless codelength of discrete data, without need of adding noise to the data or incorporating the Jacobian of the scaling operation into the log likelihood. At the end of sampling, we display ${\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{1},1)$ noiselessly.
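The discrete decoder of Eq. 13 can be sketched with the Gaussian CDF from the standard library. The names `normal_cdf`, `decoder_prob`, and `decoder_log_likelihood` are hypothetical, and the floor of `1e-12` inside the logarithm is an assumption added here for numerical safety, not something stated in the text.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a scalar Gaussian N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def decoder_prob(x, mu, sigma):
    """Probability of one scaled pixel value x in {-1, -1 + 2/255, ..., 1} under Eq. 13."""
    upper = 1.0 if x >= 1.0 else normal_cdf(x + 1.0 / 255.0, mu, sigma)   # delta_+
    lower = 0.0 if x <= -1.0 else normal_cdf(x - 1.0 / 255.0, mu, sigma)  # delta_-
    return upper - lower

def decoder_log_likelihood(x0, mu, sigma):
    """log p(x_0 | x_1) in nats, summed over the D coordinates."""
    return sum(math.log(max(decoder_prob(x, m, sigma), 1e-12)) for x, m in zip(x0, mu))

# The 256 integer levels scaled linearly to [-1, 1].
levels = [k / 127.5 - 1.0 for k in range(256)]
total = sum(decoder_prob(x, 0.1, 0.05) for x in levels)
```

Because adjacent bins share an edge and the two outermost bins extend to infinity, the probabilities over all 256 levels sum to one, which is what makes the bound a valid codelength for discrete data.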

3.4 Simplified training objective

With the reverse process and decoder defined above, the variational bound, consisting of terms derived from Eqs. 12 and 13, is clearly differentiable with respect to $\theta$ and is ready to be employed for training. However, we found it beneficial to sample quality (and simpler to implement) to train on the following variant of the variational bound:

$$L_{\mathrm{simple}}(\theta)\coloneqq\mathbb{E}_{t,\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\!\left[\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}},t)\right\|^{2}\right] \tag{14}$$

where $t$ is uniform between $1$ and $T$ . The $t=1$ case corresponds to $L_{0}$ with the integral in the discrete decoder definition (Eq. 13) approximated by the Gaussian probability density function times the bin width, ignoring $\sigma_{1}^{2}$ and edge effects. The $t>1$ cases correspond to an unweighted version of Eq. 12, analogous to the loss weighting used by the NCSN denoising score matching model [55]. ( $L_{T}$ does not appear because the forward process variances $\beta_{t}$ are fixed.) Algorithm 1 displays the complete training procedure with this simplified objective.

Since our simplified objective (Eq. 14) discards the weighting in Eq. 12, it is a weighted variational bound that emphasizes different aspects of reconstruction compared to the standard variational bound [18, 22]. In particular, our diffusion process setup in Section 4 causes the simplified objective to down-weight loss terms corresponding to small $t$ . These terms train the network to denoise data with very small amounts of noise, so it is beneficial to down-weight them so that the network can focus on more difficult denoising tasks at larger $t$ terms. We will see in our experiments that this reweighting leads to better sample quality.
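To make the down-weighting concrete, the following sketch evaluates the coefficient $\beta_{t}^{2}/(2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t}))$ from Eq. 12 under the linear schedule of Section 4. The function name `vlb_weight` and the printed timesteps are illustrative assumptions; the simplified objective replaces every one of these coefficients by 1.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

def vlb_weight(t, variance="beta"):
    """Coefficient in Eq. 12 for 1 < t <= T (t is 1-indexed).

    variance="beta" uses sigma_t^2 = beta_t; anything else uses beta_tilde_t.
    """
    beta = betas[t - 1]
    alpha = 1.0 - beta
    abar = math.prod(1.0 - b for b in betas[:t])
    abar_prev = math.prod(1.0 - b for b in betas[:t - 1])
    var = beta if variance == "beta" else (1.0 - abar_prev) / (1.0 - abar) * beta
    return beta ** 2 / (2.0 * var * alpha * (1.0 - abar))

for t in (2, 10, 100, 1000):
    print(t, vlb_weight(t), vlb_weight(t, variance="beta_tilde"))
```

Under this schedule the coefficient at $t=2$ is more than an order of magnitude larger than at $t=T$, so setting all coefficients to 1 shifts relative emphasis away from the low-noise terms.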

4 Experiments

We set $T=1000$ for all experiments so that the number of neural network evaluations needed during sampling matches previous work [53, 55]. We set the forward process variances to constants increasing linearly from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$ . These constants were chosen to be small relative to data scaled to $[-1,1]$ , ensuring that reverse and forward processes have approximately the same functional form while keeping the signal-to-noise ratio at $\mathbf{x}_{T}$ as small as possible ( $L_{T}=D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T}|\mathbf{x}_{0})~\|~\mathcal{N}(\mathbf{0},\mathbf{I})\right)\approx 10^{-5}$ bits per dimension in our experiments).
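The effect of this schedule on the terminal distribution can be checked numerically. The sketch below builds the linear schedule, computes $\bar{\alpha}_{T}$, and evaluates the per-coordinate KL divergence between $q(\mathbf{x}_{T}|\mathbf{x}_{0})$ and the standard normal prior for a single pixel value. The helper `prior_kl_bits_per_dim` is a hypothetical name, and evaluating the KL at one fixed pixel value (rather than averaging over a dataset) is a simplifying assumption.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar_T = math.prod(1.0 - b for b in betas)   # residual signal fraction at t = T

def prior_kl_bits_per_dim(x0_value):
    """KL( N(sqrt(abar_T) * x0, 1 - abar_T) || N(0, 1) ) for one coordinate, in bits."""
    mean_sq = abar_T * x0_value ** 2
    var = 1.0 - abar_T
    nats = 0.5 * (var + mean_sq - 1.0 - math.log1p(-abar_T))
    return nats / math.log(2.0)

print(abar_T, prior_kl_bits_per_dim(1.0), prior_kl_bits_per_dim(0.5))
```

For pixel values in $[-1,1]$ this per-coordinate KL stays at or below a few times $10^{-5}$ bits, the same order of magnitude as the $L_{T}$ value quoted above.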

To represent the reverse process, we use a U-Net backbone similar to an unmasked PixelCNN++ [52, 48] with group normalization throughout [66]. Parameters are shared across time, which is specified to the network using the Transformer sinusoidal position embedding [60]. We use self-attention at the $16\times 16$ feature map resolution [63, 60]. Details are in Appendix B.
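As a rough illustration of how a scalar timestep can be presented to a network, the sketch below implements one common variant of the sinusoidal position embedding. The function name, the `max_period` constant, the frequency spacing, and the ordering of sine and cosine components are assumptions made here; the exact variant used in the experiments is described in Appendix B and the released code, not in this section.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=250, dim=8)
```

The resulting vector would typically be passed through a small learned projection and added to intermediate feature maps, so that a single set of shared parameters can condition on $t$.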

4.1 Sample quality

Table 1: CIFAR10 results. NLL measured in bits/dim.

Model	IS	FID	NLL Test (Train)
Conditional
EBM [11]	$8.30$	$37.9$
JEM [17]	$8.76$	$38.4$
BigGAN [3]	$9.22$	$14.73$
StyleGAN2 + ADA (v1) [29]	$\mathbf{10.06}$	$\mathbf{2.67}$
Unconditional
Diffusion (original) [53]			$\leq 5.40$
Gated PixelCNN [59]	$4.60$	$65.93$	$3.03$ $(2.90)$
Sparse Transformer [7]			$\mathbf{2.80}$
PixelIQN [43]	$5.29$	$49.46$
EBM [11]	$6.78$	$38.2$
NCSNv2 [56]		$31.75$
NCSN [55]	$8.87\!\pm\!0.12$	$25.32$
SNGAN [39]	$8.22\!\pm\!0.05$	$21.7$
SNGAN-DDLS [4]	$9.09\!\pm\!0.10$	$15.42$
StyleGAN2 + ADA (v1) [29]	$\mathbf{9.74}\pm 0.05$	$3.26$
Ours ( $L$ , fixed isotropic ${\boldsymbol{\Sigma}}$ )	$7.67\!\pm\!0.13$	$13.51$	$\leq 3.70$ $(3.69)$
Ours ( $L_{\mathrm{simple}}$ )	$9.46\!\pm\!0.11$	$\mathbf{3.17}$	$\leq 3.75$ $(3.72)$

Table 2: Unconditional CIFAR10 reverse process parameterization and training objective ablation. Blank entries were unstable to train and generated poor samples with out-of-range scores.

Objective	IS	FID
$\tilde{\boldsymbol{\mu}}$ prediction (baseline)
$L$ , learned diagonal ${\boldsymbol{\Sigma}}$	$7.28\!\pm\!0.10$	$23.69$
$L$ , fixed isotropic ${\boldsymbol{\Sigma}}$	$8.06\!\pm\!0.09$	$13.22$
$\|\tilde{\boldsymbol{\mu}}-\tilde{\boldsymbol{\mu}}_{\theta}\|^{2}$	–	–
${\boldsymbol{\epsilon}}$ prediction (ours)
$L$ , learned diagonal ${\boldsymbol{\Sigma}}$	–	–
$L$ , fixed isotropic ${\boldsymbol{\Sigma}}$	$7.67\!\pm\!0.13$	$13.51$
$\|\tilde{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}\|^{2}$ ( $L_{\mathrm{simple}}$ )	$\mathbf{9.46\!\pm\!0.11}$	$\mathbf{3.17}$

Table 1 shows Inception scores, FID scores, and negative log likelihoods (lossless codelengths) on CIFAR10. With our FID score of 3.17, our unconditional model achieves better sample quality than most models in the literature, including class conditional models. Our FID score is computed with respect to the training set, as is standard practice; when we compute it with respect to the test set, the score is 5.24, which is still better than many of the training set FID scores in the literature.

We find that training our models on the true variational bound yields better codelengths than training on the simplified objective, as expected, but the latter yields the best sample quality. See Fig. 1 for CIFAR10 and CelebA-HQ $256\times 256$ samples, Fig. 3 and Fig. 4 for LSUN $256\times 256$ samples [71], and Appendix D for more.

4.2 Reverse process parameterization and training objective ablation

In Table 2, we show the sample quality effects of reverse process parameterizations and training objectives (Section 3.2). We find that the baseline option of predicting $\tilde{\boldsymbol{\mu}}$ works well only when trained on the true variational bound instead of unweighted mean squared error, a simplified objective akin to Eq. 14. We also see that learning reverse process variances (by incorporating a parameterized diagonal ${\boldsymbol{\Sigma}}_{\theta}(\mathbf{x}_{t})$ into the variational bound) leads to unstable training and poorer sample quality compared to fixed variances. Predicting ${\boldsymbol{\epsilon}}$ , as we proposed, performs approximately as well as predicting $\tilde{\boldsymbol{\mu}}$ when trained on the variational bound with fixed variances, but much better when trained with our simplified objective.

4.3 Progressive coding

Table 1 also shows the codelengths of our CIFAR10 models. The gap between train and test is at most 0.03 bits per dimension, which is comparable to the gaps reported with other likelihood-based models and indicates that our diffusion model is not overfitting (see Appendix D for nearest neighbor visualizations). Still, while our lossless codelengths are better than the large estimates reported for energy based models and score matching using annealed importance sampling [11], they are not competitive with other types of likelihood-based generative models [7].

Since our samples are nonetheless of high quality, we conclude that diffusion models have an inductive bias that makes them excellent lossy compressors. Treating the variational bound terms $L_{1}+\cdots+L_{T}$ as rate and $L_{0}$ as distortion, our CIFAR10 model with the highest quality samples has a rate of 1.78 bits/dim and a distortion of 1.97 bits/dim, which amounts to a root mean squared error of 0.95 on a scale from 0 to 255. More than half of the lossless codelength describes imperceptible distortions.
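The "more than half" claim can be checked directly from the two reported numbers. The short calculation below is only a worked check of that arithmetic; the variable names are illustrative and not taken from the authors' code.

```python
# Reported values for the CIFAR10 model with the best sample quality (bits/dim).
rate = 1.78        # L_1 + ... + L_T
distortion = 1.97  # L_0

total = rate + distortion              # lossless codelength
distortion_share = distortion / total  # fraction of the codelength spent on L_0

print(f"total codelength: {total:.2f} bits/dim")      # 3.75 bits/dim
print(f"share spent on L_0: {distortion_share:.1%}")  # 52.5%
```

The total of 3.75 bits/dim is the lossless codelength, and the $L_{0}$ term accounts for slightly more than half of it.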

Progressive lossy compression

We can probe further into the rate-distortion behavior of our model by introducing a progressive lossy code that mirrors the form of Eq. 5: see Algorithms 3 and 4, which assume access to a procedure, such as minimal random coding [19, 20], that can transmit a sample $\mathbf{x}\sim q(\mathbf{x})$ using approximately $D_{\mathrm{KL}}\!\left(q(\mathbf{x})~{}\|~{}p(\mathbf{x})\right)$ bits on average for any distributions $p$ and $q$ , for which only $p$ is available to the receiver beforehand.
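For intuition about the cost of each transmission step, note that when both $q$ and $p$ are diagonal Gaussians the KL divergence has a closed form, so the expected number of bits per step can be computed directly. The sketch below is a minimal illustration under that assumption. The function name `gaussian_kl_bits` and the toy numbers are hypothetical, and the code computes only the expected codelength; it does not implement minimal random coding itself.

```python
import numpy as np

def gaussian_kl_bits(mean_q, mean_p, var_q, var_p):
    """KL(N(mean_q, var_q I) || N(mean_p, var_p I)) in bits, summed over dimensions."""
    kl_nats = 0.5 * (np.log(var_p / var_q)
                     + (var_q + (mean_q - mean_p) ** 2) / var_p
                     - 1.0)
    return float(np.sum(kl_nats) / np.log(2.0))  # nats -> bits

# Toy example (illustrative numbers): three transmission steps over a
# 4-dimensional signal, each with equal variances for q and p.
mean_p = np.zeros(4)
steps = [
    (np.full(4, 0.5), 1.0),
    (np.full(4, 0.2), 0.5),
    (np.full(4, 0.1), 0.25),
]
step_bits = [gaussian_kl_bits(m, mean_p, v, v) for m, v in steps]
cumulative_bits = np.cumsum(step_bits)  # rate received so far after each step
```

Summing the per-step costs in transmission order gives the cumulative rate that the receiver has consumed at each point in the sequence.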

Algorithm 3 Sending $\mathbf{x}_{0}$
1: Send $\mathbf{x}_{T}\sim q(\mathbf{x}_{T}|\mathbf{x}_{0})$ using $p(\mathbf{x}_{T})$
2: for $t=T-1,\dotsc,2,1$ do
3:   Send $\mathbf{x}_{t}\sim q(\mathbf{x}_{t}|\mathbf{x}_{t+1},\mathbf{x}_{0})$ using $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})$
4: end for
5: Send $\mathbf{x}_{0}$ using $p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})$

Algorithm 4 Receiving
1: Receive $\mathbf{x}_{T}$ using $p(\mathbf{x}_{T})$
2: for $t=T-1,\dotsc,1,0$ do
3:   Receive $\mathbf{x}_{t}$ using $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})$
4: end for
5: return $\mathbf{x}_{0}$

When applied to $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$ , Algorithms 3 and 4 transmit $\mathbf{x}_{T},\dotsc,\mathbf{x}_{0}$ in sequence using a total expected codelength equal to Eq. 5. The receiver, at any time $t$ , has the partial information $\mathbf{x}_{t}$ fully available and can progressively estimate:

$$\mathbf{x}_{0}\approx\hat{\mathbf{x}}_{0}=\left(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\,{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t})\right)/\sqrt{\bar{\alpha}_{t}} \tag{15}$$

due to Eq. 4. (A stochastic reconstruction $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ is also valid, but we do not consider it here because it makes distortion more difficult to evaluate.) Figure 5 shows the resulting rate-distortion plot on the CIFAR10 test set. At each time $t$ , the distortion is calculated as the root mean squared error $\sqrt{\|\mathbf{x}_{0}-\hat{\mathbf{x}}_{0}\|^{2}/D}$ , and the rate is calculated as the cumulative number of bits received so far at time $t$ . The distortion decreases steeply in the low-rate region of the rate-distortion plot, indicating that the majority of the bits are indeed allocated to imperceptible distortions.
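The estimate in Eq. 15 and the distortion measure can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes a linear $\beta_{t}$ schedule from $10^{-4}$ to $0.02$ with $T=1000$ (the setup described in the paper's experiments, outside this section), images scaled to $[-1,1]$, and the true noise used as an oracle in place of a trained ${\boldsymbol{\epsilon}}_{\theta}$. All function names are hypothetical.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_t = 1 - beta_t for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Eq. 15: solve x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def rmse_255(x0, x0_hat):
    """Root mean squared error on a [0, 255] scale for images stored in [-1, 1]."""
    diff = (x0 - x0_hat) * 127.5
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for a CIFAR10 image
t = 500                                         # 1-based timestep
a_t = alpha_bar[t - 1]
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps  # forward sample (Eq. 4)

x0_oracle = predict_x0(x_t, eps, a_t)                 # perfect noise prediction
x0_naive = predict_x0(x_t, np.zeros_like(eps), a_t)   # predicts no noise at all

print(rmse_255(x0, x0_oracle))  # numerically zero
print(rmse_255(x0, x0_naive))   # large: the noise is left in the estimate
```

With a perfect noise prediction the formula recovers $\mathbf{x}_{0}$ up to floating-point error, so any distortion measured in practice reflects the error of ${\boldsymbol{\epsilon}}_{\theta}$ at that timestep.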

Figure 5: Unconditional CIFAR10 test set rate-distortion vs. time. Distortion is measured in root mean squared error on a $[0,255]$ scale. See Table 4 for details.

Progressive generation

We also run a progressive unconditional generation process given by progressive decompression from random bits. In other words, we predict the result of the reverse process, $\hat{\mathbf{x}}_{0}$ , while sampling from the reverse process using Algorithm 2. Figures 6 and 10 show the resulting sample quality of $\hat{\mathbf{x}}_{0}$ over the course of the reverse process. Large scale image features appear first and details appear last. Figure 7 shows stochastic predictions $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ with $\mathbf{x}_{t}$ frozen for various $t$ . When $t$ is small, all but fine details are preserved, and when $t$ is large, only large scale features are preserved. Perhaps these are hints of conceptual compression [18].
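A progressive generation loop of this kind can be sketched as follows. Algorithm 2 is not reproduced in this section, so the update below assumes the standard ancestral sampling step with the fixed variance choice $\sigma_{t}^{2}=\beta_{t}$. The function `toy_eps_model` is a hypothetical stand-in for a trained network, and the 50-step schedule is chosen only to keep the example fast.

```python
import numpy as np

def progressive_generation(eps_model, shape, betas, rng):
    """Run the reverse process and record the Eq. 15 estimate of x_0 at every step."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    x0_hats = []
    for t in range(T, 0, -1):       # t = T, ..., 1
        i = t - 1                   # 0-based index into the schedule
        eps_hat = eps_model(x, t)
        # Current prediction of the final sample (Eq. 15).
        x0_hats.append((x - np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alpha_bar[i]))
        # Reverse-process mean, then add noise except at the last step.
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z  # assumes sigma_t^2 = beta_t
    return x, x0_hats

def toy_eps_model(x, t):
    """Hypothetical placeholder; a real model would be a trained network."""
    return 0.1 * x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)  # short toy schedule
sample, x0_hats = progressive_generation(toy_eps_model, (3, 8, 8), betas, rng)
```

Under these assumptions the estimate recorded at $t=1$ coincides with the returned sample, because $1-\bar{\alpha}_{1}=\beta_{1}$ and no noise is added at the last step.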

Connection to autoregressive decoding

Note that the variational bound (Eq. 5) can be rewritten as:

$$L=D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T})\;\|\;p(\mathbf{x}_{T})\right)+\mathbb{E}_{q}\Bigg[\sum_{t\geq 1}D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}|\mathbf{x}_{t})\;\|\;p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\right)\Bigg]+H(\mathbf{x}_{0}) \tag{16}$$

(See Appendix A for a derivation.) Now consider setting the diffusion process length $T$ to the dimensionality of the data, defining the forward process so that $q(\mathbf{x}_{t}|\mathbf{x}_{0})$ places all probability mass on $\mathbf{x}_{0}$ with the first $t$ coordinates masked out (i.e. $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ masks out the $t^{\text{th}}$ coordinate), setting $p(\mathbf{x}_{T})$ to place all mass on a blank image, and, for the sake of argument, taking $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ to be a fully expressive conditional distribution. With these choices, $D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T})~{}\|~{}p(\mathbf{x}_{T})\right)=0$ , and minimizing $D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}|\mathbf{x}_{t})~{}\|~{}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\right)$ trains $p_{\theta}$ to copy coordinates $t+1,\dotsc,T$ unchanged and to predict the $t^{\text{th}}$ coordinate given $t+1,\dotsc,T$ . Thus, training $p_{\theta}$ with this particular diffusion is training an autoregressive model.
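The masking diffusion described above is easy to write down explicitly. The sketch below is a toy illustration on a 4-dimensional vector; the function name `mask_forward` and the choice of `0.0` as the blank value are assumptions made for the example.

```python
import numpy as np

def mask_forward(x0, t, blank=0.0):
    """Deterministic q(x_t | x_0): x_0 with its first t coordinates masked out."""
    x_t = np.array(x0, dtype=float)
    x_t[:t] = blank
    return x_t

x0 = np.array([0.3, -1.2, 0.7, 2.0])  # toy data point with no blank-valued entries
T = x0.size                            # diffusion length equals data dimension
chain = [mask_forward(x0, t) for t in range(T + 1)]  # x_0, x_1, ..., x_T

for t in range(1, T + 1):
    # Going from x_t back to x_{t-1} changes only the t-th coordinate (index t-1);
    # coordinates t+1, ..., T are copied unchanged.
    changed = np.flatnonzero(chain[t - 1] != chain[t])
    print(t, changed.tolist())
```

Each reverse step therefore has to predict exactly one new coordinate given the coordinates that are still visible, which is the autoregressive factorization in reverse coordinate order.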

We can therefore interpret the Gaussian diffusion model (Eq. 2) as a kind of autoregressive model with a generalized bit ordering that cannot be expressed by reordering data coordinates. Prior work has shown that such reorderings introduce inductive biases that have an impact on sample quality [38], so we speculate that the Gaussian diffusion serves a similar purpose, perhaps to greater effect since Gaussian noise might be more natural to add to images compared to masking noise. Moreover, the Gaussian diffusion length is not restricted to equal the data dimension; for instance, we use $T=1000$ , which is less than the dimension of the $32\times 32\times 3$ or $256\times 256\times 3$ images in our experiments. Gaussian diffusions can be made shorter for fast sampling or longer for model expressiveness.

4.4 Interpolation

We can interpolate source images $\mathbf{x}_{0},\mathbf{x}^{\prime}_{0}\sim q(\mathbf{x}_{0})$ in latent space by using $q$ as a stochastic encoder, $\mathbf{x}_{t},\mathbf{x}^{\prime}_{t}\sim q(\mathbf{x}_{t}|\mathbf{x}_{0})$ , and then decoding the linearly interpolated latent $\bar{\mathbf{x}}_{t}=(1-\lambda)\mathbf{x}_{t}+\lambda\mathbf{x}^{\prime}_{t}$ into image space by the reverse process, $\bar{\mathbf{x}}_{0}\sim p(\mathbf{x}_{0}|\bar{\mathbf{x}}_{t})$ . In effect, we use the reverse process to remove artifacts from linearly interpolating corrupted versions of the source images, as depicted in Fig. 8 (left). We fixed the noise for different values of $\lambda$ so $\mathbf{x}_{t}$ and $\mathbf{x}^{\prime}_{t}$ remain the same. Fig. 8 (right) shows interpolations and reconstructions of original CelebA-HQ $256\times 256$ images ( $t=500$ ). The reverse process produces high-quality reconstructions, and plausible interpolations that smoothly vary attributes such as pose, skin tone, hairstyle, expression and background, but not eyewear. Larger $t$ results in coarser and more varied interpolations, with novel samples at $t=1000$ (Appendix Fig. 9).
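The encoding and interpolation steps can be sketched as below. This is a minimal illustration that interpolates the encoded latents $\mathbf{x}_{t}$ and $\mathbf{x}^{\prime}_{t}$ with the noise drawn once and reused for every $\lambda$. The function names and the value of `alpha_bar_t` are assumptions for the example. The decoding step is omitted because it requires running the reverse process of a trained model from step $t$.

```python
import numpy as np

def q_sample(x0, alpha_bar_t, noise):
    """Stochastic encoder q(x_t | x_0) for a given cumulative alpha_bar_t."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def interpolate_latents(x0_a, x0_b, alpha_bar_t, lambdas, rng):
    """Encode both images once, then linearly interpolate the two latents."""
    # The noise is drawn once so x_t and x'_t stay fixed across all lambdas.
    x_t_a = q_sample(x0_a, alpha_bar_t, rng.standard_normal(x0_a.shape))
    x_t_b = q_sample(x0_b, alpha_bar_t, rng.standard_normal(x0_b.shape))
    return [(1.0 - lam) * x_t_a + lam * x_t_b for lam in lambdas]

rng = np.random.default_rng(0)
x0_a = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stand-ins for two source images
x0_b = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
lambdas = [0.0, 0.5, 1.0]
latents = interpolate_latents(x0_a, x0_b, alpha_bar_t=0.08, lambdas=lambdas, rng=rng)
# Each entry of `latents` would then be decoded by the reverse process from step t.
```

Because the two encodings are shared across all values of $\lambda$, the endpoints $\lambda=0$ and $\lambda=1$ correspond to reconstructions of the two source images, and intermediate values lie on the straight line between the two latents.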

5 Related Work

While diffusion models might resemble flows [9, 46, 10, 32, 5, 16, 23] and VAEs [33, 47, 37], diffusion models are designed so that $q$ has no parameters and the top-level latent $\mathbf{x}_{T}$ has nearly zero mutual information with the data $\mathbf{x}_{0}$ . Our ${\boldsymbol{\epsilon}}$ -prediction reverse process parameterization establishes a connection between diffusion models and denoising score matching over multiple noise levels with annealed Langevin dynamics for sampling [55, 56]. Diffusion models, however, admit straightforward log likelihood evaluation, and the training procedure explicitly trains the Langevin dynamics sampler using variational inference (see Appendix C for details). The connection also has the reverse implication that a certain weighted form of denoising score matching is the same as variational inference to train a Langevin-like sampler. Other methods for learning transition operators of Markov chains include infusion training [2], variational walkback [15], generative stochastic networks [1], and others [50, 54, 36, 42, 35, 65].

By the known connection between score matching and energy-based modeling, our work could have implications for other recent work on energy-based models [67, 68, 69, 12, 70, 13, 11, 41, 17, 8]. Our rate-distortion curves are computed over time in one evaluation of the variational bound, reminiscent of how rate-distortion curves can be computed over distortion penalties in one run of annealed importance sampling [24]. Our progressive decoding argument can be seen in convolutional DRAW and related models [18, 40] and may also lead to more general designs for subscale orderings or sampling strategies for autoregressive models [38, 64].

6 Conclusion

We have presented high quality image samples using diffusion models, and we have found connections among diffusion models and variational inference for training Markov chains, denoising score matching and annealed Langevin dynamics (and energy-based models by extension), autoregressive models, and progressive lossy compression. Since diffusion models seem to have excellent inductive biases for image data, we look forward to investigating their utility in other data modalities and as components in other types of generative models and machine learning systems.

Broader Impact

Our work on diffusion models takes on a similar scope as existing work on other types of deep generative models, such as efforts to improve the sample quality of GANs, flows, autoregressive models, and so forth. Our paper represents progress in making diffusion models a generally useful tool in this family of techniques, so it may serve to amplify any impacts that generative models have had (and will have) on the broader world.

Unfortunately, there are numerous well-known malicious uses of generative models. Sample generation techniques can be employed to produce fake images and videos of high profile figures for political purposes. While fake images were manually created long before software tools were available, generative models such as ours make the process easier. Fortunately, CNN-generated images currently have subtle flaws that allow detection [62], but improvements in generative models may make this more difficult. Generative models also reflect the biases in the datasets on which they are trained. As many large datasets are collected from the internet by automated systems, it can be difficult to remove these biases, especially when the images are unlabeled. If samples from generative models trained on these datasets proliferate throughout the internet, then these biases will only be reinforced further.

On the other hand, diffusion models may be useful for data compression, which, as data becomes higher resolution and as global internet traffic increases, might be crucial to ensure accessibility of the internet to wide audiences. Our work might contribute to representation learning on unlabeled raw data for a large range of downstream tasks, from image classification to reinforcement learning, and diffusion models might also become viable for creative uses in art, photography, and music.

Acknowledgments and Disclosure of Funding

This work was supported by ONR PECASE and the NSF Graduate Research Fellowship under grant number DGE-1752814. Google’s TensorFlow Research Cloud (TFRC) provided Cloud TPUs.

References

Alain et al. [2016] Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, Eric Thibodeau-Laufer, Saizheng Zhang, and Pascal Vincent. GSNs: generative stochastic networks. Information and Inference: A Journal of the IMA, 5(2):210–249, 2016.
Bordes et al. [2017] Florian Bordes, Sina Honari, and Pascal Vincent. Learning to generate samples from noise through infusion training. In International Conference on Learning Representations, 2017.
Brock et al. [2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Che et al. [2020] Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and Yoshua Bengio. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. arXiv preprint arXiv:2003.06060, 2020.
Chen et al. [2018a] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018a.
Chen et al. [2018b] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning, pages 863–871, 2018b.
Child et al. [2019] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
Deng et al. [2020] Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc’Aurelio Ranzato. Residual energy-based models for text generation. arXiv preprint arXiv:2004.11714, 2020.
Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
Du and Mordatch [2019] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, pages 3603–3613, 2019.
Gao et al. [2018] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative ConvNets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
Gao et al. [2020] Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7518–7528, 2020.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Goyal et al. [2017] Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pages 4392–4402, 2017.
Grathwohl et al. [2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
Grathwohl et al. [2020] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020.
Gregor et al. [2016] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances In Neural Information Processing Systems, pages 3549–3557, 2016.
Harsha et al. [2007] Prahladh Harsha, Rahul Jain, David McAllester, and Jaikumar Radhakrishnan. The communication complexity of correlation. In Twenty-Second Annual IEEE Conference on Computational Complexity (CCC’07), pages 10–23. IEEE, 2007.
Havasi et al. [2019] Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2019.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
Higgins et al. [2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
Ho et al. [2019] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, 2019.
Huang et al. [2020] Sicong Huang, Alireza Makhzani, Yanshuai Cao, and Roger Grosse. Evaluating lossy compression rates of deep generative models. In International Conference on Machine Learning, 2020.
Kalchbrenner et al. [2017] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771–1779, 2017.
Kalchbrenner et al. [2018] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419, 2018.
Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
Karras et al. [2020a] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676v1, 2020a.
Karras et al. [2020b] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020b.
Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Kingma and Dhariwal [2018] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Kingma et al. [2016] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
Lawson et al. [2019] John Lawson, George Tucker, Bo Dai, and Rajesh Ranganath. Energy-inspired models: Learning with sampler-induced distributions. In Advances in Neural Information Processing Systems, pages 8501–8513, 2019.
Levy et al. [2018] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.
Maaløe et al. [2019] Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. BIVA: A very deep hierarchy of latent variables for generative modeling. In Advances in Neural Information Processing Systems, pages 6548–6558, 2019.
Menick and Kalchbrenner [2019] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019.
Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nichol [2020] Alex Nichol. VQ-DRAW: A sequential discrete VAE. arXiv preprint arXiv:2003.01599, 2020.
Nijkamp et al. [2019a] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019a.
Nijkamp et al. [2019b] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems, pages 5233–5243, 2019b.
Ostrovski et al. [2018] Georg Ostrovski, Will Dabney, and Remi Munos. Autoregressive quantile networks for generative modeling. In International Conference on Machine Learning, pages 3936–3945, 2018.
Prenger et al. [2019] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
Razavi et al. [2019] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pages 14837–14847, 2019.
Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
Salimans and Kingma [2016] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
Tim Salimans和Durk P Kingma。权重归一化：一种简单的重新参数化，可加速深度神经网络的训练。神经信息处理系统的进展，第901-909页，2016年。
Salimans et al. [2015] Salimans等人[2015] Tim Salimans, Diederik Kingma, and Max Welling. Markov Chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
蒂姆·萨利曼斯，迪德里克·金玛，马克斯·威林。马尔可夫链蒙特卡罗和变分推理：弥合差距。在国际机器学习会议上，第1218-1226页，2015年。
Salimans et al. [2016] Salimans等人[2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
Tim Salimans、Ian Goodfellow、Wojciech Zaremba、Vicki Cheung、Alec拉德福和Xi Chen。训练甘斯的改进技术。神经信息处理系统的进展，第2234-2242页，2016年。
Salimans et al. [2017] Salimans等人[2017] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.
Tim Salimans，Andrej Karpathy，Xi Chen，and Diederik P Kingma. PixelCNN++：使用离散化逻辑混合似然和其他修改改进PixelCNN。2017年，在国际学术会议上发表。
Sohl-Dickstein et al. [2015]
Sohl-Dickstein等人[2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015.
Jascha Sohl-Dickstein、Eric韦斯、Niru Maheswaranathan和Surya Ganguli。使用非平衡热力学的深度无监督学习。在国际机器学习会议上，第2256-2265页，2015年。
Song et al. [2017] Song等人[2017] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
Jiaming Song，Shengjia Zhao，and Stefano Ermon. A-NICE-MC：MCMC的对抗训练。神经信息处理系统的进展，第5140-5150页，2017年。
Song and Ermon [2019] 宋和厄蒙[2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
杨松和Stefano Ermon。通过估计数据分布的梯度进行生成式建模。神经信息处理系统的进展，第11895-11907页，2019年。
Song and Ermon [2020] 宋和厄蒙[2020] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv preprint arXiv:2006.09011, 2020.
杨松和Stefano Ermon。训练基于分数的生成模型的改进技术。arXiv预印本arXiv：2006.09011，2020。
van den Oord et al. [2016a]
货车den Oord等人[2016 a] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
亚伦货车登奥德，桑德Dieleman，海加禅，卡伦Simonyan，奥里奥尔Vinyals，亚历克斯格雷夫斯，纳尔Kalchbrenner，安德鲁高级，和Koray Kavukcuoglu。WaveNet：原始音频的生成模型。arXiv预印本arXiv：1609.03499，2016 a。
van den Oord et al. [2016b]
货车den Oord等人[2016 b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning, 2016b.
亚伦货车登奥德，纳尔Kalchbrenner，和Koray Kavukcuoglu。像素递归神经网络。机器学习国际会议，2016年b。
van den Oord et al. [2016c]
货车den Oord等人[2016 c] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016c.
Aaron货车 den Oord、Nal Kalchbrenner、Oriol Vinyals、Lasse Espeholt、Alex Graves和Koray Kavukcuoglu。使用PixelCNN解码器生成条件图像。在《神经信息处理系统进展》中，第4790-4798页，2016 c。
Vaswani et al. [2017] Vaswani等人[2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
阿什什·瓦斯瓦尼、诺姆·沙泽尔、尼基·帕尔马、雅各布·乌斯兹科里特、莱昂·琼斯、艾登·戈麦斯、武卡斯·凯撒和伊利亚·波罗苏欣。你需要的只是关注。神经信息处理系统的进展，第5998-6008页，2017年。
Vincent [2011] 文森特[2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
帕斯卡·文森特。分数匹配和去噪自动编码器之间的连接。NeuralComputation，23（7）：1661-1674，2011.
Wang et al. [2020] Wang等人[2020] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot…for now. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Sheng-Yu Wang，奥利弗王，理查德张，安德鲁欧文斯，和阿列克谢 A埃夫罗斯. cnn生成的图像很容易被发现……就目前而言。在IEEE计算机视觉和模式识别会议论文集，2020年。
Wang et al. [2018] Wang等人[2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
王小龙、Ross Girshick、Abhinav Gupta和Kaiming He。非局部神经网络。在IEEE计算机视觉和模式识别会议论文集，第7794-7803页，2018年。
Wiggers and Hoogeboom [2020]
Wiggers and Hoogeboom（2020） Auke J Wiggers and Emiel Hoogeboom. Predictive sampling with forecasting autoregressive models. arXiv preprint arXiv:2002.09928, 2020.
奥克· J·威格斯和特里尔·胡格布姆。预测自回归模型的预测抽样。arXiv预印本arXiv：2002.09928，2020。
Wu et al. [2020] Wu等人[2020] Hao Wu, Jonas Köhler, and Frank Noé. Stochastic normalizing flows. arXiv preprint arXiv:2002.06707, 2020.
Hao Wu，Jonas Köhler，Frank Noé. 随机标准化流。arXiv预印本arXiv：2002.06707，2020。
Wu and He [2018] 吴和他[2018] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
吴宇新和何开明。组正常化。在欧洲计算机视觉会议（ECCV）的会议记录中，第3-19页，2018年。
Xie et al. [2016] Xie等人[2016] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644, 2016.
Jianwen Xie，Yang Lu，Song-Chun Zhu，and Yingnian Wu. 生成Convnet理论。在国际机器学习会议上，第2635-2644页，2016年。
Xie et al. [2017] Xie等人[2017] Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7093–7101, 2017.
Jianwen Xie，Song-Chun Zhu，and Ying Nian Wu. 利用时空产生式转换网路合成动态图案。在IEEE计算机视觉和模式识别会议论文集，第7093-7101页，2017年。
Xie et al. [2018] Xie等人[2018] Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8629–8638, 2018.
Jianwen Xie，Zilong Zheng，Ruiqi Gao，Wengguan Wang，Song-Chun Zhu，and Ying Nian Wu. 学习描述符网络用于三维形状合成与分析。在IEEE计算机视觉和模式识别会议论文集，第8629-8638页，2018年。
Xie et al. [2019] Xie等人[2019] Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Learning energy-based spatial-temporal generative convnets for dynamic patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
Jianwen Xie，Song-Chun Zhu，and Ying Nian Wu. 基于学习能量的动态模式时空生成卷积网。IEEETransactions on Pattern Analysis and Machine Intelligence，2019。
Yu et al. [2015] Yu等人[2015] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
Fisher Yu，Yinda Zhang，Shuran Song，Ari Seff，and Jianxiong Xiao. LSUN：使用深度学习构建大规模图像数据集，并将人类纳入循环。arXiv预印本arXiv：1506.03365，2015年。
Zagoruyko and Komodakis [2016]
Zagoruyko和Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Sergey Zagoruyko和Nikos Komodakis 广泛的残余网络。arXiv预印本arXiv：1605.07146，2016年。

Extra information

LSUN

FID scores for LSUN datasets are included in Table 3. Scores marked with $^{*}$ are reported by StyleGAN2 as baselines, and other scores are reported by their respective authors.

Table 3: FID scores for LSUN $256\times 256$ datasets

Model	LSUN Bedroom	LSUN Church	LSUN Cat
ProgressiveGAN [27]	8.34	6.42	37.52
StyleGAN [28]	2.65	4.21$^{*}$	8.53$^{*}$
StyleGAN2 [30]	-	3.86	6.93
Ours ($L_{\mathrm{simple}}$)	6.36	7.89	19.75
Ours ($L_{\mathrm{simple}}$, large)	4.90	-	-

Progressive compression

Our lossy compression argument in Section 4.3 is only a proof of concept, because Algorithms 3 and 4 depend on a procedure such as minimal random coding [20], which is not tractable for high dimensional data. These algorithms serve as a compression interpretation of the variational bound (Eq. 5) of Sohl-Dickstein et al. [53], not yet as a practical compression system.

Table 4: Unconditional CIFAR10 test set rate-distortion values (accompanies Fig. 5)

Reverse process time ($T-t+1$)	Rate (bits/dim)	Distortion (RMSE $[0,255]$)
1000	1.77581	0.95136
900	0.11994	12.02277
800	0.05415	18.47482
700	0.02866	24.43656
600	0.01507	30.80948
500	0.00716	38.03236
400	0.00282	46.12765
300	0.00081	54.18826
200	0.00013	60.97170
100	0.00000	67.60125
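
To give the rate column a more tangible scale, the short sketch below converts bits/dim into bytes per image. It assumes a CIFAR10 image has $32\times 32\times 3=3072$ dimensions and reads the rate column as the number of bits per dimension received up to that point of the reverse process; the helper name `bytes_per_image` is illustrative only.

```python
# Convert the rate column of Table 4 (bits/dim) into bytes per image.
# Assumption: one CIFAR10 image has 32 * 32 * 3 = 3072 dimensions.
DIMS_PER_IMAGE = 32 * 32 * 3

# (reverse process time, rate in bits/dim, distortion as RMSE on [0, 255]),
# copied from Table 4.
TABLE_4 = [
    (1000, 1.77581, 0.95136),
    (900, 0.11994, 12.02277),
    (800, 0.05415, 18.47482),
    (700, 0.02866, 24.43656),
    (600, 0.01507, 30.80948),
    (500, 0.00716, 38.03236),
    (400, 0.00282, 46.12765),
    (300, 0.00081, 54.18826),
    (200, 0.00013, 60.97170),
    (100, 0.00000, 67.60125),
]


def bytes_per_image(rate_bits_per_dim, dims=DIMS_PER_IMAGE):
    """Total code length for one image, in bytes."""
    return rate_bits_per_dim * dims / 8.0


for step, rate, rmse in TABLE_4:
    print(f"step {step:4d}: {bytes_per_image(rate):8.2f} bytes, RMSE {rmse:6.2f}")
```

Read this way, most of the bits are spent in the last hundred reverse steps, where the distortion drops from about 12 to below 1.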

Appendix A Extended derivations

Below is a derivation of Eq. 5, the reduced variance variational bound for diffusion models. This material is from Sohl-Dickstein et al. [53]; we include it here only for completeness.

$\displaystyle L$	$\displaystyle=\mathbb{E}_{q}\!\left[-\log\frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right]$	(17)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log p(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right]$	(18)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log p(\mathbf{x}_{T})-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}-\log\frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})}\right]$	(19)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log p(\mathbf{x}_{T})-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}\cdot\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})}-\log\frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})}\right]$	(20)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log\frac{p(\mathbf{x}_{T})}{q(\mathbf{x}_{T}|\mathbf{x}_{0})}-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})\right]$	(21)
	$\displaystyle=\mathbb{E}_{q}\!\left[D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T}|\mathbf{x}_{0})~{}\|~{}p(\mathbf{x}_{T})\right)+\sum_{t>1}D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})~{}\|~{}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\right)-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})\right]$	(22)
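
The passage from Eq. 19 to Eq. 21 uses two facts that the derivation leaves implicit. They are spelled out here as a sketch: the forward process is Markov, so conditioning on $\mathbf{x}_{0}$ does not change $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ and Bayes' rule can be applied; and the resulting ratios telescope over $t>1$.

```latex
\begin{aligned}
q(\mathbf{x}_{t}|\mathbf{x}_{t-1})
  &= q(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{0})
   = \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})\,q(\mathbf{x}_{t}|\mathbf{x}_{0})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})},
  \qquad t>1, \\
\sum_{t>1}\log\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})}
  &= \log\frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{q(\mathbf{x}_{T}|\mathbf{x}_{0})}.
\end{aligned}
```

Substituting the first identity into Eq. 19 gives Eq. 20. The telescoped sum then cancels the $q(\mathbf{x}_{1}|\mathbf{x}_{0})$ in the last term and moves $q(\mathbf{x}_{T}|\mathbf{x}_{0})$ under $p(\mathbf{x}_{T})$, which gives Eq. 21.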

The following is an alternate version of $L$ . It is not tractable to estimate, but it is useful for our discussion in Section 4.3.

$\displaystyle L$	$\displaystyle=\mathbb{E}_{q}\!\left[-\log p(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right]$	(23)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log p(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t})}\cdot\frac{q(\mathbf{x}_{t-1})}{q(\mathbf{x}_{t})}\right]$	(24)
	$\displaystyle=\mathbb{E}_{q}\!\left[-\log\frac{p(\mathbf{x}_{T})}{q(\mathbf{x}_{T})}-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t})}-\log q(\mathbf{x}_{0})\right]$	(25)
	$\displaystyle=D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T})~{}\|~{}p(\mathbf{x}_{T})\right)+\mathbb{E}_{q}\!\left[\sum_{t\geq 1}D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}|\mathbf{x}_{t})~{}\|~{}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\right)\right]+H(\mathbf{x}_{0})$	(26)
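
As a sketch of the intermediate steps in this second derivation: Bayes' rule is applied with marginals of $q$ rather than with conditionals on $\mathbf{x}_{0}$, and the marginal ratios telescope over $t\geq 1$.

```latex
\begin{aligned}
q(\mathbf{x}_{t}|\mathbf{x}_{t-1})
  &= \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t})\,q(\mathbf{x}_{t})}{q(\mathbf{x}_{t-1})}, \\
\sum_{t\geq 1}\log\frac{q(\mathbf{x}_{t-1})}{q(\mathbf{x}_{t})}
  &= \log\frac{q(\mathbf{x}_{0})}{q(\mathbf{x}_{T})}, \\
\mathbb{E}_{q}\!\left[-\log q(\mathbf{x}_{0})\right]
  &= H(\mathbf{x}_{0}).
\end{aligned}
```

The first line turns Eq. 23 into Eq. 24, the second gives Eq. 25, and the third accounts for the entropy term in Eq. 26. The marginals $q(\mathbf{x}_{t})$ involve the unknown data distribution, which is why this form is not tractable to estimate.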

Appendix B Experimental details

Our neural network architecture follows the backbone of PixelCNN++ [52], which is a U-Net [48] based on a Wide ResNet [72]. We replaced weight normalization [49] with group normalization [66] to make the implementation simpler. Our $32\times 32$ models use four feature map resolutions ( $32\times 32$ to $4\times 4$ ), and our $256\times 256$ models use six. All models have two convolutional residual blocks per resolution level and self-attention blocks at the $16\times 16$ resolution between the convolutional blocks [6]. Diffusion time $t$ is specified by adding the Transformer sinusoidal position embedding [60] into each residual block. Our CIFAR10 model has 35.7 million parameters, and our LSUN and CelebA-HQ models have 114 million parameters. We also trained a larger variant of the LSUN Bedroom model with approximately 256 million parameters by increasing filter count.
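
The timestep conditioning described above can be illustrated with a minimal NumPy sketch of a Transformer-style sinusoidal embedding. The embedding width (`dim=128`), the concatenated sine/cosine layout, and the function name `timestep_embedding` are assumptions made for illustration; the released code may differ in these details.

```python
import numpy as np


def timestep_embedding(timesteps, dim):
    """Sinusoidal embedding of integer timesteps, shape (len(timesteps), dim)."""
    assert dim % 2 == 0, "this sketch assumes an even embedding width"
    half = dim // 2
    # Frequencies 10000^(-i / half) for i = 0..half-1, decaying from 1 toward 1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(timesteps, dtype=np.float64)[:, None] * freqs[None, :]
    # Assumed layout: all sine features first, then all cosine features.
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)


emb = timestep_embedding([0, 1, 500, 999], dim=128)
print(emb.shape)
```

In a model of this kind, the embedding would typically be passed through a small learned projection before being added inside each residual block; that projection is omitted here.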

We used TPU v3-8 (similar to 8 V100 GPUs) for all experiments. Our CIFAR model trains at 21 steps per second at batch size 128 (10.6 hours to train to completion at 800k steps), and sampling a batch of 256 images takes 17 seconds. Our CelebA-HQ/LSUN (256²) models train at 2.2 steps per second at batch size 64, and sampling a batch of 128 images takes 300 seconds. We trained on CelebA-HQ for 0.5M steps, LSUN Bedroom for 2.4M steps, LSUN Cat for 1.8M steps, and LSUN Church for 1.2M steps. The larger LSUN Bedroom model was trained for 1.15M steps.
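
The quoted wall-clock figures follow from simple arithmetic, sketched below under the assumption that throughput stays constant over the whole run. The helper name `training_hours` is illustrative, and the CelebA-HQ estimate is derived here rather than reported in the text.

```python
def training_hours(num_steps, steps_per_second):
    """Wall-clock training time in hours at a constant step rate."""
    return num_steps / steps_per_second / 3600.0


# CIFAR10: 800k steps at 21 steps/s (the text reports 10.6 hours).
cifar_hours = training_hours(800_000, 21)

# CelebA-HQ: 0.5M steps at 2.2 steps/s (derived estimate, not a reported number).
celeba_hours = training_hours(500_000, 2.2)

# Amortized sampling cost per image, from the reported batch timings.
cifar_seconds_per_image = 17 / 256
large_seconds_per_image = 300 / 128

print(f"CIFAR10 training: {cifar_hours:.1f} h")
print(f"CelebA-HQ training: {celeba_hours:.1f} h")
print(f"sampling: {cifar_seconds_per_image:.3f} s/image vs {large_seconds_per_image:.2f} s/image")
```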

Apart from an initial choice of hyperparameters early on to make network size fit within memory constraints, we performed the majority of our hyperparameter search to optimize for CIFAR10 sample quality, then transferred the resulting settings over to the other datasets:

• We chose the $\beta_{t}$ schedule from a set of constant, linear, and quadratic schedules, all constrained so that $L_{T}\approx 0$. We set $T=1000$ without a sweep, and we chose a linear schedule from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$.
• We set the dropout rate on CIFAR10 to $0.1$ by sweeping over the values $\{0.1,0.2,0.3,0.4\}$. Without dropout on CIFAR10, we obtained poorer samples reminiscent of the overfitting artifacts in an unregularized PixelCNN++ [52]. We set dropout rate on the other datasets to zero without sweeping.
• We used random horizontal flips during training for CIFAR10; we tried training both with and without flips, and found flips to improve sample quality slightly. We also used random horizontal flips for all other datasets except LSUN Bedroom.
• We tried Adam [31] and RMSProp early on in our experimentation process and chose the former. We left the hyperparameters to their standard values. We set the learning rate to $2\times 10^{-4}$ without any sweeping, and we lowered it to $2\times 10^{-5}$ for the $256\times 256$ images, which seemed unstable to train with the larger learning rate.
• We set the batch size to 128 for CIFAR10 and 64 for larger images. We did not sweep over these values.
• We used EMA on model parameters with a decay factor of 0.9999. We did not sweep over this value.
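
The linear schedule from the first item above can be checked numerically. The sketch below builds $\beta_{1},\dots,\beta_{T}$ with NumPy and evaluates how much signal survives at step $T$; the per-coordinate KL formula assumes a data coordinate of magnitude 1 (data scaled to $[-1,1]$) and a standard normal prior, and the variable names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T, linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # alpha-bar_t = prod_{s<=t} (1 - beta_s)

# Fraction of signal variance left at the final step.
alpha_bar_T = alphas_cumprod[-1]

# KL( N(sqrt(abar) * x0, 1 - abar) || N(0, 1) ) for one coordinate, in nats.
# Assumption: |x0| = 1, the largest magnitude for data scaled to [-1, 1].
x0 = 1.0
kl_per_dim = 0.5 * (alpha_bar_T * x0**2 + (1.0 - alpha_bar_T) - 1.0
                    - np.log(1.0 - alpha_bar_T))

print(f"alpha_bar_T = {alpha_bar_T:.3e}, KL per dim = {kl_per_dim:.3e} nats")
```

Both quantities come out on the order of $10^{-5}$, which is consistent with the constraint $L_{T}\approx 0$.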

Final experiments were trained once and evaluated throughout training for sample quality. Sample quality scores and log likelihood are reported on the minimum FID value over the course of training. On CIFAR10, we calculated Inception and FID scores on 50000 samples using the original code from the OpenAI [51] and TTUR [21] repositories, respectively. On LSUN, we calculated FID scores on 50000 samples using code from the StyleGAN2 [30] repository. CIFAR10 and CelebA-HQ were loaded as provided by TensorFlow Datasets (https://www.tensorflow.org/datasets), and LSUN was prepared using code from StyleGAN. Dataset splits (or lack thereof) are standard from the papers that introduced their usage in a generative modeling context. All details can be found in the source code release.

Appendix C Discussion on related work

Our model architecture, forward process definition, and prior differ from NCSN [55, 56] in subtle but important ways that improve sample quality, and, notably, we directly train our sampler as a latent variable model rather than adding it after training post-hoc. In greater detail:

1. We use a U-Net with self-attention; NCSN uses a RefineNet with dilated convolutions. We condition all layers on $t$ by adding in the Transformer sinusoidal position embedding, rather than only in normalization layers (NCSNv1) or only at the output (v2).
2. Diffusion models scale down the data with each forward process step (by a $\sqrt{1-\beta_{t}}$ factor) so that variance does not grow when adding noise, thus providing consistently scaled inputs to the neural net reverse process. NCSN omits this scaling factor.
3. Unlike NCSN, our forward process destroys signal ($D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{T}|\mathbf{x}_{0})~{}\|~{}\mathcal{N}(\mathbf{0},\mathbf{I})\right)\approx 0$), ensuring a close match between the prior and aggregate posterior of $\mathbf{x}_{T}$. Also unlike NCSN, our $\beta_{t}$ are very small, which ensures that the forward process is reversible by a Markov chain with conditional Gaussians. Both of these factors prevent distribution shift when sampling.
4. Our Langevin-like sampler has coefficients (learning rate, noise scale, etc.) derived rigorously from $\beta_{t}$ in the forward process. Thus, our training procedure directly trains our sampler to match the data distribution after $T$ steps: it trains the sampler as a latent variable model using variational inference. In contrast, NCSN’s sampler coefficients are set by hand post-hoc, and their training procedure is not guaranteed to directly optimize a quality metric of their sampler.
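
The effect of the $\sqrt{1-\beta_{t}}$ scaling in item 2 can be seen by tracking the per-coordinate variance through the forward process. The sketch below assumes unit-variance data and the linear $\beta_{t}$ schedule from Appendix B; the unscaled variant is a hypothetical comparison that simply drops the scaling factor, and it is not NCSN's actual noise schedule.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)


def final_variance(betas, scale_data, initial_variance=1.0):
    """Per-coordinate variance after running all forward steps.

    One step is x_t = s * x_{t-1} + sqrt(beta_t) * eps with eps ~ N(0, 1),
    so Var[x_t] = s^2 * Var[x_{t-1}] + beta_t, where s^2 = 1 - beta_t when
    the data is scaled and s^2 = 1 when it is not.
    """
    variance = initial_variance
    for beta in betas:
        shrink = (1.0 - beta) if scale_data else 1.0
        variance = shrink * variance + beta
    return variance


scaled = final_variance(betas, scale_data=True)     # variance is preserved
unscaled = final_variance(betas, scale_data=False)  # variance grows by sum(betas)

print(f"with scaling: {scaled:.4f}, without scaling: {unscaled:.4f}")
```

With scaling, the variance stays at 1 at every step, so the network sees inputs of a consistent scale; without it, the variance grows by the sum of the $\beta_{t}$.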

Appendix D Samples

Additional samples

Figures 11, 13, 16, 17, 18, and 19 show uncurated samples from the diffusion models trained on CelebA-HQ, CIFAR10 and LSUN datasets.

Latent structure and reverse process stochasticity

During sampling, both the prior $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and Langevin dynamics are stochastic. To understand the significance of the second source of noise, we sampled multiple images conditioned on the same intermediate latent for the CelebA $256\times 256$ dataset. Figure 7 shows multiple draws from the reverse process $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ that share the latent $\mathbf{x}_{t}$ for $t\in\{1000,750,500,250\}$ . To accomplish this, we run a single reverse chain from an initial draw from the prior. At the intermediate timesteps, the chain is split to sample multiple images. When the chain is split after the prior draw at $\mathbf{x}_{T=1000}$ , the samples differ significantly. However, when the chain is split after more steps, samples share high-level attributes like gender, hair color, eyewear, saturation, pose and facial expression. This indicates that intermediate latents like $\mathbf{x}_{750}$ encode these attributes, despite their imperceptibility.
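
The chain-splitting procedure can be sketched as follows. This is a toy illustration under several assumptions: `eps_model` is a hypothetical stand-in that returns zeros where a trained noise-prediction network would be called, the reverse-process variance is taken to be $\sigma_{t}^{2}=\beta_{t}$, and the function names are invented for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)


def eps_model(x, t):
    # Hypothetical placeholder for the trained noise predictor (a U-Net in the paper).
    return np.zeros_like(x)


def reverse_step(x, t, rng):
    """One ancestral sampling step x_t -> x_{t-1}, for t in 1..T."""
    i = t - 1
    coef = betas[i] / np.sqrt(1.0 - alphas_cumprod[i])
    mean = (x - coef * eps_model(x, t)) / np.sqrt(alphas[i])
    # No noise is added on the final step (t = 1).
    noise = rng.standard_normal(x.shape) if t > 1 else np.zeros_like(x)
    return mean + np.sqrt(betas[i]) * noise


def run_chain(x, t_start, t_end, rng):
    """Run the reverse process from x_{t_start} down to x_{t_end}."""
    for t in range(t_start, t_end, -1):
        x = reverse_step(x, t, rng)
    return x


def split_samples(shape, t_split, num_branches, seed=0):
    """Share one chain down to x_{t_split}, then branch with independent noise."""
    rng = np.random.default_rng(seed)
    x_T = rng.standard_normal(shape)           # draw from the prior
    x_split = run_chain(x_T, T, t_split, rng)  # shared intermediate latent
    branches = [
        run_chain(x_split.copy(), t_split, 0, np.random.default_rng(seed + 1 + k))
        for k in range(num_branches)
    ]
    return x_split, branches


x_split, branches = split_samples(shape=(4,), t_split=250, num_branches=3)
```

With a trained model in place of `eps_model`, a later split point (smaller `t_split`) leaves fewer stochastic steps per branch, which matches the observation that those samples share more attributes.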

Coarse-to-fine interpolation

Figure 9 shows interpolations between a pair of source CelebA $256\times 256$ images as we vary the number of diffusion steps prior to latent space interpolation. Increasing the number of diffusion steps destroys more structure in the source images, which the model completes during the reverse process. This allows us to interpolate at both fine granularities and coarse granularities. In the limiting case of $0$ diffusion steps, the interpolation mixes source images in pixel space. On the other hand, after $1000$ diffusion steps, source information is lost and interpolations are novel samples.
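
The encoding half of this procedure can be sketched in NumPy: both source images are diffused for the same number of steps using the closed-form $q(\mathbf{x}_{t}|\mathbf{x}_{0})$, and the latents are mixed linearly. Decoding the mixed latent with the reverse process requires a trained model and is omitted. The function names, the independent noise draws for the two images, and the linear $\beta_{t}$ schedule are assumptions of this sketch.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)


def diffuse(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form; t = 0 returns x0 unchanged."""
    if t == 0:
        return x0
    abar = alphas_cumprod[t - 1]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise


def interpolate_latents(x0_a, x0_b, t, lam, rng):
    """Diffuse both sources for t steps, then mix the latents with weight lam."""
    x_t_a = diffuse(x0_a, t, rng.standard_normal(x0_a.shape))
    x_t_b = diffuse(x0_b, t, rng.standard_normal(x0_b.shape))
    return (1.0 - lam) * x_t_a + lam * x_t_b


rng = np.random.default_rng(0)
image_a = np.ones((8, 8, 3))    # toy stand-ins for two source images in [-1, 1]
image_b = -np.ones((8, 8, 3))

pixel_mix = interpolate_latents(image_a, image_b, t=0, lam=0.25, rng=rng)
coarse_mix = interpolate_latents(image_a, image_b, t=1000, lam=0.25, rng=rng)
```

At `t=0` the result is exactly the pixel-space mixture of the two sources, while at `t=1000` the latent is almost entirely noise, so decoding it would yield a novel sample.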