
Diffusion Meets Flow Matching: Two Sides of the Same Coin

Flow matching and diffusion models are two popular frameworks in generative modeling. Despite seeming similar, there is some confusion in the community about their exact connection. In this post, we aim to clear up this confusion and show that diffusion models and Gaussian flow matching are the same, although different model specifications can lead to different network outputs and sampling schedules. This is great news: it means you can use the two frameworks interchangeably.

Flow matching has gained popularity recently, due to the simplicity of its formulation and the “straightness” of its induced sampling trajectories. This raises the commonly asked question:

"Which is better, diffusion or flow matching?"
"扩散和流动匹配哪个更好?

As we will see, diffusion models and flow matching are equivalent (for the common special case that the source distribution used with flow matching corresponds to a Gaussian), so there is no single answer to this question. In particular, we will show how to convert one formalism to another. But why does this equivalence matter? Well, it allows you to mix and match techniques developed from the two frameworks. For example, after training a flow matching model, you can use either a stochastic or deterministic sampling method (contrary to the common belief that flow matching is always deterministic).

We will focus on the most commonly used flow matching formalism with the optimal transport path, which is closely related to rectified flow and stochastic interpolants. Our purpose is not to recommend one approach over another (both frameworks are valuable, each rooted in distinct theoretical perspectives, and it's actually even more encouraging that they lead to the same algorithm in practice), but rather to help practitioners understand and feel confident about using these frameworks interchangeably, while understanding the true degrees of freedom one has when tuning the algorithm, regardless of what it's called.

Check this Google Colab for code used to produce plots and animations in this post.

Overview

We start with a quick overview of the two frameworks.

Diffusion models

A diffusion process gradually destroys an observed datapoint $x$ (such as an image) over time $t$, by mixing the data with Gaussian noise. The noisy data at time $t$ is given by a forward process: (1) $z_t = \alpha_t x + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Here $\alpha_t$ and $\sigma_t$ define the noise schedule. A noise schedule is called variance-preserving if $\alpha_t^2 + \sigma_t^2 = 1$. The noise schedule is designed such that $z_0$ is close to the clean data and $z_1$ is close to Gaussian noise.
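As a concrete illustration, here is a minimal NumPy sketch of the forward process in Equation (1); the cosine schedule and the toy datapoint are arbitrary choices made for this example (this is not the code from the linked Colab):

```python
import numpy as np

def cosine_schedule(t):
    # A variance-preserving schedule: alpha_t^2 + sigma_t^2 = 1.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def forward(x, t, rng=np.random.default_rng(0)):
    # Equation (1): z_t = alpha_t * x + sigma_t * eps, with eps ~ N(0, I).
    alpha_t, sigma_t = cosine_schedule(t)
    eps = rng.standard_normal(np.shape(x))
    return alpha_t * x + sigma_t * eps, eps

x = np.array([1.0, -0.5])        # a toy "datapoint"
z_half, eps = forward(x, t=0.5)  # noisy sample halfway through the process
```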

To generate new samples, we can "reverse" the forward process: we initialize the sample $z_1$ from a standard Gaussian. Given the sample $z_t$ at time step $t$, we predict what the clean sample might look like with a neural network (a.k.a. denoiser model) $\hat{x} = \hat{x}(z_t; t)$, and then we project it back to a lower noise level $s$ with the same forward transformation:

(2) $z_s = \alpha_s \hat{x} + \sigma_s \hat{\epsilon}$, where $\hat{\epsilon} = (z_t - \alpha_t \hat{x}) / \sigma_t$. (Alternatively, we can train a neural network to predict the noise $\hat{\epsilon}$.) We keep alternating between predicting the clean data and projecting it back to a lower noise level until we get a clean sample. This is the DDIM sampler. The randomness of samples comes only from the initial Gaussian sample, and the entire reverse process is deterministic. We will discuss stochastic samplers later.
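Below is a minimal sketch of this sampling loop; `denoise` is a hypothetical placeholder for the trained model $\hat{x}(z_t; t)$, `schedule` returns $(\alpha_t, \sigma_t)$ (e.g. the `cosine_schedule` above), and the number of steps is arbitrary:

```python
import numpy as np

def ddim_sample(denoise, schedule, num_steps=100, dim=2, rng=np.random.default_rng(0)):
    """DDIM: alternate between predicting the clean data and projecting it
    back to a lower noise level, as in Equation (2)."""
    z = rng.standard_normal(dim)                     # z_1 ~ N(0, I)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):                # step from t down to s < t
        alpha_t, sigma_t = schedule(t)
        alpha_s, sigma_s = schedule(s)
        x_hat = denoise(z, t)                        # predicted clean sample
        eps_hat = (z - alpha_t * x_hat) / sigma_t    # implied noise estimate
        z = alpha_s * x_hat + sigma_s * eps_hat      # Equation (2)
    return z
```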

Flow matching

In flow matching, we view the forward process as a linear interpolation between the data $x$ and a noise term $\epsilon$: (3) $z_t = (1-t) x + t \epsilon$.

This corresponds to the diffusion forward process if the noise is Gaussian (a.k.a. Gaussian flow matching) and we use the schedule $\alpha_t = 1-t$, $\sigma_t = t$.

Using simple algebra, we can derive that $z_t = z_s + u \cdot (t - s)$ for $s < t$, where $u = \epsilon - x$ is the "velocity", "flow", or "vector field". Hence, to sample $z_s$ given $z_t$, we reverse time and replace the vector field with our best guess at time $t$, $\hat{u} = \hat{u}(z_t; t) = \hat{\epsilon} - \hat{x}$, represented by a neural network, to get

(4) $z_s = z_t + \hat{u} \cdot (s - t)$.

Initializing the sample $z_1$ from a standard Gaussian, we keep getting $z_s$ at a lower noise level than $z_t$, until we obtain the clean sample.
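A matching sketch of the flow matching (Euler) sampler, where `vector_field` is a hypothetical placeholder for the learned $\hat{u}(z_t; t)$:

```python
import numpy as np

def flow_matching_sample(vector_field, num_steps=100, dim=2, rng=np.random.default_rng(0)):
    """Euler integration of Equation (4): z_s = z_t + u_hat * (s - t)."""
    z = rng.standard_normal(dim)          # z_1 ~ N(0, I)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        u_hat = vector_field(z, t)        # model's guess of u = eps - x
        z = z + u_hat * (s - t)           # Equation (4), with s < t
    return z
```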

Comparison

So far, we can already see the shared essence of the two frameworks:

1. Same forward process, if we assume that one end of flow matching is Gaussian, and the noise schedule of the diffusion model is in a particular form.

2. "Similar" sampling processes: both follow an iterative update that involves a guess of the clean data at the current time step. (Spoiler: below we will show they are exactly the same!)
2."相似 "的采样过程:两者都采用迭代更新的方式,包括对当前时间步的干净数据进行猜测。(剧透:下面我们将展示它们是完全一样的!)。

Sampling

It is commonly thought that the two frameworks differ in how they generate samples: Flow matching sampling is deterministic with “straight” paths, while diffusion model sampling is stochastic and follows “curved paths”. Below, we clarify this misconception. We will focus on deterministic sampling first, since it is simpler, and will discuss the stochastic case later on.

Imagine you want to use your trained denoiser model to transform random noise into a datapoint. Recall that the DDIM update is given by $z_s = \alpha_s \hat{x} + \sigma_s \hat{\epsilon}$. Interestingly, by rearranging terms it can be expressed in the following formulation, with respect to several sets of network outputs and reparametrizations:

(5) $\tilde{z}_s = \tilde{z}_t + \text{Network output} \cdot (\eta_s - \eta_t)$

| Network output | Reparametrization |
| --- | --- |
| $\hat{x}$-prediction | $\tilde{z}_t = z_t / \sigma_t$ and $\eta_t = \alpha_t / \sigma_t$ |
| $\hat{\epsilon}$-prediction | $\tilde{z}_t = z_t / \alpha_t$ and $\eta_t = \sigma_t / \alpha_t$ |
| $\hat{u}$-flow matching vector field | $\tilde{z}_t = z_t / (\alpha_t + \sigma_t)$ and $\eta_t = \sigma_t / (\alpha_t + \sigma_t)$ |

Remember the flow matching update in Equation (4)? This should look similar. If we set the network output to $\hat{u}$ in the last line and let $\alpha_t = 1-t$, $\sigma_t = t$, we have $\tilde{z}_t = z_t$ and $\eta_t = t$, which is the flow matching update! More formally, the flow matching update is an Euler sampler of the sampling ODE (i.e., $\mathrm{d}z_t = \hat{u}\,\mathrm{d}t$), and with the flow matching noise schedule,

Diffusion with DDIM sampler == Flow matching sampler (Euler).
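As a quick numeric check of this claim (a sketch with an arbitrary fixed model prediction, so that both update rules see the same $\hat{x}$):

```python
import numpy as np

rng = np.random.default_rng(0)
z_t, x_hat = rng.standard_normal(2), rng.standard_normal(2)  # current sample, model's x-prediction
t, s = 0.7, 0.6                                              # step from t down to s

alpha = lambda t: 1.0 - t      # flow matching schedule
sigma = lambda t: t

# DDIM step (Equation (2)).
eps_hat = (z_t - alpha(t) * x_hat) / sigma(t)
z_s_ddim = alpha(s) * x_hat + sigma(s) * eps_hat

# Flow matching Euler step (Equation (4)) with u_hat = eps_hat - x_hat.
z_s_fm = z_t + (eps_hat - x_hat) * (s - t)

assert np.allclose(z_s_ddim, z_s_fm)  # identical updates
```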

Some other comments on the DDIM sampler:

  1. The DDIM sampler analytically integrates the reparametrized sampling ODE (i.e., $\mathrm{d}\tilde{z}_t = [\text{Network output}]\,\mathrm{d}\eta_t$) if the network output is constant over time. Of course the network prediction is not constant, but this means the inaccuracy of the DDIM sampler only comes from approximating the intractable integral of the network output (unlike the Euler sampler of the probability flow ODE, which involves an additional linear term of $z_t$). The DDIM sampler can be considered a first-order Euler sampler of the reparametrized sampling ODE, which has the same update rule for different network outputs. However, if one uses a higher-order ODE solver, the network output can make a difference, which means the $\hat{u}$ output proposed by flow matching can make a difference from diffusion models.

  2. The DDIM sampler is invariant to a linear scaling applied to the noise schedule $\alpha_t$ and $\sigma_t$, as scaling does not affect $\tilde{z}_t$ and $\eta_t$ (a short derivation sketch follows this list). This is not true for other samplers, e.g. the Euler sampler of the probability flow ODE.
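A short derivation sketch of point 2: suppose the schedule is rescaled by a time-dependent factor $c_t > 0$, i.e. $\alpha_t' = c_t \alpha_t$ and $\sigma_t' = c_t \sigma_t$, so that the noisy sample becomes $z_t' = c_t z_t$. The quantities entering the reparametrized update (5) are then unchanged, e.g. for the $\hat{x}$-prediction row:

$\tilde{z}_t' = \frac{z_t'}{\sigma_t'} = \frac{c_t z_t}{c_t \sigma_t} = \tilde{z}_t, \qquad \eta_t' = \frac{\alpha_t'}{\sigma_t'} = \frac{\alpha_t}{\sigma_t} = \eta_t,$

and the same cancellation happens for the other rows of the table, so the DDIM update is identical under the rescaled schedule.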

To validate Claim 2, we present the results obtained using several noise schedules, each of which follows a flow matching schedule ($\alpha_t = 1-t$, $\sigma_t = t$) with a different scaling factor. Feel free to change the slider below the figure. At the left end, the scaling factor is $1$, which is exactly the flow matching schedule (FM), while at the right end, the scaling factor is $1/\sqrt{(1-t)^2 + t^2}$, which corresponds to a variance-preserving schedule (VP). We see that DDIM (and the flow matching sampler) always gives the same final data samples, regardless of the scaling of the schedule. The paths bend in different ways because we are showing $z_t$ (but not $\tilde{z}_t$), which is scale-dependent along the path. For the Euler sampler of the probability flow ODE, the scaling makes a true difference: we see that both the paths and the final samples change.

Wait a second! People often say flow matching results in straight paths, but in the above figure, the sampling trajectories look curved.

Well first, why do they say that? If the model were perfectly confident about the data point it is moving to, the path from noise to data would be a straight line under the flow matching noise schedule. Straight-line ODEs would be great, because they incur no integration error whatsoever. Unfortunately, the predictions are not for a single point; instead, they average over a larger distribution. And flowing straight to a point != flowing straight to a distribution.

In the interactive graph below, you can change the variance of the data distribution on the right hand side by the slider. Note how the variance preserving schedule is better (straighter paths) for wide distributions, while the flow matching schedule works better for narrow distributions.

Finding such straight paths for real-life datasets like images is of course much less straightforward. But the conclusion remains the same: The optimal integration method depends on the data distribution.

Two important takeaways from deterministic sampling:

1. Equivalence in samplers: DDIM is equivalent to the flow matching sampler, and is invariant to a linear scaling of the noise schedule.

2. Straightness misnomer: The flow matching schedule only yields straight paths for a model predicting a single point. For realistic distributions, other schedules can give straighter paths.

Training

Diffusion models are trained by estimating $\hat{x} = \hat{x}(z_t; t)$, or alternatively $\hat{\epsilon} = \hat{\epsilon}(z_t; t)$, with a neural net. Learning the model is done by minimizing a weighted mean squared error (MSE) loss:

(6) $\mathcal{L}(x) = \mathbb{E}_{t \sim U(0,1),\, \epsilon \sim \mathcal{N}(0,I)} \left[ w(\lambda_t) \cdot \left( -\tfrac{\mathrm{d}\lambda}{\mathrm{d}t} \right) \cdot \| \hat{\epsilon} - \epsilon \|_2^2 \right],$

where $\lambda_t = \log(\alpha_t^2 / \sigma_t^2)$ is the log signal-to-noise ratio, and $w(\lambda_t)$ is the weighting function, balancing the importance of the loss at different noise levels. The term $\mathrm{d}\lambda / \mathrm{d}t$ in the training objective seems unnatural, and in the literature it is often merged with the weighting function. However, keeping them separate helps disentangle the training noise schedule from the weighting function, and emphasizes the more important design choice: the weighting function.
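A minimal sketch of a single Monte Carlo draw of this loss under the cosine schedule; `model_eps` and `weight_fn` are hypothetical placeholders for the $\hat{\epsilon}$-network and the weighting $w(\lambda)$:

```python
import numpy as np

def diffusion_loss(model_eps, weight_fn, x, rng=np.random.default_rng(0)):
    """One-sample Monte Carlo estimate of Equation (6) with the cosine schedule."""
    t = rng.uniform(1e-3, 1.0 - 1e-3)
    alpha_t, sigma_t = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    logsnr = 2.0 * (np.log(alpha_t) - np.log(sigma_t))   # lambda_t
    dlogsnr_dt = -np.pi / (alpha_t * sigma_t)            # d lambda / dt (negative)
    eps = rng.standard_normal(np.shape(x))
    z_t = alpha_t * x + sigma_t * eps                    # Equation (1)
    eps_hat = model_eps(z_t, t)
    return weight_fn(logsnr) * (-dlogsnr_dt) * np.sum((eps_hat - eps) ** 2)
```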

Flow matching also fits into the above training objective. Recall the conditional flow matching objective:

(7) $\mathcal{L}_{\mathrm{CFM}}(x) = \mathbb{E}_{t \sim U(0,1),\, \epsilon \sim \mathcal{N}(0,I)} \left[ \| \hat{u} - u \|_2^2 \right]$

Since $\hat{u}$ can be expressed as a linear combination of $\hat{\epsilon}$ and $z_t$, the CFM training objective can be rewritten as a mean squared error on $\epsilon$ with a specific weighting.
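Concretely (a short derivation sketch, using $x = (z_t - \sigma_t \epsilon)/\alpha_t$ and $\hat{x} = (z_t - \sigma_t \hat{\epsilon})/\alpha_t$ at a fixed $z_t$):

$\hat{u} - u = (\hat{\epsilon} - \hat{x}) - (\epsilon - x) = (\hat{\epsilon} - \epsilon) + \frac{\sigma_t}{\alpha_t}(\hat{\epsilon} - \epsilon) = \left(1 + e^{-\lambda_t/2}\right)(\hat{\epsilon} - \epsilon),$

so $\|\hat{u} - u\|_2^2 = (e^{-\lambda_t/2} + 1)^2 \|\hat{\epsilon} - \epsilon\|_2^2$: an $\epsilon$-MSE with an extra $\lambda$-dependent weight (the last row of the table below).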

How do we choose what the network should output?

Below we summarize several network outputs proposed in the literature, including a few versions used by diffusion models and the one used by flow matching. They can be derived from each other given the current data $z_t$. One may see the training objective defined with respect to the MSE of different network outputs in the literature. From the perspective of the training objective, they all correspond to having some additional weighting in front of the $\epsilon$-MSE that can be absorbed into the weighting function.

| Network output | Formulation | MSE on network output |
| --- | --- | --- |
| $\hat{\epsilon}$-prediction | $\hat{\epsilon}$ | $\|\hat{\epsilon} - \epsilon\|_2^2$ |
| $\hat{x}$-prediction | $\hat{x} = (z_t - \sigma_t \hat{\epsilon}) / \alpha_t$ | $\|\hat{x} - x\|_2^2 = e^{-\lambda} \|\hat{\epsilon} - \epsilon\|_2^2$ |
| $\hat{v}$-prediction | $\hat{v} = \alpha_t \hat{\epsilon} - \sigma_t \hat{x}$ | $\|\hat{v} - v\|_2^2 = \alpha_t^2 (e^{-\lambda} + 1)^2 \|\hat{\epsilon} - \epsilon\|_2^2$ |
| $\hat{u}$-flow matching vector field | $\hat{u} = \hat{\epsilon} - \hat{x}$ | $\|\hat{u} - u\|_2^2 = (e^{-\lambda/2} + 1)^2 \|\hat{\epsilon} - \epsilon\|_2^2$ |
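These conversions are one-liners in code; here is a small sketch assuming the network was trained as an $\hat{\epsilon}$-predictor (the expressions mirror the Formulation column above):

```python
def outputs_from_eps(eps_hat, z_t, alpha_t, sigma_t):
    """Convert an eps-prediction into the other network outputs in the table."""
    x_hat = (z_t - sigma_t * eps_hat) / alpha_t   # x-prediction
    v_hat = alpha_t * eps_hat - sigma_t * x_hat   # v-prediction
    u_hat = eps_hat - x_hat                       # flow matching vector field
    return x_hat, v_hat, u_hat
```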

In practice, however, the model output might make a difference. For example, an $\hat{\epsilon}$-prediction is problematic at high noise levels, where $\alpha_t$ is close to $0$ and any error in $\hat{\epsilon}$ is amplified when converted to $\hat{x} = (z_t - \sigma_t \hat{\epsilon}) / \alpha_t$; symmetrically, an $\hat{x}$-prediction is problematic at low noise levels.

Therefore, a heuristic is to choose a network output that is a combination of $\hat{x}$- and $\hat{\epsilon}$-predictions, which applies to the $\hat{v}$-prediction and the flow matching vector field $\hat{u}$.

How do we choose the weighting function?

The weighting function is the most important part of the loss. It balances the importance of high-frequency and low-frequency components in perceptual data such as images, videos, and audio. This is crucial, as certain high-frequency components in those signals are not perceptible to humans, and thus it is better not to waste model capacity on them when the model capacity is limited. Viewing losses via their weightings, one can derive the following non-obvious result:

Flow matching weighting == diffusion weighting of $v$-MSE loss + cosine noise schedule.
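As a quick sanity check of this claim (a sketch, not the full derivation referenced below), take the flow matching schedule $\alpha_t = 1-t$, $\sigma_t = t$, so that $e^{-\lambda_t/2} = t/(1-t)$ and $-\mathrm{d}\lambda/\mathrm{d}t = 2/(t(1-t))$. Writing the CFM loss in the form of Equation (6),

$\|\hat{u} - u\|_2^2 = \left(1 + e^{-\lambda_t/2}\right)^2 \|\hat{\epsilon} - \epsilon\|_2^2 = \frac{1}{(1-t)^2} \|\hat{\epsilon} - \epsilon\|_2^2 \quad\Rightarrow\quad w_{\mathrm{FM}}(\lambda) = \frac{1}{(1-t)^2} \cdot \frac{t(1-t)}{2} = \frac{e^{-\lambda/2}}{2},$

which, up to a constant, is the same $e^{-\lambda/2}$ weighting that the $v$-MSE loss induces under the cosine schedule.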

That is, the conditional flow matching objective in Equation (7) is the same as a commonly used setting in diffusion models! See Appendix D.2-3 of Kingma and Gao (2024) for a detailed derivation. Below we plot several commonly used weighting functions in the literature, as a function of $\lambda$.

The flow matching weighting (also the $v$-MSE + cosine schedule weighting) decreases exponentially as $\lambda$ increases. Empirically, we find another interesting connection: the Stable Diffusion 3 weighting, a reweighted version of flow matching, is very similar to the EDM weighting that is popular for diffusion models.

How do we choose the training noise schedule?

We discuss the training noise schedule last, as it should be the least important to training for the following reasons:

  1. The training loss is invariant to the training noise schedule (the change of variables is spelled out after this list). Specifically, the loss function can be rewritten as $\mathcal{L}(x) = \int_{\lambda_{\min}}^{\lambda_{\max}} w(\lambda)\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \| \hat{\epsilon} - \epsilon \|_2^2 \right] \mathrm{d}\lambda$, which only depends on the endpoints ($\lambda_{\max}$, $\lambda_{\min}$), but not on the schedule $\lambda_t$ in between. In practice, one should choose $\lambda_{\max}$, $\lambda_{\min}$ such that the two ends are close enough to the clean data and Gaussian noise respectively. $\lambda_t$ might still affect the variance of the Monte Carlo estimator of the training loss. A few heuristics have been proposed in the literature to automatically adjust the noise schedule over the course of training; this blog post has a nice summary.
  2. Similar to the sampling noise schedule, the training noise schedule is invariant to a linear scaling, as one can simply apply a linear scaling to $z_t$ and undo it at the network input to get the equivalence. The key defining property of a noise schedule is the log signal-to-noise ratio $\lambda_t$.
  3. One can choose completely different noise schedules for training and sampling, based on distinct heuristics: For training, it is desirable to have a noise schedule that minimizes the variance of the Monte Carlo estimator, whereas for sampling the noise schedule is more related to the discretization error of the ODE / SDE sampling trajectories and the model curvature.
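To spell out the change of variables behind point 1 (a sketch, assuming $\lambda_t$ decreases monotonically from $\lambda_{\max}$ at $t=0$ to $\lambda_{\min}$ at $t=1$): starting from Equation (6),

$\mathcal{L}(x) = \int_0^1 w(\lambda_t) \left(-\tfrac{\mathrm{d}\lambda}{\mathrm{d}t}\right) \mathbb{E}_{\epsilon}\left[\|\hat{\epsilon} - \epsilon\|_2^2\right] \mathrm{d}t = \int_{\lambda_{\min}}^{\lambda_{\max}} w(\lambda)\, \mathbb{E}_{\epsilon}\left[\|\hat{\epsilon} - \epsilon\|_2^2\right] \mathrm{d}\lambda,$

so two training schedules with the same endpoints give the same loss; only the variance of the Monte Carlo estimate of this integral depends on $\lambda_t$.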

Summary

A few takeaways for training of diffusion models / flow matching:

1. Equivalence in weightings: The weighting function is important for training; it balances the importance of different frequency components of perceptual data. The flow matching weighting coincidentally matches commonly used diffusion training weightings in the literature.

2. Insignificance of training noise schedule: The noise schedule is far less important to the training objective, but can affect the training efficiency.

3. Difference in network outputs: The network output proposed by flow matching is new; it nicely balances $\hat{x}$- and $\hat{\epsilon}$-prediction, similar to $\hat{v}$-prediction.

Diving deeper into samplers

In this section, we discuss different kinds of samplers in more detail.

Reflow operator

The Reflow operation in flow matching connects noise and data points with a straight line. One can obtain these (data, noise) pairs by running a deterministic sampler starting from noise. A model can then be trained to directly predict the data given the noise, avoiding the need for iterative sampling. In the diffusion literature, the same approach was one of the first distillation techniques.

Deterministic sampler vs. stochastic sampler

So far we have only discussed the deterministic sampler of diffusion models or flow matching. An alternative is to use a stochastic sampler such as the DDPM sampler.

Performing one DDPM sampling step going from $\lambda_t$ to $\lambda_t + \Delta\lambda$ is exactly equivalent to performing one DDIM sampling step to $\lambda_t + 2\Delta\lambda$ and then renoising to $\lambda_t + \Delta\lambda$ by doing forward diffusion. That is, the renoising by forward diffusion reverses exactly half the progress made by DDIM. To see this, let's take a look at a 2D example. Starting from the same mixture-of-Gaussians distribution, we can take either a small DDIM sampling step with the sign of the update reversed (left), or a small forward diffusion step (right):

For individual samples, these updates behave quite differently: the reversed DDIM update consistently pushes each sample away from the modes of the distribution, while the diffusion update is entirely random. However, when aggregating all samples, the resulting distributions after the updates are identical. Consequently, if we perform a DDIM sampling step (without reversing the sign) followed by a forward diffusion step, the overall distribution remains unchanged from the one prior to these updates.
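Here is a minimal NumPy sketch of one such stochastic update: a DDIM step that overshoots past the target noise level, followed by renoising back up with the Gaussian forward-diffusion transition. `denoise` is a hypothetical placeholder for the trained model, and for simplicity the overshoot is parameterized in $t$ rather than in $\lambda$; the overshoot fraction is the free hyperparameter discussed next.

```python
import numpy as np

def fm_schedule(t):
    return 1.0 - t, t                                   # alpha_t, sigma_t (illustrative)

def stochastic_step(z, t, s, denoise, schedule=fm_schedule, churn=0.5,
                    rng=np.random.default_rng(0)):
    """One update from noise level t down to s (s < t).

    churn = 0 recovers the deterministic DDIM step; churn > 0 first takes a
    DDIM step that overshoots to a lower noise level u, then renoises back
    up to s by forward diffusion."""
    alpha_t, sigma_t = schedule(t)
    x_hat = denoise(z, t)                               # model's guess of the clean data
    eps_hat = (z - alpha_t * x_hat) / sigma_t

    # DDIM step that overshoots to level u <= s.
    u = max(s - churn * (t - s), 0.0)
    alpha_u, sigma_u = schedule(u)
    z_u = alpha_u * x_hat + sigma_u * eps_hat

    # Renoise from level u back up to level s with fresh Gaussian noise,
    # using the forward transition implied by Equation (1).
    alpha_s, sigma_s = schedule(s)
    ratio = alpha_s / alpha_u
    sigma_renoise = np.sqrt(max(sigma_s**2 - (ratio * sigma_u)**2, 0.0))
    return ratio * z_u + sigma_renoise * rng.standard_normal(np.shape(z))
```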

The fraction of the DDIM step to undo by renoising is a hyperparameter which we are free to choose (i.e., it does not have to be exactly half of the DDIM step), and which has been called the level of churn. Interestingly, the effect of adding churn to our sampler is to diminish the effect on our final sample of the model predictions made early during sampling, and to increase the weight on later predictions. This is shown in the figure below:

Here we ran different samplers for 100 sampling steps, using a cosine noise schedule and $\hat{v}$-prediction. Ignoring nonlinear interactions, the final sample produced by the sampler can be written as a weighted sum of the predictions $\hat{v}_t$ made during sampling and a Gaussian noise $e$: $z_0 = \sum_t h_t \hat{v}_t + \sum_t c_t e$. The weights $h_t$ of these predictions are shown on the y-axis, for the different diffusion times $t$ shown on the x-axis. DDIM results in an equal weighting of the $\hat{v}$-predictions for this setting, whereas DDPM puts more emphasis on predictions made towards the end of sampling. Analytic expressions of these weights for the $\hat{x}$- and $\hat{\epsilon}$-predictions are also available in the literature.

SDE and ODE Perspective

We've observed the practical equivalence between diffusion models and flow matching algorithms. Here, for theoretical completeness, we formally describe the equivalence of the forward and sampling processes using SDEs and ODEs.

Diffusion models

The forward process of diffusion models, which gradually destroys data over time, can be described by the following stochastic differential equation (SDE):

(8) $\mathrm{d}z_t = f_t z_t\, \mathrm{d}t + g_t\, \mathrm{d}z,$

where $\mathrm{d}z$ is an infinitesimal Gaussian (formally, a Brownian motion). $f_t$ and $g_t$ determine the noise schedule. The generative process is given by the reverse of the forward process, whose formula is given by

(9) $\mathrm{d}z_t = \left( f_t z_t - \frac{1 + \eta_t^2}{2} g_t^2 \nabla \log p_t(z_t) \right) \mathrm{d}t + \eta_t g_t\, \mathrm{d}z,$

where $\nabla \log p_t$ is the score of the forward process.

Note that we have introduced an additional parameter $\eta_t$ which controls the amount of stochasticity at inference time. This is related to the churn parameter introduced before. When discretizing the backward process, we recover DDIM in the case $\eta_t = 0$ and DDPM in the case $\eta_t = 1$.

Flow matching

The interpolation between $x$ and $\epsilon$ in flow matching can be described by the following ordinary differential equation (ODE):

(10) $\mathrm{d}z_t = u_t\, \mathrm{d}t.$

Assuming the interpolation is $z_t = \alpha_t x + \sigma_t \epsilon$, then $u_t = \dot{\alpha}_t x + \dot{\sigma}_t \epsilon$.

The generative process simply reverses the ODE in time and replaces $u_t$ by its conditional expectation with respect to $z_t$. This is a specific case of stochastic interpolants, in which case it can be generalized to an SDE:

(11) $\mathrm{d}z_t = \left( u_t - \tfrac{1}{2} \varepsilon_t^2 \nabla \log p_t(z_t) \right) \mathrm{d}t + \varepsilon_t\, \mathrm{d}z,$ where $\varepsilon_t$ controls the amount of stochasticity at inference time.

Equivalence of the two frameworks

Both frameworks are defined by three hyperparameters each: $f_t, g_t, \eta_t$ for diffusion, and $\alpha_t, \sigma_t, \varepsilon_t$ for flow matching. We can show the equivalence by deriving one set of hyperparameters from the other. From diffusion to flow matching:

$\alpha_t = \exp\left( \int_0^t f_s\, \mathrm{d}s \right), \quad \sigma_t = \alpha_t \left( \int_0^t g_s^2 \exp\left( -2 \int_0^s f_u\, \mathrm{d}u \right) \mathrm{d}s \right)^{1/2}, \quad \varepsilon_t = \eta_t g_t.$

From flow matching to diffusion:

$f_t = \partial_t \log(\alpha_t), \quad g_t^2 = 2 \alpha_t \sigma_t\, \partial_t(\sigma_t / \alpha_t), \quad \eta_t = \varepsilon_t / \left( 2 \alpha_t \sigma_t\, \partial_t(\sigma_t / \alpha_t) \right)^{1/2}.$
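As a small illustration, the flow-matching-to-diffusion direction can be evaluated numerically; this sketch uses finite differences for the time derivatives and the flow matching schedule with $\varepsilon_t = 1$ purely as an example:

```python
import numpy as np

def fm_to_diffusion(alpha, sigma, eps_infer, t, dt=1e-5):
    """Convert flow matching hyperparameters (alpha_t, sigma_t, eps_t) into
    diffusion hyperparameters (f_t, g_t, eta_t) using the formulas above."""
    d_log_alpha = (np.log(alpha(t + dt)) - np.log(alpha(t - dt))) / (2 * dt)
    d_ratio = (sigma(t + dt) / alpha(t + dt) - sigma(t - dt) / alpha(t - dt)) / (2 * dt)
    f_t = d_log_alpha
    g2_t = 2.0 * alpha(t) * sigma(t) * d_ratio
    eta_t = eps_infer(t) / np.sqrt(g2_t)
    return f_t, np.sqrt(g2_t), eta_t

# Example: flow matching schedule alpha_t = 1 - t, sigma_t = t, with eps_t = 1.
f_t, g_t, eta_t = fm_to_diffusion(lambda t: 1.0 - t, lambda t: t, lambda t: 1.0, t=0.5)
```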

In summary, aside from training considerations and sampler selection, diffusion and Gaussian flow matching exhibit no fundamental differences.

Closing takeaways

If you've read this far, hopefully we've convinced you that diffusion models and Gaussian flow matching are equivalent. However, we highlight two new model specifications that Gaussian flow matching brings to the field:

1. Network output: the vector field $\hat{u}$ parametrization, which nicely balances $\hat{x}$- and $\hat{\epsilon}$-prediction, similar to $\hat{v}$-prediction.

2. Sampling noise schedule: the simple linear schedule $\alpha_t = 1-t$, $\sigma_t = t$.

It would be interesting to investigate the importance of these two model specifications empirically in different real-world applications, which we leave to future work. It is also an exciting research area to apply flow matching to more general cases where the source distribution is non-Gaussian, e.g. for more structured data like proteins.

Acknowledgements

Thanks to our colleagues at Google DeepMind for fruitful discussions. In particular, thanks to Sander Dieleman, Ben Poole and Aleksander Hołyński.


References

    1. Flow matching for generative modeling
       Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M. and Le, M., 2022. arXiv preprint arXiv:2210.02747.
    2. Flow straight and fast: Learning to generate and transfer data with rectified flow
       Liu, X., Gong, C. and Liu, Q., 2022. arXiv preprint arXiv:2209.03003.
    3. Building normalizing flows with stochastic interpolants
       Albergo, M.S. and Vanden-Eijnden, E., 2022. arXiv preprint arXiv:2209.15571.
    4. Stochastic interpolants: A unifying framework for flows and diffusions
       Albergo, M.S., Boffi, N.M. and Vanden-Eijnden, E., 2023. arXiv preprint arXiv:2303.08797.
    5. Denoising diffusion implicit models
       Song, J., Meng, C. and Ermon, S., 2020. arXiv preprint arXiv:2010.02502.
    6. Score-based generative modeling through stochastic differential equations
       Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. and Poole, B., 2020. arXiv preprint arXiv:2011.13456.
    7. Understanding diffusion objectives as the ELBO with simple data augmentation
       Kingma, D. and Gao, R., 2024. Advances in Neural Information Processing Systems, Vol 36.
    8. Diffusion is spectral autoregression
       Dieleman, S., 2024.
    9. Scaling rectified flow transformers for high-resolution image synthesis
       Esser, P., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F. and others, 2024. Forty-first International Conference on Machine Learning.
    10. Elucidating the design space of diffusion-based generative models
       Karras, T., Aittala, M., Aila, T. and Laine, S., 2022. Advances in Neural Information Processing Systems, Vol 35, pp. 26565--26577.
    11. Knowledge distillation in iterative generative models for improved sampling speed
       Luhman, E. and Luhman, T., 2021. arXiv preprint arXiv:2101.02388.
    12. Denoising diffusion probabilistic models
       Ho, J., Jain, A. and Abbeel, P., 2020. Advances in Neural Information Processing Systems, Vol 33, pp. 6840--6851.
    13. Progressive Distillation for Fast Sampling of Diffusion Models
       Salimans, T. and Ho, J., 2022. International Conference on Learning Representations.
    14. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models
       Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C. and Zhu, J., 2022. arXiv preprint arXiv:2211.01095.
    15. Se(3)-stochastic flow matching for protein backbone generation
       Bose, A.J., Akhound-Sadegh, T., Huguet, G., Fatras, K., Rector-Brooks, J., Liu, C., Nica, A.C., Korablyov, M., Bronstein, M. and Tong, A., 2023. arXiv preprint arXiv:2310.02391.
    For attribution in academic contexts, please cite this work as
    @inproceedings{gao2025diffusionmeetsflow,
    author = {Gao, Ruiqi and Hoogeboom, Emiel and Heek, Jonathan and Bortoli, Valentin De and Murphy, Kevin P. and Salimans, Tim},
    title = {Diffusion Meets Flow Matching: Two Sides of the Same Coin},
    year = {2024},
    url = {https://diffusionflow.github.io/}
    }