
DDP: Diffusion Model for Dense Visual Prediction

Yuanfeng Ji1∗, Zhe Chen3∗, Enze Xie2†, Lanqing Hong2, Xihui Liu1,

Zhaoqiang Liu2, Tong Lu3, Zhenguo Li2, Ping Luo1

1The University of Hong Kong    2Huawei Noah’s Ark Lab    3Nanjing University   

https://github.com/JiYuanFeng/DDP
Abstract

We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a “noise-to-map” generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks; without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts, e.g., semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research.

* Equal contribution.  † Corresponding author.

1 Introduction

Dense prediction tasks are the foundation of computer vision research, including a wide range of perceptual tasks such as semantic segmentation [21, 99], depth estimation [31, 70, 74], and optical flow [29, 31]. These tasks require correctly predicting the discrete labels or continuous values for all pixels in the image, which provides detailed contextual understanding and enables various applications.

Numerous methods have rapidly improved the results of perception tasks over a short period of time. In general terms, these methods can be divided into two paradigms: discriminative-based [30, 96, 85, 18] and generative-based [84, 34, 39, 46, 88]. The former approach, which directly learns the mapping between input-output pairs and predicts in a single forward step, has become the current de-facto choice due to its simplicity and efficiency. In contrast, generative models aim at modeling the underlying distribution of the data and conceptually have a greater capacity to handle challenging tasks. However, they are often restricted by complex architecture customization as well as various training difficulties [67, 42, 6].

Figure 1: Conditional diffusion pipeline for dense visual predictions. Specifically, a conditional diffusion model is employed, where $q$ is the forward diffusion process and $p_{\theta}$ is the reverse process. The framework iteratively transforms the noise sample $\bm{y}_{T}$, drawn from a standard Gaussian distribution, into the desired target prediction $\bm{y}_{0}$ under the guidance of the input image $\bm{x}$.

These challenges have been largely addressed by the diffusion and score-based models [35, 71, 75]. The solutions, based on the denoising diffusion process, are conceptually simple: they apply a continuous diffusion process to transform data into noise and generate new samples by simulating the time-reversed diffusion process. These methods now enable easy training and achieve superior results on various generative tasks [57, 65, 63, 60]. Witnessing these great successes, there has been a recent surge of interest in introducing diffusion models to dense prediction tasks, including semantic segmentation [1, 14, 82, 81] and depth estimation [68]. However, these methods simply transfer the heavy frameworks from image generation tasks to dense prediction, resulting in low efficiency, slow convergence, and sub-optimal performance.

In this paper, we introduce a general, simple, yet effective diffusion framework for dense visual prediction. Our method, named DDP, extends the denoising diffusion process into the modern perception pipeline effectively (see Figure 2). During training, Gaussian noise controlled by a noise schedule [58] is added to the encoded ground truth to obtain the noisy maps. Then these noisy maps are fused with the conditional features from the image encoder, e.g., Swin Transformer [52]. Finally, these fused features are fed to a lightweight map decoder to produce the predictions without noise. At the inference phase, DDP generates predictions by reversing the learned diffusion process, which adjusts a noisy Gaussian distribution to the learned map distribution under the guidance of the test images (see Figure 1).

Figure 2: The proposed DDP framework. The image encoder extracts a feature representation from the input image $\bm{x}$ as the condition. The map decoder takes the noisy map $\bm{y}_{t}$ as input and produces the denoised prediction under this guidance. During training, the noisy map $\bm{y}_{t}$ is constructed by adding Gaussian noise to the encoded ground truth. In inference, the noisy map $\bm{y}_{t}$ is randomly sampled from the Gaussian distribution and iteratively refined to obtain the desired prediction $\bm{y}_{0}$.

Compared to previous cumbersome diffusion perception models [82, 81, 68], DDP decouples the image encoder and map decoder. The image encoder runs only once, while the diffusion process is performed only in the lightweight decoder head. With this efficient design, our proposed method can easily be applied to modern perception tasks. Furthermore, unlike previous single-step discriminative models, DDP is capable of performing iterative inference multiple times using the shared parameters and exhibits the following appealing properties: (1) dynamic inference to trade off computation and prediction quality and (2) natural awareness of the prediction uncertainty.

We evaluate DDP on three representative dense prediction tasks, including semantic segmentation, BEV map segmentation, and depth estimation, using six popular datasets (ADE20K [99], Cityscapes [21], nuScenes [7], KITTI [31], NYU-DepthV2 [70], and SUN RGB-D [74]). Our experimental results demonstrate that DDP significantly outperforms existing state-of-the-art methods. Specifically, on ADE20K, DDP achieves 46.1 mIoU with a single sampling step, which is significantly better than UperNet [83] and K-Net [95]. On nuScenes, DDP yields an mIoU of 70.3, which is clearly better than the BEVFusion [54] baseline that achieves an mIoU of 62.7. Furthermore, by increasing the sampling steps, DDP can achieve even higher performance on both ADE20K and nuScenes, reaching an mIoU of 47.0 and 70.6, respectively. Moreover, the gains hold across different model architectures as well as model sizes: DDP achieves 83.9 mIoU on Cityscapes with the ConvNeXt-L backbone and produces a leading REL of 0.05 on KITTI with the Swin-L backbone.

Overall, our contributions in this work are three-fold.

  • We formulate the dense visual prediction tasks as a general conditional denoising process, with simple yet highly effective designs.
  • Our “noise-to-map” generative paradigm offers several appealing properties, such as the ability to perform dynamic inference and uncertainty awareness.
  • We conduct extensive experiments on three representative tasks with six diverse benchmarks. The results demonstrate that our method, which we refer to as DDP, achieves competitive performance when compared to previous discriminative methods.

2 Related Work

Diffusion Model.

Diffusion [35, 71] and score-based generative models [73] have been particularly successful as generative models and achieve impressive results across various modalities, including images [60, 66, 27, 57, 25], video [36, 37], audio [43], and biomedical data [2, 77, 69, 22]. Given the notable achievements of diffusion models in these domains, leveraging them to develop generation-based perceptual models is a highly promising avenue for pushing the boundaries of perception tasks.

Dense Prediction.

The perception of real-world scenes via pixel-by-pixel classification or regression is commonly formulated as dense prediction tasks, such as semantic segmentation [21, 99], depth estimation [31, 70, 74], and optical flow [29, 31]. Numerous methods have emerged and achieved tremendous progress, and these advances can be roughly divided into: multi-scale feature aggregation [9, 10, 83], high-capacity backbones [85, 97, 61], and powerful decoder heads [76, 95, 19, 40]. In this paper, as shown in Figure 1, we differ from previous discriminative-based methods and explore a generative “noise-to-map” paradigm for general dense prediction tasks.

Diffusion Models for Dense Prediction.

With the recent success of diffusion models in generation tasks, there has been a noticeable rise in interest in incorporating them into dense visual prediction tasks. Several pioneering works [82, 1, 81, 14, 68, 12] attempted to apply the diffusion model to visual perception tasks, e.g., image segmentation or depth estimation. For example, Wolleb et al. [81] explore the diffusion model for medical image segmentation. Pix2Seq-D [14] applies the bit diffusion model [16] to panoptic segmentation. Our concurrent work DepthGen [68] introduces the diffusion pipeline to the task of depth estimation. For all the diffusion models listed above, one or two parameter-heavy convolutional U-Nets [64] are adopted, leading to low efficiency, slow convergence, and sub-optimal performance. In this work, as illustrated in Figure 2, we introduce a simple yet effective diffusion framework, which extends the denoising diffusion process into the modern perception pipeline while maintaining accuracy and efficiency.

3 Methodology

3.1 Preliminaries

Dense Prediction.

The objective of dense prediction tasks is to predict discrete labels or continuous values, denoted as $\bm{y}$, for every pixel present in the input image $\bm{x}\in\mathbb{R}^{3\times h\times w}$.

Conditional Diffusion Model.

The conditional diffusion model, which is an extension of the diffusion model [35, 71, 75], belongs to the category of likelihood-based models inspired by non-equilibrium thermodynamics. The conditional diffusion model assumes a forward noising process by gradually adding noise to the data sample, which is defined as:

q(\bm{z}_{t}\mid\bm{z}_{0})=\mathcal{N}\left(\bm{z}_{t};\sqrt{\bar{\alpha}_{t}}\,\bm{z}_{0},\,(1-\bar{\alpha}_{t})\mathbf{I}\right), \qquad (1)

which transforms the data sample $\bm{z}_{0}$ to a latent noisy sample $\bm{z}_{t}$ for $t\in\{0,1,\ldots,T\}$. The constants $\bar{\alpha}_{t}:=\prod_{s=0}^{t}\alpha_{s}=\prod_{s=0}^{t}(1-\beta_{s})$, where $\beta_{s}$ represents the noise schedule [58, 35]. During training, the reverse process model $f_{\theta}(\bm{z}_{t},\bm{x},t)$ is trained to predict $\bm{z}_{0}$ from $\bm{z}_{t}$ under the guidance of the condition $\bm{x}$ by minimizing the training objective function (i.e., the $l_{2}$ loss). At the inference stage, the predicted data sample $\bm{z}_{0}$ is reconstructed from random noise $\bm{z}_{T}$ with the model $f_{\theta}$, the conditional input $\bm{x}$, and a translation rule [35, 72] in a Markovian way, i.e., $\bm{z}_{T}\rightarrow\bm{z}_{T-\Delta}\rightarrow\ldots\rightarrow\bm{z}_{0}$, which can be formulated as:

p_{\theta}(\bm{z}_{0:T}\mid\bm{x})=p(\bm{z}_{T})\prod_{t=1}^{T}p_{\theta}(\bm{z}_{t-1}\mid\bm{z}_{t},\bm{x}). \qquad (2)

In this paper, our goal is to solve dense prediction tasks via the conditional diffusion model. In our setting, the data samples are the ground truth map $\bm{z}_{0}=\bm{y}$, and a neural network $f_{\theta}$ is trained to predict $\bm{z}_{0}$ from random noise $\bm{z}_{t}\sim\mathcal{N}(0,\mathbf{I})$, conditioned on the corresponding image $\bm{x}$.
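To make Eq. (1) concrete, the following is a minimal NumPy sketch of the forward corruption $q(\bm{z}_{t}\mid\bm{z}_{0})$; the cumulative coefficient is passed in directly here, and the concrete noise schedule that produces it is discussed in Section 3.3 (shapes and names are illustrative, not the released implementation).

import numpy as np

def q_sample(z0, a_bar, eps=None):
    """Draw z_t ~ N(sqrt(a_bar) * z0, (1 - a_bar) * I), following Eq. (1)."""
    eps = np.random.randn(*z0.shape) if eps is None else eps
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

z0 = np.random.randn(2, 256, 64, 64)  # an encoded ground-truth map (illustrative shape)
zt = q_sample(z0, a_bar=0.5)          # roughly half signal, half noise at this a_bar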

3.2 Architecture

Since the diffusion model generates samples progressively, it requires multiple runs of the model at the inference stage. Previous methods [82, 68, 81] apply the model $f_{\theta}$ in multiple steps on the raw image $\bm{x}$, which significantly increases the computational overhead. To alleviate this issue, we separate the entire model into two parts: image encoder and map decoder, as shown in Figure 2. The image encoder forwards only once to extract the feature map from the input image $\bm{x}$. The map decoder then employs it as the condition, rather than the raw image $\bm{x}$, to gradually refine the prediction from the noisy map $\bm{y}_{t}$.

Image Encoder.

The image encoder receives the raw image $\bm{x}$ as input and generates multi-scale features at 4 different resolutions. Subsequently, these multi-scale features are fused using the FPN [51] and aggregated by a 1$\times$1 convolution. The produced feature map, with the resolution of $256\times\frac{h}{4}\times\frac{w}{4}$, is employed as the condition for the map decoder. In contrast to the previous methods [1, 82, 68], DDP is able to work with modern network architectures such as ConvNeXt [53] and Swin Transformer [52].
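As a rough illustration of this conditioning branch, the sketch below resizes the multi-scale features to 1/4 resolution, fuses them by summation (a simplification of the FPN top-down pathway), and aggregates them with a 1×1 convolution; the channel sizes are assumptions for a Swin-T-like backbone.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionNeck(nn.Module):
    def __init__(self, in_dims=(96, 192, 384, 768), out_dim=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in in_dims])
        self.fuse = nn.Conv2d(out_dim, out_dim, 1)   # final 1x1 aggregation

    def forward(self, feats):                        # feats: 4 maps at strides 4, 8, 16, 32
        target = feats[0].shape[-2:]                 # 1/4 resolution
        fused = sum(F.interpolate(l(f), size=target, mode="bilinear", align_corners=False)
                    for l, f in zip(self.lateral, feats))
        return self.fuse(fused)                      # [b, 256, h/4, w/4] conditional feature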

Map Decoder.

The map decoder $f_{\theta}$ takes as input the noisy map $\bm{y}_{t}$ and the feature map from the image encoder via concatenation, and performs a pixel-by-pixel classification or regression. Following the common practice [18, 100, 93] in modern perception pipelines, we simply stack six layers of deformable attention as the map decoder. Compared to previous works [1, 82, 68, 14, 81] that use parameter-intensive U-Nets, our map decoder is lightweight and compact, allowing efficient reuse of the shared parameters during the multi-step reverse diffusion process.
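A minimal sketch of such a decoder head is given below; for brevity, standard multi-head self-attention stands in for the deformable attention used in the paper, and the time-step embedding is omitted, so this is only a structural illustration rather than the released implementation.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                               # x: [b, h*w, dim]
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))

class MapDecoder(nn.Module):
    """Fuses the noisy map y_t with the conditional feature map and predicts per-pixel logits."""
    def __init__(self, cond_dim=256, map_dim=256, num_classes=150, depth=6):
        super().__init__()
        self.proj = nn.Conv2d(cond_dim + map_dim, 256, 1)   # concat then 1x1 fusion
        self.blocks = nn.ModuleList([DecoderBlock(256) for _ in range(depth)])
        self.head = nn.Conv2d(256, num_classes, 1)          # pixel-by-pixel classification

    def forward(self, y_t, cond):
        x = self.proj(torch.cat([y_t, cond], dim=1))        # [b, 256, h/4, w/4]
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # [b, h*w, 256]
        for blk in self.blocks:
            tokens = blk(tokens)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(x)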

def train(images, maps):
    """images: [b, 3, h, w], maps: [b, 1, h, w]"""
    img_enc = image_encoder(images)  # encode image
    map_enc = encoding(maps)  # encode gt
    map_enc = (sigmoid(map_enc) * 2 - 1) * scale
    # corrupt gt
    t, eps = uniform(0, 1), normal(mean=0, std=1)
    map_crpt = (sqrt(alpha_cumprod(t)) * map_enc +
                sqrt(1 - alpha_cumprod(t)) * eps)
    # predict and backward
    map_pred = map_decoder(map_crpt, img_enc, t)
    loss = objective_func(map_pred, maps)
    return loss
Algorithm 1 DDP Training

3.3 Training

During training, we first construct a diffusion process from the ground truth $\bm{y}$ to the noisy map $\bm{y}_{t}$ and then train the model to reverse this process. The training procedure for DDP is provided in Algorithm 1 (for more details please refer to Appendix A).

Label Encoding.

Standard diffusion models assume continuous data, which makes them a convenient choice for regression tasks with continuous values (e.g., depth estimation). However, existing studies [14, 16] show that they are unsuitable for discrete labels (e.g., semantic segmentation). Therefore, we explore several encoding strategies for the discrete labels, including: (1) One-hot encoding, which represents categorical labels as binary vectors of 0 and 1; (2) Analog bits encoding [14], which first converts discrete integers into bit strings, and then casts them as real numbers; (3) Class embedding, which uses a learnable embedding layer to project discrete labels into a high-dimensional continuous space, with a sigmoid function for normalization. For all of these strategies, we normalize and scale the range of encoded labels within $[-\mathrm{scale}, +\mathrm{scale}]$, as shown in Algorithm 1. Notably, the scaling factor $\mathrm{scale}$ controls the signal-to-noise ratio (SNR) [14, 13], which is an important hyper-parameter for diffusion models. We compare these strategies in Table 5a and find that class embedding works best. More discussions are in Section 4.5.

Map Corruption.

We add Gaussian noise to corrupt the encoded ground truth, obtaining the noisy map $\bm{y}_{t}$. As shown in Equation 1, the intensity of the corruption noise is controlled by $\alpha_{t}$, which adopts a monotonically decreasing schedule over the different time steps $t\in[0,1]$. Different noise scheduling strategies, including the cosine schedule [58] and the linear schedule [35], are compared and discussed in Section 4.5. We found that the cosine schedule usually works best in our benchmark tasks.
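For reference, the two schedules compared in Section 4.5 can be written as functions of a continuous time step t in [0, 1]; the constants below (s = 0.008, a linear beta range of 1e-4 to 2e-2 over 1000 steps) are common defaults and only assumptions about the exact configuration.

import numpy as np

def cosine_alpha_cumprod(t, s=0.008):
    """Cosine schedule of [58]: alpha_bar(t) decays gradually over t in [0, 1]."""
    f = lambda u: np.cos((u + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0.0)

def linear_alpha_cumprod(t, beta_start=1e-4, beta_end=2e-2, T=1000):
    """Linear beta schedule of [35], converted to the cumulative product alpha_bar(t)."""
    betas = np.linspace(beta_start, beta_end, T)
    a_bar = np.cumprod(1.0 - betas)
    return a_bar[min(int(t * T), T - 1)]

for t in (0.1, 0.5, 0.9):  # the cosine schedule retains more signal at intermediate steps
    print(t, cosine_alpha_cumprod(t), linear_alpha_cumprod(t))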

Objective Function.

Standard diffusion models are trained with the $l_{2}$ loss, which is reasonable for dense prediction tasks, but we found that adopting a task-specific loss works better for supervision, e.g., cross-entropy loss for semantic segmentation and sigloss for depth estimation.

def sample(images, steps, td=1):
    """steps: sample steps, td: time difference"""
    img_enc = image_encoder(images)
    map_t = normal(0, 1)  # [b, 256, h/4, w/4]
    for step in range(steps):
        # time intervals
        t_now = 1 - step / steps
        t_next = max(1 - (step + 1 + td) / steps, 0)
        # predict map_0 from map_t
        map_pred = map_decoder(map_t, img_enc, t_now)
        # estimate map_t at t_next
        map_t = ddim(map_t, map_pred, t_now, t_next)
    return map_pred
Algorithm 2 DDP Sampling

3.4 Inference

Given a test image as the conditional input, the model starts with a random noise map sampled from a Gaussian distribution and gradually refines the prediction. We summarize the inference procedure in Algorithm 2.

Sampling Rule.

We choose the DDIM update rule [72] for sampling. In each sampling step $t$, the random noise $\bm{y}_{T}$ or the predicted noisy map $\bm{y}_{t+1}$ from the last step is fused with the conditional feature map and sent to the map decoder $f_{\theta}$ for map prediction. After obtaining the predicted result of the current step, we compute the noisy map $\bm{y}_{t}$ for the next step using the reparameterization trick. Following [15, 14, 12], we use asymmetric time intervals (controlled by a hyper-parameter $td$) during the inference stage, and $td=1$ works best in our method.
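For completeness, a minimal sketch of the ddim() update used in Algorithm 2 is given below (the deterministic DDIM rule of [72], i.e., eta = 0); alpha_cumprod is the cumulative noise-schedule function, and the exact form used in the released code is an assumption.

import numpy as np

def ddim(map_t, map_pred, t_now, t_next, alpha_cumprod):
    """One reverse step: map_pred is the decoder's estimate of the clean encoded map."""
    a_now, a_next = alpha_cumprod(t_now), alpha_cumprod(t_next)
    # noise implied by the current estimate of the clean map
    eps = (map_t - np.sqrt(a_now) * map_pred) / np.sqrt(1.0 - a_now)
    # re-noise the estimate to the less noisy next time step; at t_next = 0 this returns map_pred
    return np.sqrt(a_next) * map_pred + np.sqrt(1.0 - a_next) * eps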

Sampling Drift.

As displayed in Figure 3a, we empirically observe that the model performance improves within a few sampling steps and then declines slightly as the number of steps increases. Similar observations can also be found in [12, 11, 68]. This performance decline can be attributed to the “sampling drift” challenge, which refers to the discrepancy between the distributions of training and sampling data. During training, the model is trained to invert the noisy ground truth map, while during testing, the model is asked to remove noise from its “imperfect” prediction, which drifts away from the underlying corrupted distributions. This drift becomes pronounced at smaller time steps $t$, owing to the compounded errors, and is further intensified when a sample deviates more substantially from the distribution of the ground truth [24].

To verify our hypothesis, in the last 5k iterations of training, we construct $\bm{y}_{t}$ using the model’s prediction rather than the ground truth. This approach transforms the training target into removing the noise added to its own predictions, thereby aligning the data distributions of training and testing. We name this approach “self-aligned denoising.” As revealed in Figure 3a, this approach tends to produce saturation instead of performance degradation. Our findings suggest that incorporating the diffusion process into perception tasks can be far more efficient than in image generation (e.g., about 50 DDIM steps for image generation). In other words, the proposed DDP can improve efficiency (e.g., satisfactory results in 3 iterative steps) while retaining the benefits of the diffusion model. More discussions can be found in Appendix A.
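A sketch of this fine-tuning stage, written in the pseudo-code style of Algorithm 1, is shown below; how the model’s own prediction is obtained and re-encoded before corruption is not spelled out here, so those steps (a no-gradient forward pass from pure noise, followed by encoding the predicted labels) are assumptions.

def train_self_aligned(images, maps):
    """Self-aligned denoising for the last 5k iterations (sketch)."""
    img_enc = image_encoder(images)
    t, eps = uniform(0, 1), normal(mean=0, std=1)
    # obtain an "imperfect" prediction without gradients (assumed single forward pass)
    map_init = stop_gradient(map_decoder(normal(mean=0, std=1), img_enc, 1))
    map_enc = (sigmoid(encoding(argmax(map_init))) * 2 - 1) * scale
    # corrupt the model's own prediction instead of the encoded ground truth
    map_crpt = (sqrt(alpha_cumprod(t)) * map_enc +
                sqrt(1 - alpha_cumprod(t)) * eps)
    map_pred = map_decoder(map_crpt, img_enc, t)
    return objective_func(map_pred, maps)  # supervision still uses the ground truth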

Multiple Inference.

By virtue of the multi-step sampling procedure, our method supports dynamic inference, which has the flexibility to trade compute for prediction quality. Besides, it naturally enables the assessment of the reliability and uncertainty of model predictions.

4 Experiment

We first present the appealing properties of our DDP, followed by empirical evaluations of its performance against leading methods on several representative tasks, including semantic segmentation, BEV map segmentation, and monocular depth estimation. Finally, we provide ablation studies on the DDP components. Due to space limitations, more implementation details and experimental results are provided in Appendix B and Appendix C, respectively.

4.1 Main Properties

We explore and show properties of DDP in Figure 3 using the default setting in Section 4.2. With such a multi-step sampling procedure, we have the flexibility to trade computational cost for prediction quality. Furthermore, the stochastic sampling process allows the computing of pixel-wise uncertainty maps of the prediction.

(a) Dynamic inference. The results of multiple inference on Cityscapes.
(b) Inference trajectory. Predicted mask results at different time steps.
(c) Uncertainty awareness. High-response areas in the uncertainty map indicate high estimated uncertainty and are highly positively correlated with white areas in the error map, which indicate misclassified points. Zoom in for better visualization.
Figure 3: DDP enjoys two appealing properties: dynamic inference to trade off computation and prediction quality, and natural awareness of the prediction uncertainty.

Dynamic Inference.

We evaluate DDP with ConvNext-T and ConvNext-L backbones by increasing their sampling steps from 1 to 10. The results are presented in Figure 3a. It can be seen that the DDP can continuously improve its performance by using more sampling steps. For example, DDP with ConvNext-T shows an increase from 82.33 mIoU (1 step) to 82.60 mIoU (3 steps), and we visualize the inference trajectory in Figure 3b. In comparison to the previous single-step method, our approach boasts the flexibility to balance computational cost against accuracy. This means our method can be adapted to different trade-offs between speed and accuracy under various scenarios without the need to retrain the network.

Uncertainty Awareness.

In addition to the performance gains, the proposed DDP can naturally provide uncertainty estimates. In the multi-step sampling process, we simply count the pixels where the predicted result of each step differs from that of the previous step, and finally normalize this change-count map to [0, 1] to obtain an uncertainty map. In comparison, DDP is naturally and easily capable of estimating uncertainty, whereas previous methods [55, 33] require complicated modeling such as Bayesian networks.
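This change-counting procedure can be written in a few lines; the sketch below assumes per-step class logits from the segmentation head (variable names are illustrative).

import torch

def uncertainty_map(step_logits):
    """step_logits: list of [b, num_classes, h, w] predictions, one per sampling step."""
    labels = [logits.argmax(dim=1) for logits in step_logits]      # [b, h, w] hard labels
    change = torch.zeros_like(labels[0], dtype=torch.float32)
    for prev, cur in zip(labels[:-1], labels[1:]):
        change += (cur != prev).float()                            # pixel flipped at this step
    return change / max(len(labels) - 1, 1)                        # normalized to [0, 1]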

4.2 Semantic Segmentation

Datasets.

We evaluate the proposed DDP using two widely used datasets: ADE20K [99] and Cityscapes [21]. ADE20K is a large-scale scene parsing dataset with over 20,000 images, and Cityscapes is a street scene dataset with high-quality pixel-level annotations for 5,000 images.

Settings.

In the training phase, following common practices [80, 17, 85, 79], the crop size is set to 512×512 for ADE20K and 512×1024 for Cityscapes. We optimize our DDP models using the AdamW [56] optimizer, with an initial learning rate of $6\times10^{-5}$ and a weight decay of 0.01. All models are trained for 160k iterations and compared fairly with previous non-diffusion methods.

Method Backbone #Param FLOPs mIoU +MS
UperNet [83] Swin-T 60M 236G 44.5 45.8
Region Rebalance [23] Swin-T 60M 236G 45.0 46.5
MaskFormer [19] Swin-T 42M 55G 46.7 48.8
Mask2Former [18] Swin-T 47M 74G 47.7 49.6
K-Net [95] Swin-T 73M 256G 45.8 46.3
SenFormer [5] Swin-T 144M 179G 46.0 46.4
Non-diffusion Baseline Swin-T 35M 111G 44.9 46.1
DDP (step 1) Swin-T 40M 113G 46.1 47.6
DDP (step 3) Swin-T 40M 252G 47.0 47.8
UperNet [83] Swin-S 81M 259G 47.6 49.5
DDP (step 1) Swin-S 61M 136G 48.4 49.7
DDP (step 3) Swin-S 61M 276G 48.7 49.7
UperNet [83] Swin-B 121M 297G 48.1 49.7
DDP (step 1) Swin-B 99M 173G 49.2 50.8
DDP (step 3) Swin-B 99M 312G 49.4 50.8
UperNet [83] Swin-L 234M 411G 52.1 53.5
DDP (step 1) Swin-L 207M 285G 53.1 54.4
DDP (step 3) Swin-L 207M 425G 53.2 54.4
Table 1: Semantic segmentation on the ADE20K val set. We report single-scale (SS) and multi-scale (MS) mIoU. The FLOPs are measured with 512×512 inputs. Backbones pre-trained on ImageNet-22K are marked with .

Results on ADE20K.

Table 1 presents the semantic segmentation performance of DDP on ADE20K [99], which shows that our method consistently outperforms many representative methods [83, 23, 95, 5] and the non-diffusion baseline across different backbones. For instance, when using Swin-T [52] as the backbone, our DDP (step 1) yields a promising result of 46.1 mIoU, surpassing the non-diffusion baseline (DDP w/o diffusion process) by 1.2 points (46.1 vs. 44.9). Moreover, our DDP (step 3) can further enhance the performance to 47.0 mIoU, attaining a remarkable gain of 0.9 points by multi-steps of denoising diffusion. With the Swin-L backbone, our DDP (step 3) achieves the best performance of 53.2 mIoU, which is 1.1 points (53.2 vs. 52.1) better than UperNet with comparable FLOPs. These results suggest that our DDP not only achieves a performance gain but also offers more flexibility than previous methods.

Method Backbone #Param FLOPs mIoU +MS
Segmenter [76] ViT-L 333M 2685G 79.10 81.30
SETR-PUP [97] ViT-L 318M 2955G 79.34 82.15
StructToken [50] ViT-L 364M 2913G 80.05 82.07
OCRNet [91, 92] HRFormer-B 56M 2240G 81.90 82.60
SegFormer-B5 [85] MiT-B5 85M 1448G 82.25 83.48
DiversePatch [32] Swin-L 234M 3190G 82.70 83.60
Mask2Former [18] Swin-L 216M 2113G 83.30 84.30
DDP (step 1) Swin-T 39M 885G 80.96 82.25
DDP (step 3) Swin-T 39M 1992G 81.24 82.46
DDP (step 1) Swin-S 61M 1067G 82.17 83.06
DDP (step 3) Swin-S 61M 2174G 82.41 83.21
DDP (step 1) Swin-B 99M 1357G 82.37 83.36
DDP (step 3) Swin-B 99M 2464G 82.54 83.42
DDP (step 1) ConvNeXt-T 40M 883G 82.33 83.00
DDP (step 3) ConvNeXt-T 40M 1989G 82.60 83.15
DDP (step 1) ConvNeXt-S 62M 1059G 82.37 83.38
DDP (step 3) ConvNeXt-S 62M 2166G 82.69 83.58
DDP (step 1) ConvNeXt-B 100M 1340G 82.59 83.47
DDP (step 3) ConvNeXt-B 100M 2447G 82.78 83.49
DDP (step 1) ConvNeXt-L 209M 2139G 82.95 83.76
DDP (step 3) ConvNeXt-L 209M 3245G 83.21 83.92
Table 2: Semantic segmentation on the Cityscapes val set. We report single-scale (SS) and multi-scale (MS) mIoU. The FLOPs are measured with 1024×2048 inputs. Backbones pre-trained on ImageNet-22K are marked with .

Results on Cityscapes.

We compare our DDP with various representative models on Cityscapes [21] in Table 2, such as Segmenter [76], SETR [97], SegFormer [85], DiversePatch [32], and Mask2Former [18]. As shown, we conduct extensive experiments based on ConvNeXt [53] and Swin [52] with different model sizes. When using ConvNeXt-L as the backbone, our DDP (step 1) produces a competitive result of 82.95 mIoU, and it can be further boosted to 83.21 mIoU (step 3). This phenomenon is also observed when taking Swin-T as the backbone, where the mIoU increases from 80.96 to 81.24 through 2 additional sampling steps. These experimental results demonstrate the scalability of our methodology, which can be applied to different model structures of arbitrary size. Moreover, once again, the experimental results show that DDP achieves progressive improvements through multi-step denoising diffusion while keeping a comparable computational overhead.

Discussion.

The original intention of DDP is to design a diffusion-based general framework for various dense prediction tasks. Although its segmentation performance is slightly lower than its specialized counterpart Mask2Former [18], it remains highly competitive and has several attractive features. How to design a segmentation-specific diffusion framework to achieve better performance than Mask2Former is left for future research.

4.3 BEV Map Segmentation

Dataset.

We conduct our experiments of BEV map segmentation on the nuScenes [7] dataset. It is a large-scale autonomous driving perception dataset, which includes over 1000 urban road scenes covering different time periods and weather conditions in two cities, Boston and Singapore.

Settings.

We further verify the DDP framework on the BEV map segmentation task. Specifically, we equip our method with the representative method BEVFusion [54], where we directly replace its segmentation head with the proposed map decoder for the diffusion process. We follow evaluation protocol from [54] and compare the results with state-of-the-art methods [86, 89, 54, 4]. We report the IoU of 6 background classes, including drivable space (Dri), pedestrian crossing (Ped), walk-way (Wal), stop line (Sto), car-parking area (Car), and lane divider (Div), and use the mean IoU as the primary evaluation metric. Other training settings are kept the same as [54] for fair comparisons.

Results.

We show the results of our BEV map segmentation experiments in Table 3, which exhibit the superior performance of our approach over existing state-of-the-art methods. Specifically, in the camera-only scenario, our DDP (step 1) attains a 59.3 mIoU score on the nuScenes validation dataset, which surpasses the previous best method X-Align [4] by 1.3 mIoU (59.3 vs. 58.0). By iteratively refining the output of the model, DDP (step 3) sets a new state-of-the-art record of 59.4 mIoU based solely on the camera modality. In the multi-modality setting, we improve the segmentation results of our DDP (step 1) to 70.3 mIoU by incorporating LiDAR information, significantly higher than the current state-of-the-art methods [54, 4] by at least 4.6 mIoU. Remarkably, this performance can be further enhanced to a maximum of 70.6 mIoU by leveraging the benefits of iterative denoising diffusion. In summary, these results demonstrate that DDP can be easily generalized to other tasks and obtain performance gains, proving the effectiveness and generalization of our approach.

Method Modality Dri Ped Wal Sto Car Div Mean
OFT [62] C 74.0 35.3 45.9 27.5 35.9 33.9 42.1
LSS [59] C 75.4 38.8 46.3 30.3 39.1 36.5 44.4
CVT [98] C 74.3 36.8 39.9 25.8 35.0 29.4 40.2
M2BEV [86] C 77.2 - - - - 40.5 -
BEVFusion [54] C 81.7 54.8 58.4 47.4 50.7 46.4 56.6
X-Align [4] C 82.4 55.6 59.3 49.6 53.8 47.4 58.0
DDP (step 1) C 83.2 58.5 61.6 52.4 51.1 48.9 59.3
DDP (step 3) C 83.6 58.3 61.8 52.3 51.4 49.2 59.4
PointPainting [78] C+L 75.9 48.5 57.1 36.9 34.5 41.9 49.1
MVP [89] C+L 76.1 48.7 57.0 36.9 33.0 42.2 49.0
BEVFusion [54] C+L 85.5 60.5 67.6 52.0 57.0 53.7 62.7
X-Align [4] C+L 86.8 65.2 70.0 58.3 57.1 58.2 65.7
DDP (step 1) C+L 89.3 69.5 74.8 62.5 63.5 62.3 70.3
DDP (step 3) C+L 89.4 69.8 75.0 63.0 63.8 62.6 70.6
Table 3: BEV map segmentation on the nuScenes val set. We report the IoU of 6 background classes and the mean IoU. “C” and “L” denote the camera modality and LiDAR modality, respectively.

4.4 Depth Estimation

Datasets.

We evaluate the depth estimation performance of DDP on three prominent datasets, namely KITTI [31], NYU-DepthV2 [70], and SUN RGB-D [74]. (1) The KITTI dataset encompasses stereo image pairs and corresponding ground truth depth maps for outdoor scenes captured by a car-mounted camera. Following common practices [28, 48], we use about 26K left-view images for training and 697 images for testing. (2) The NYU dataset contains RGB-Depth images of indoor scenes captured at a resolution of 640×480. Similar to prior research [48], the model is trained on 24K training images and evaluated on the reserved 652 images. (3) The SUN RGB-D dataset is a vast collection of around 10K indoor images. We employ it to evaluate the generalization abilities of our NYU pre-trained models. The results on KITTI are shown in the main paper, while others are provided in the supplementary material.

Settings.

We incorporate the DDP model into the codebase developed by [48] for depth estimation experiments. We excluded the discrete label encoding module as the task requires continuous value regression. All experimental settings are the same as [48] for a fair comparison.

Metrics.

Typically, the evaluation of depth estimation methods employs the following metrics: accuracy under threshold ($\delta_{i}<1.25^{i}, i=1,2,3$), mean absolute relative error (REL), mean squared relative error (SqRel), root mean squared error (RMSE), root mean squared log error (RMSE log), and mean log10 error (log10).
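For reference, these metrics can be computed as follows over the valid pixels of a predicted and ground-truth depth map (a NumPy sketch; masking of invalid pixels and any depth capping are omitted).

import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: positive depth values over valid pixels (flattened arrays)."""
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = [(ratio < 1.25 ** i).mean() for i in (1, 2, 3)]    # threshold accuracies
    rel = (np.abs(pred - gt) / gt).mean()                           # mean absolute relative error
    sq_rel = (((pred - gt) ** 2) / gt).mean()                       # mean squared relative error
    rmse = np.sqrt(((pred - gt) ** 2).mean())
    rmse_log = np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean())
    log10 = np.abs(np.log10(pred) - np.log10(gt)).mean()
    return dict(d1=d1, d2=d2, d3=d3, REL=rel, SqRel=sq_rel,
                RMSE=rmse, RMSE_log=rmse_log, log10=log10)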

Method Backbone δ1↑ δ2↑ δ3↑ REL↓ SqRel↓ RMSE↓ RMSE log↓
DORN [30] ResNet-101 0.932 0.984 0.994 0.072 0.307 2.727 0.120
VNL [90] ResNeXt-101 0.938 0.990 0.998 0.072 - 3.258 0.117
BTS [44] DenseNet-161 0.956 0.993 0.998 0.059 0.245 2.756 0.096
TransDepth [87] ResNet-50 + ViT-B 0.956 0.994 0.999 0.064 0.252 2.755 0.098
DPT [61] ResNet-50 + ViT-B 0.959 0.995 0.999 0.062 - 2.573 0.092
AdaBins [3] EfficientNet-B5 + Mini-ViT 0.964 0.995 0.999 0.058 0.190 2.360 0.088
DepthFormer [48] ResNet-50 + Swin-T 0.966 0.995 0.999 0.056 0.177 2.252 0.086
DepthFormer [48] ResNet-50 + Swin-L 0.975 0.997 0.999 0.052 0.158 2.143 0.079
BinsFormer [49] Swin-L 0.974 0.997 0.999 0.052 0.151 2.098 0.079
DepthGen (step 8)* [68] Efficient U-Net 0.953 0.991 0.998 0.064 0.356 2.985 0.100
DDP (step 3) Swin-T 0.969 0.996 0.999 0.054 0.168 2.172 0.083
DDP (step 3) Swin-S 0.970 0.996 0.999 0.053 0.167 2.171 0.082
DDP (step 3) Swin-B 0.973 0.997 0.999 0.051 0.155 2.119 0.078
DDP (step 3) Swin-L 0.975 0.997 0.999 0.050 0.148 2.072 0.076
Table 4: Depth estimation on the KITTI val set. Backbones pre-trained on ImageNet-22K are marked with . We report the performance of DDP with 3 diffusion steps. The best and second-best results are bolded or underlined, respectively. ↓ means lower is better, and ↑ means higher is better. * denotes the best results of our concurrent work [68].

Results.

Table 4 shows the depth estimation results on the KITTI dataset. We compare the proposed DDP models with state-of-the-art depth estimators. Specifically, we choose DepthFormer [48] and DepthGen [68] as our main competitors, in which DepthFormer is a strong counterpart that achieved leading performance, while DepthGen is a concurrent work of ours and is also a diffusion-based depth estimator. As we can observe, although the performance on this benchmark tends to be saturated, our DDP models still outperform all the competitors by clear margins in most metrics, such as REL, SqRel, and RMSE. For instance, equipped with Swin-L, our method achieves a state-of-the-art RMSE log of 0.076 with 3 steps of denoising diffusion. Compared with the concurrent diffusion-based model [68], we find that: (1) DDP outperforms DepthGen by clear margins, particularly in regard to the RMSE↓ metric (2.072 vs. 2.985), which can be attributed to the advanced pipeline design (e.g., Swin Transformer vs. U-Net). (2) DDP is more lightweight and efficient compared to DepthGen, as the denoising diffusion process occurs solely in the decoder head, whereas with DepthGen, the process occurs over the entire model.

Type mAcc mIoU
analog bits 57.6 46.2
one-hot 56.8 46.2
embedding 58.4 47.0
(a) Label encoding. We find class embedding works best.
Scale mAcc mIoU
0.001 56.6 45.4
0.01 58.4 47.0
0.02 57.5 46.8
0.04 56.8 45.9
0.1 55.0 44.0
(b) Scaling factor. The best scaling factor is 0.01.
Type mAcc mIoU
cosine 58.4 47.0
linear 56.3 45.1
(c) Noise schedule. Cosine works best.
L mAcc mIoU #Param
1 56.1 44.5 2.4M
2 56.5 45.0 3.6M
4 57.2 45.7 6.0M
6 58.4 47.0 8.4M
12 55.7 46.0 15.6M
(d) Decoder depth L. Six blocks work best.
Step mIoU FLOPs FPS
1 45.8 256G 18
1 46.1 113G 19
2 46.8 182G 15
3 47.0 252G 13
4 46.8 322G 11
(e) Accuracy vs. Efficiency. The first row (highlighted in yellow in the paper) denotes K-Net [95].
Table 5: DDP ablation experiments with Swin-T [52] on ADE20K semantic segmentation. We report the performance with 3 sampling steps in (a), (b), (c), and (d). If not specified, the default settings are: the label encoding strategy is class embedding, the scaling factor is set to 0.01, the noise schedule is cosine, and the map decoder has a depth of 6. Default settings are marked in gray.

4.5 Ablation Study

We conduct ablation studies on the ADE20K semantic segmentation. All models are trained using our DDP with Swin-T [52] backbone for 160k iterations. Other settings are the same as the settings in Section 4.2.

Label Encoding.

Since the labels of semantic segmentation are discrete, we need to encode them first. As shown in Table 5a, here we study the effect of three different strategies. For each of them, we search the optimal scaling factor. The results show that class embedding is a better strategy to encode semantic labels than one-hot and analog bits [14].

Signal Scale.

As shown in Table 5b, we search for the best scaling factor for the class embedding strategy. As can be seen, when we use a scaling factor larger than 0.01, the performance degrades significantly. This is because, with a larger scaling factor, more easy cases are retained at the same time step $t$. In addition, we found the best scaling factor (i.e., 0.01) for class embedding is typically smaller than that for analog bits [14] and one-hot encoding (i.e., 0.1).

Noise Schedule.

As shown in Table 5c, we compare the effectiveness of the cosine schedule [58] and linear schedule [35] in DDP for semantic segmentation, and find that the model using the cosine schedule achieves notably better performance (47.0 vs. 45.1). This is attributed to the cosine schedule’s mechanism of simulating the realistic scenario of gradually weakening signal influence, which prompts the model to learn stronger denoising capabilities, in contrast to the simple linear schedule.

Decoder Depth.

We study the effect of decoder depth in Table 5d and observe that the map decoder requires a suitable depth. Initially, the model accuracy improves as the depth increases, but eventually decreases. Therefore, we finally adopted a map decoder with 6 blocks, which only has 8.4M parameters. Overall, the map decoder is lightweight and efficient, compared with representative methods K-Net [95] (41.5M) and UperNet [83] (31.5M).

Accuracy vs. Efficiency.

We show the dynamic trade-off of DDP between accuracy and efficiency in Table 5e. Compared with the representative discriminative method K-Net [95], DDP yields a better mIoU when using only one sampling step, with fewer FLOPs and higher FPS. When adopting three sampling steps, the performance is further boosted to 47.0 mIoU, while maintaining comparable FLOPs and FPS. These results show that DDP can iteratively infer multiple times with reasonable time cost.

5 Conclusion

This paper introduced DDP, a simple, efficient, yet powerful framework for dense visual predictions based on conditional diffusion. It extends the denoising diffusion process into modern perception pipelines without requiring architectural customization or task-specific design. We demonstrate DDP’s effectiveness through state-of-the-art or competitive performance on three representative tasks and six diverse benchmarks. Moreover, it exhibits multiple-inference capability and uncertainty awareness, which contrasts with previous single-step discriminative methods. These results indicate that DDP can serve as an important baseline for future research on dense prediction tasks. One potential drawback of DDP is its non-negligible additional computational cost for multi-step inference. Besides, while DDP has demonstrated excellent improvements on several benchmark datasets for dense visual prediction tasks, further research is necessary to determine its efficacy in other domains.

Acknowledgement.

We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research.

References

  • [1] Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.
  • [2] Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019, 2022.
  • [3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, pages 4009–4018, 2021.
  • [4] Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai, Abdulaziz Almuzairee, Senthil Yogamani, and Fatih Porikli. X-align: Cross-modal cross-view alignment for bird’s-eye-view segmentation. In WACV, pages 3287–3297, 2023.
  • [5] Walid Bousselham, Guillaume Thibault, Lucas Pagano, Archana Machireddy, Joe Gray, Young Hwan Chang, and Xubo Song. Efficient self-ensemble framework for semantic segmentation. arXiv preprint arXiv:2111.13280, 2021.
  • [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Biggan: Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
  • [7] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
  • [8] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
  • [10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018.
  • [11] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
  • [12] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788, 2022.
  • [13] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  • [14] Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, and David J Fleet. A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366, 2022.
  • [15] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.
  • [16] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. In ICLR, 2023.
  • [17] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2023.
  • [18] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
  • [19] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 34:17864–17875, 2021.
  • [20] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  • [21] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [22] Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking. arXiv preprint arXiv:2210.01776, 2022.
  • [23] Jiequan Cui, Yuhui Yuan, Zhisheng Zhong, Zhuotao Tian, Han Hu, Stephen Lin, and Jiaya Jia. Region rebalance for long-tailed semantic segmentation. arXiv preprint arXiv:2204.01969, 2022.
  • [24] Giannis Daras, Yuval Dagan, Alexandros G Dimakis, and Constantinos Daskalakis. Consistent diffusion models: Mitigating sampling drift by learning to be consistent. arXiv preprint arXiv:2302.09057, 2023.
  • [25] Giannis Daras and Alexandros G Dimakis. Multiresolution textual inversion. arXiv preprint arXiv:2211.17115, 2022.
  • [26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [27] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34:8780–8794, 2021.
  • [28] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, 2014.
  • [29] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick Van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
  • [30] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pages 2002–2011, 2018.
  • [31] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
  • [32] Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Vision transformers with patch diversification. arXiv preprint arXiv:2104.12753, 2021.
  • [33] Ali Harakeh, Michael Smart, and Steven L Waslander. Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. In ICRA, pages 87–93. IEEE, 2020.
  • [34] Jan Hendrik Metzen, Mummadi Chaithanya Kumar, Thomas Brox, and Volker Fischer. Universal adversarial perturbations against semantic image segmentation. In ICCV, pages 2755–2764, 2017.
  • [35] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
  • [36] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022.
  • [37] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  • [38] Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In ECCV, pages 581–597, 2020.
  • [39] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
  • [40] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. arXiv preprint arXiv:2211.06220, 2022.
  • [41] Pan Ji, Runze Li, Bir Bhanu, and Yi Xu. Monoindoor: Towards good practice of self-supervised monocular depth estimation for indoor environments. In ICCV, pages 12787–12796, 2021.
  • [42] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
  • [43] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In ECCV, pages 491–507, 2020.
  • [44] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.
  • [45] Boying Li, Yuan Huang, Zeyu Liu, Danping Zou, and Wenxian Yu. Structdepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In ICCV, pages 12663–12673, 2021.
  • [46] Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, and Sanja Fidler. Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In CVPR, pages 8300–8311, 2021.
  • [47] Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
  • [48] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. arXiv preprint arXiv:2203.14211, 2022.
  • [49] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987, 2022.
  • [50] Fangjian Lin, Zhanhao Liang, Junjun He, Miao Zheng, Shengwei Tian, and Kai Chen. Structtoken: Rethinking semantic segmentation with structural prior. arXiv preprint arXiv:2203.12612, 2022.
  • [51] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
  • [53] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.
  • [54] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542, 2022.
  • [55] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 5(2):3153–3160, 2020.
  • [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [58] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, pages 8162–8171, 2021.
  • [59] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210, 2020.
  • [60] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [61] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021.
  • [62] Thomas Roddick, Alex Kendall, and Roberto Cipolla. Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188, 2018.
  • [63] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • [64] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
  • [65] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • [66] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. TPAMI, 2022.
  • [67] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. NeurIPS, 29:2234–2242, 2016.
  • [68] Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, and David J. Fleet. Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816, 2023.
  • [69] Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695, 2022.
  • [70] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, pages 746–760, 2012.
  • [71] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265, 2015.
  • [72] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [73] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • [74] Shuran Song, Jianxiong Xiao, Li Guo, and Xiaogang Yang. Sun rgb-d: A rgb-d scene understanding benchmark suite. CVPR, 2015.
  • [75] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. NeurIPS, 33:12438–12448, 2020.
  • [76] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, pages 7262–7272, 2021.
  • [77] Brian L Trippe, Jason Yim, Doug Tischer, Tamara Broderick, David Baker, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119, 2022.
  • [78] Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In CVPR, pages 4604–4612, 2020.
  • [79] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778, 2022.
  • [80] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021.
  • [81] Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. In MIDL, pages 1336–1348, 2022.
  • [82] Junde Wu, Huihui Fang, Yu Zhang, Yehui Yang, and Yanwu Xu. Medsegdiff: Medical image segmentation with diffusion probabilistic model. arXiv preprint arXiv:2211.00611, 2022.
  • [83] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
  • [84] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, pages 1369–1378, 2017.
  • [85] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. NeurIPS, 34, 2021.
  • [86] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose M Alvarez. M^2BEV: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022.
  • [87] Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and Elisa Ricci. Transformer-based attention networks for continuous pixel-wise prediction. In ICCV, pages 16269–16279, 2021.
  • [88] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In CVPR, pages 5485–5493, 2017.
  • [89] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Multimodal virtual point 3d detection. NeurIPS, 34:16494–16507, 2021.
  • [90] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In ICCV, pages 5684–5693, 2019.
  • [91] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, pages 173–190, 2020.
  • [92] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution vision transformer for dense prediction. NeurIPS, 34, 2021.
  • [93] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
  • [94] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  • [95] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-net: Towards unified image segmentation. NeurIPS, pages 10326–10338, 2021.
  • [96] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
  • [97] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, pages 6881–6890, 2021.
  • [98] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In CVPR, pages 13760–13769, 2022.
  • [99] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017.
  • [100] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. 2020.

Appendix A Diffusion Model

A.1 Algorithm details

As a supplement to Algorithm 1 and Algorithm 2 described in the main paper, we provide the implementation details in Algorithm 3 for better clarity. In addition, we present the implementation of the "self-aligned denoising" procedure in Algorithm 4, which is used during the last 5K training iterations to address the sampling drift problem (see Section 3.4). Figure 4 illustrates the gap between the training and inference denoising targets.

def alpha_cumprod(t, ns=0.0002, ds=0.00025):
    """cosine noise schedule"""
    n = torch.cos((t + ns) / (1 + ds) * math.pi / 2) ** -2
    return -torch.log(n - 1, eps=1e-5)

def ddim(map_t, map_pred, t_now, t_next):
    """
    estimate x at t_next with the DDIM update rule.
    """
    alpha_now = alpha_cumprod(t_now)
    alpha_next = alpha_cumprod(t_next)
    # encode and scale the predicted map
    map_enc = encoding(map_pred)
    map_enc = (sigmoid(map_enc) * 2 - 1) * scale
    # recover the noise estimate, then take one deterministic DDIM step
    eps = 1 / sqrt(1 - alpha_now) * (map_t - sqrt(alpha_now) * map_enc)
    map_next = sqrt(alpha_next) * map_enc + sqrt(1 - alpha_next) * eps
    return map_next
Algorithm 3 DDIM Update
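During inference, this update is simply iterated: starting from pure Gaussian noise, the map decoder predicts a map, which Algorithm 3 re-encodes and moves one step closer to t = 0. The following sketch illustrates the loop; it reuses the names from the pseudocode above (image_encoder, map_decoder, ddim), while the evenly spaced time steps and the linspace helper are our own simplification rather than the released implementation.

def sample(images, steps=3):
    """multi-step inference: iterate the DDIM update of Algorithm 3"""
    img_enc = image_encoder(images)
    map_t = normal(mean=0, std=1)        # start from pure noise at t = 1
    times = linspace(1, 0, steps + 1)    # assumed uniform spacing, e.g. [1.0, 0.66, 0.33, 0.0]
    for t_now, t_next in zip(times[:-1], times[1:]):
        map_pred = map_decoder(map_t, img_enc, t_now)
        map_t = ddim(map_t, map_pred, t_now, t_next)
    return map_pred

Setting steps to 1 recovers single-step inference, while 3-5 steps correspond to the saturation point discussed below.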

A.2 More Discussions

As illustrated in Figure 3a, diffusion models for perceptual tasks tend to reach a saturation point within the first few steps, usually between 3 and 5, so additional diffusion steps bring little benefit. This contrasts with generative models for image generation, where many iterations (from 10 to 50 steps) are often necessary. Intuitively, in generative tasks such as image generation, the goal is to produce complete and high-quality results by progressively incorporating more information at each time step, gradually accumulating and refining the overall result; reaching convergence may therefore require more steps to fully accumulate the necessary information. In perceptual tasks, such as semantic segmentation and object detection, the process from image to label gradually reduces information, and the critical information sufficient to make a decision can be obtained in only a few steps. Therefore, further diffusion plays a limited role in improving prediction accuracy, and performance peaks early, within three to five steps. In short, the diffusion process in a perception task can make decisions once the most important information has been accumulated, which is why DDP achieves high accuracy in perception tasks with minimal computational cost.

def train(images, maps):
    """
    images: [b, 3, h, w], maps: [b, 1, h, w]
    """
    img_enc = image_encoder(images)
    map_t = normal(mean=0, std=1)
    map_pred = map_decoder(map_t, img_enc, t=1)
    # encode map_pred
    map_enc = encoding(map_pred.detach())
    map_enc = (sigmoid(map_enc) * 2 - 1) * scale
    # corrupt the map_enc
    t, eps = uniform(0, 1), normal(mean=0, std=1)
    map_crpt = sqrt(alpha_cumprod(t)) * map_enc + sqrt(1 - alpha_cumprod(t)) * eps
    # predict
    map_pred = map_decoder(map_crpt, img_enc, t)
    loss = objective_func(map_pred, maps)
    return loss
Algorithm 4 DDP Self-aligned Denoising
Figure 4: Sampling drift. The denoising targets differ between the training process and the inference process.

Appendix B Implementation Details

B.1 Semantic Segmentation

ADE20K.

We conduct the ADE20K [99] semantic segmentation experiments based on MMSegmentation [20]. In the training phase, the backbone is initialized with ImageNet [26] pre-trained weights. We optimize our DDP models using the AdamW [56] optimizer with an initial learning rate of 6e-5 and a weight decay of 0.01. The learning rate is decayed following a polynomial schedule with a power of 1.0. Besides, we randomly resize and crop the image to 512×512 for training, and rescale images to have a shorter side of 512 pixels during testing. All models are trained for 160k iterations with a batch size of 16 and compared fairly with previous discriminative-based and non-diffusion methods.
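For reference, the optimizer and schedule above can be expressed in a few lines of PyTorch. This is a minimal sketch of the described settings (AdamW, learning rate 6e-5, weight decay 0.01, poly decay with power 1.0 over 160k iterations), not the exact MMSegmentation config; the placeholder model stands in for the DDP network.

import torch

# placeholder network standing in for the DDP backbone and map decoder
model = torch.nn.Conv2d(3, 150, kernel_size=1)

max_iters = 160_000
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=0.01)
# "poly" schedule with power 1.0: the learning rate decays linearly to zero
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** 1.0)

for it in range(max_iters):
    # forward/backward on a randomly resized-and-cropped 512x512 batch goes here
    optimizer.step()
    scheduler.step()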

Cityscapes.

The Cityscapes dataset includes 5,000 high-resolution images, comprising 2,975 training images, 500 validation images, and 1,525 testing samples. The images are captured in 50 different German cities and cover various environments such as highways, city centers, and suburbs. As with ADE20K, we load ImageNet pre-trained weights and employ the AdamW optimizer during training. Following common practice, we randomly resize and crop the image to 512×1024 for training, and use the original 1024×2048 images for testing. Other hyper-parameters are kept the same as in our ADE20K experiments.

B.2 BEV Map Segmentation

nuScenes.

We conduct our BEV map segmentation experiments on nuScenes [7], a large-scale multi-modal dataset for 3D detection and map segmentation. The dataset is split into 700/150/150 scenes for training/validation/testing. It contains data from multiple sensors, including six cameras, one LiDAR, and five radars. For camera inputs, each frame consists of six views of the surrounding environment captured at the same timestamp. We resize the input views to 256×704 and voxelize the point cloud with a resolution of 0.1m. Our evaluation metrics align with [54]: we report the IoU of six background classes, including drivable space, pedestrian crossing, walkway, stop line, car-parking area, and lane divider, and use the mean IoU as the primary evaluation metric. We adopt the image and LiDAR data augmentation strategies from [8] for training. AdamW is used with a weight decay of 0.01 and a learning rate of 5e-5. We train for 20 epochs in total on 8 A100 GPUs with a batch size of 32. Other training settings are kept the same as [54] for fair comparisons.
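The IoU used here treats each of the six map classes as an independent binary segmentation problem on the BEV grid. The sketch below is our own illustration of this metric (not the evaluation code of [54]); the 0.5 threshold and the 200×200 grid size in the toy usage are assumptions.

import torch

def bev_map_iou(pred, gt, threshold=0.5, eps=1e-6):
    """
    pred: [B, 6, H, W] predicted BEV map probabilities
    gt:   [B, 6, H, W] binary ground-truth masks
    returns the per-class IoU [6] and the mean IoU
    """
    pred_bin = (pred > threshold).float()
    inter = (pred_bin * gt).sum(dim=(0, 2, 3))
    union = ((pred_bin + gt) > 0).float().sum(dim=(0, 2, 3))
    iou = inter / (union + eps)
    return iou, iou.mean()

# toy usage on a random BEV grid
pred = torch.rand(2, 6, 200, 200)
gt = (torch.rand(2, 6, 200, 200) > 0.5).float()
per_class_iou, miou = bev_map_iou(pred, gt)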

B.3 Depth Estimation

KITTI.

The KITTI depth estimation dataset is a widely used benchmark for monocular depth estimation with a depth range of 0-80m. The stereo images have a resolution of 1242×375, while the corresponding ground-truth depth maps have a low density of 3.75% to 5.0%. Following the standard Eigen training/testing split [28], we use around 26K left-view images for training and 697 frames for testing. We incorporate the DDP model into the codebase developed by [48] for the KITTI depth estimation experiments. We exclude the discrete label encoding module since the task requires continuous value regression. All other experimental settings are the same as [48] for a fair comparison.
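Tables 6 and 7 below report the standard monocular depth metrics (threshold accuracies δ1-δ3, absolute relative error, RMS error, and log10 error). The following sketch shows the conventional definitions computed over valid ground-truth pixels; it is our own illustration, not the exact evaluation script of [48], and the 80m cap corresponds to KITTI (10m is used for the indoor datasets).

import torch

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """pred, gt: [H, W] depth maps in meters; sparse gt uses zeros for invalid pixels."""
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid].clamp(min_depth, max_depth), gt[valid]
    ratio = torch.max(pred / gt, gt / pred)
    return {
        "delta1": (ratio < 1.25).float().mean(),
        "delta2": (ratio < 1.25 ** 2).float().mean(),
        "delta3": (ratio < 1.25 ** 3).float().mean(),
        "rel": ((pred - gt).abs() / gt).mean(),        # absolute relative error
        "rms": ((pred - gt) ** 2).mean().sqrt(),       # root mean squared error
        "log10": (pred.log10() - gt.log10()).abs().mean(),
    }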

Method            δ1↑    δ2↑    δ3↑    REL↓   RMS↓   log10↓
Chen et al.       0.757  0.943  0.984  0.166  0.494  0.071
Yin et al. [90]   0.696  0.912  0.973  0.183  0.541  0.082
BTS [44]          0.740  0.933  0.980  0.172  0.515  0.075
AdaBins [3]       0.771  0.944  0.983  0.159  0.476  0.068
DepthFormer [48]  0.815  0.970  0.993  0.137  0.408  0.059
DDP (step 3)      0.825  0.973  0.994  0.128  0.397  0.056
Table 6: Depth estimation on the SUN RGB-D dataset. We report the results of the model trained on the NYU-DepthV2 dataset and tested on the SUN RGB-D dataset without fine-tuning.

NYU-DepthV2.

NYU-DepthV2 is an indoor scene dataset consisting of RGB and depth images captured at a resolution of 640×480 pixels. The dataset contains over 1,449 pairs of aligned RGB-depth images, captured from 464 different indoor areas. We train DDP using image pairs at a resolution of 320×240, with depths up to approximately 10 meters. Following previous work, we evaluate the results on the center crop predefined by [28]. For a fair comparison, all experimental configurations are aligned with the previous method [48].

SUN RGB-D.

We use this dataset [74] to evaluate generalization. To be specific, we assess the performance of our NYU pre-trained models on the official test set, which includes 5,050 images, without any additional fine-tuning. The maximum depth is restricted to 10 meters. Please note that this dataset is solely intended for evaluation purposes and is not utilized for training.

Method            δ1↑    δ2↑    δ3↑    REL↓   RMSE↓  log10↓
StructDepth [45]  0.817  0.955  0.988  0.140  0.534  0.060
MonoIndoor [41]   0.823  0.958  0.989  0.134  0.526  -
DORN [30]         0.828  0.965  0.992  0.115  0.509  0.051
BTS [44]          0.885  0.978  0.994  0.110  0.392  0.047
DAV [38]          0.882  0.980  0.996  0.108  0.412  -
TransDepth [87]   0.900  0.983  0.996  0.106  0.365  0.045
DPT-Hybrid [61]   0.904  0.988  0.998  0.110  0.357  0.045
AdaBins [3]       0.903  0.984  0.997  0.103  0.364  0.044
DepthFormer [48]  0.921  0.989  0.998  0.096  0.339  0.041
DDP (step 3)      0.921  0.990  0.998  0.094  0.329  0.040
Table 7: Depth estimation on the NYU-DepthV2 val set. We report the performance of DDP with 3 diffusion steps. The best and second-best results are bolded or underlined, respectively. ↓ means lower is better, and ↑ means higher is better.

Appendix C Experimental Results

Table 7 reports the depth estimation performance of DDP on the NYU-DepthV2 dataset, and Table 6 reports the generalization results of DDP on the SUN RGB-D dataset.

Appendix D Visualization

Figure 5 and Figure 6 visualize the "multiple inference" property of DDP on the validation sets of Cityscapes and ADE20K, respectively. These inference trajectories show that DDP continuously improves its predictions and produces smoother segmentation maps as more sampling steps are used. Figure 7 presents the BEV map segmentation results of DDP (step 3) alongside the ground truths and multi-view input images. Figure 8 and Figure 9 compare the depth estimation results of DDP (step 3) with the ground truths on the validation sets of KITTI and NYU-DepthV2, respectively. These results indicate that our method can be easily generalized to most dense prediction tasks.

Figure 5: Visualization of multiple inference on Cityscapes val set.
Figure 6: Visualization of multiple inference on ADE20K val set.
Figure 7: Visualization of predicted BEV map segmentation results on nuScenes val set.
Figure 8: Visualization of predicted depth estimation results on KITTI val set.
Figure 9: Visualization of predicted depth estimation results on NYU-DepthV2 val set.

Appendix E More Applications

E.1 Combine DDP with ControlNet

Setup.

Compared with previous single-step models, DDP achieves more continuous and semantically consistent predictions. To demonstrate the benefits of this pixel-clustering property, we combine DDP with the recently popular mask-conditional generation model ControlNet. We follow the official ControlNet implementation for all hyperparameters, including the input resolution and the number of DDIM sampling steps.

Implementation

ControlNet [94] improves upon the original Stable Diffusion (SD) model by adding extra conditions through an auxiliary conditioning network. In the mask-conditional ControlNet, the map produced by a segmentation model is used as the condition for image synthesis. The original segmentation model is Uniformer-S [47] with an UperNet head, which has 52M parameters and achieves 47.6 mIoU (ss) on the ADE20K dataset. For a fair comparison, we replace the original segmentation model in the mask-conditional ControlNet with DDP using a Swin-T backbone, which has 40M parameters and achieves 47.0 mIoU (ss) on ADE20K. Note that all results are obtained with the default prompt.
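In practice, the pipeline is: run the segmentation model on the input image, colorize the predicted map with the ADE20K palette, and pass the colorized map to the mask-conditional ControlNet as the control image. The sketch below uses the Hugging Face diffusers pipeline and the public lllyasviel/sd-controlnet-seg checkpoint as a stand-in for the official ControlNet code; ddp_segment and ade20k_palette are hypothetical helpers for DDP inference and palette colorization, and the prompt is only illustrative.

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

image = Image.open("street.jpg").convert("RGB")   # any test image
seg_ids = ddp_segment(image)                      # hypothetical: DDP (Swin-T) prediction, [H, W] class ids
control = ade20k_palette(seg_ids)                 # hypothetical: colorize class ids with the ADE20K palette
result = pipe("high quality, detailed", image=control,
              num_inference_steps=20).images[0]
result.save("generated.png")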

Results

We select images from the Pexels website (https://www.pexels.com/) for testing in different scenarios. The results of the original ControlNet and of DDP combined with ControlNet are shown in Figure 10. Although ControlNet is designed for fine-grained, controllable image generation, our experiments show that DDP produces more consistent condition maps and offers advantages in various scenarios. Moreover, when combined with DDP, ControlNet produces visually satisfying and well-composed results, surpassing those of the original ControlNet. These results suggest that DDP has great potential for improving cooperation with other types of foundation models.

Figure 10: Controlling Stable Diffusion with semantic maps. The Uniformer-UperNet and DDP segmentation models are used to predict the segmentation maps that serve as the condition input. All results are obtained with the default prompt.