
Score-based Generative Modeling of Graphs via
the System of Stochastic Differential Equations

Jaehyeong Jo    Seul Lee    Sung Ju Hwang
Abstract

Generating graph-structured data requires learning the underlying distribution of graphs. Yet, this is a challenging problem, and previous graph generative methods either fail to capture the permutation-invariance property of graphs or cannot sufficiently model the complex dependency between nodes and edges, which is crucial for generating real-world graphs such as molecules. To overcome such limitations, we propose a novel score-based generative model for graphs with a continuous-time framework. Specifically, we propose a new graph diffusion process that models the joint distribution of the nodes and edges through a system of stochastic differential equations (SDEs). Then, we derive novel score matching objectives tailored for the proposed diffusion process to estimate the gradient of the joint log-density with respect to each component, and introduce a new solver for the system of SDEs to efficiently sample from the reverse diffusion process. We validate our graph generation method on diverse datasets, on which it either significantly outperforms the baselines or achieves competitive performance. Further analysis shows that our method is able to generate molecules that lie close to the training distribution yet do not violate the chemical valency rule, demonstrating the effectiveness of the system of SDEs in modeling the node-edge relationships. Our code is available at https://github.com/harryjo97/GDSS.

Graph Generation, Score-based Generative Modeling

1 Introduction

Learning the underlying distribution of graph-structured data is an important yet challenging problem with wide applications, such as understanding social networks (Grover et al., 2019; Wang et al., 2018), drug design (Simonovsky & Komodakis, 2018; Li et al., 2018c), neural architecture search (NAS) (Xie et al., 2019; Lee et al., 2021), and even program synthesis (Brockschmidt et al., 2019). Recently, deep generative models have shown success in graph generation by modeling complicated structural properties of graphs, exploiting the expressivity of neural networks. Among them, autoregressive models (You et al., 2018b; Liao et al., 2019) construct a graph via sequential decisions, while one-shot generative models (De Cao & Kipf, 2018; Liu et al., 2019) generate the components of a graph at once. Although these models have achieved a certain degree of success, they also possess clear limitations. Autoregressive models are computationally costly and cannot capture the permutation-invariant nature of graphs, while one-shot generative models based on the likelihood fail to model structural information due to the restrictions on their architectures required to ensure tractable likelihood computation.

Apart from the likelihood-based methods, Niu et al. (2020) introduced a score-based generative model for graphs, namely, the edge-wise dense prediction graph neural network (EDP-GNN). However, since EDP-GNN utilizes a discrete-step perturbation with heuristically chosen noise scales to estimate the score function, both its flexibility and its efficiency are limited. Moreover, EDP-GNN only generates the adjacency matrices of graphs, and is thus unable to fully capture the node-edge dependency that is crucial for the generation of real-world graphs such as molecules.

To overcome the limitations of previous graph generative models, we propose a novel score-based graph generation framework on a continuous-time domain that can generate both the node features and the adjacency matrix. Specifically, we propose a novel Graph Diffusion via the System of Stochastic differential equations (GDSS), which describes the perturbation of both node features and adjacency through a system of SDEs, and show that the previous work of Niu et al. (2020) is a special instance of GDSS. Diffusion via the system of SDEs can be interpreted as the decomposition of the full diffusion into simpler diffusion processes of respective components while modeling the dependency. Further, we derive novel training objectives for the proposed diffusion, which enable us to estimate the gradient of the joint log-density with respect to each component, and introduce a new integrator for solving the proposed system of SDEs.

We experimentally validate our method on generic graph generation tasks by evaluating the generation quality on synthetic and real-world graphs, on which ours outperforms existing one-shot generative models while achieving competitive performance to autoregressive models. We further validate our method on molecule generation tasks, where ours outperforms the state-of-the-art baselines including the autoregressive methods, demonstrating that the proposed diffusion process through the system of SDEs is able to capture the complex dependency between nodes and edges. We summarize our main contributions as follows:

  • We propose a novel score-based generative model for graphs that overcomes the limitation of previous generative methods, by introducing a diffusion process for graphs that can generate node features and adjacency matrices simultaneously via the system of SDEs.

  • We derive novel training objectives to estimate the gradient of the joint log-density for the proposed diffusion process and further introduce an efficient integrator to solve the proposed system of SDEs.

  • We validate our method on both synthetic and real-world graph generation tasks, on which ours outperforms existing graph generative models.

2 Related Work

Score-based Generative Models

Score-based generative models generate samples from noise by first perturbing the data with gradually increasing noise, then learning to reverse the perturbation via estimating the score function, which is the gradient of the log-density function with respect to the data. Two representative classes of score-based generative models have been proposed by Song & Ermon (2019) and Ho et al. (2020), respectively. Score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019) estimates the score function at multiple noise scales, then generates samples using annealed Langevin dynamics to slowly decrease the scales. On the other hand, denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020) backtracks each step of the noise perturbation by considering the diffusion process as a parameterized Markov chain and learning the transition of the chain. Recently, Song et al. (2021b) showed that these approaches can be unified into a single framework, describing the noise perturbation as a forward diffusion process modeled by a stochastic differential equation (SDE). Although score-based generative models have shown successful results for the generation of images (Ho et al., 2020; Song et al., 2021b, a; Dhariwal & Nichol, 2021), audio (Chen et al., 2021; Kong et al., 2021; Jeong et al., 2021; Mittal et al., 2021), and point clouds (Cai et al., 2020; Luo & Hu, 2021), the graph generation task remains underexplored due to the discreteness of the data structure and the complex dependency between nodes and edges. We are the first to propose a diffusion process for graphs and further model the dependency through a system of SDEs. Notably, recent developments in score-based generative methods, such as the latent score-based generative model (LSGM) (Vahdat et al., 2021) and critically-damped Langevin diffusion (CLD) (Dockhorn et al., 2021), are complementary to our method, as we can apply these methods to improve each component-wise diffusion process.

Graph Generative Models

The common goal of graph generative models is to learn the underlying distribution of graphs. Graph generative models can be classified into two categories based on the type of generation process: autoregressive and one-shot. Autoregressive graph generative models include generative adversarial network (GAN) models (You et al., 2018a), recurrent neural network (RNN) models (You et al., 2018b; Popova et al., 2019), variational autoencoder (VAE) models (Jin et al., 2018, 2020), and normalizing flow models (Shi et al., 2020; Luo et al., 2021). Works that specifically focus on the scalability of generation comprise another branch of autoregressive models (Liao et al., 2019; Dai et al., 2020). Although autoregressive models show state-of-the-art performance, they are computationally expensive and cannot model the permutation-invariant nature of the true data distribution. On the other hand, one-shot graph generative models aim to directly model the distribution of all the components of a graph as a whole, thereby considering the structural dependency as well as the permutation-invariance. One-shot graph generative models can be categorized into GAN models (De Cao & Kipf, 2018), VAE models (Ma et al., 2018), and normalizing flow models (Madhawa et al., 2019; Zang & Wang, 2020). There are also recent approaches that utilize energy-based models (EBMs) and score-based models, respectively (Liu et al., 2021b; Niu et al., 2020). Existing one-shot generative models perform poorly due to the restricted model architectures required for building a normalized probability, which are insufficient to learn the complex dependency between nodes and edges. To overcome these limitations, we introduce a novel permutation-invariant one-shot generative method based on the score-based model that imposes fewer constraints on the model architecture compared to previous one-shot methods.

Figure 1: (Left) Visualization of graph generation through the reverse-time diffusion process. The colored trajectories denote different types of diffusion processes in the joint probability space of node features $\bm{X}$ and adjacency $\bm{A}$. We compare three types of diffusion: GDSS (green) can successfully generate samples from the data distribution by modeling the dependency between the components, whereas GDSS-seq (red) or the independent diffusion of each component (orange) fails. (Right) Illustration of the proposed score-based graph generation framework. GDSS generates $\bm{X}$ and $\bm{A}$ simultaneously by modeling the dependency through time, whereas GDSS-seq generates them sequentially. EDP-GNN generates only $\bm{A}$, with $\bm{X}$ sampled from the training data. Note that GDSS and GDSS-seq are based on a continuous-time diffusion process, while EDP-GNN is based on a discrete-step perturbation procedure.

Score-based Graph Generation

To the best of our knowledge, Niu et al. (2020) is the only work that approaches graph generation with the score-based generative model, which aims to generate graphs by estimating the score of the adjacency matrices at a finite number of noise scales, and using Langevin dynamics to sample from a series of decreasing noise scales. However, Langevin dynamics requires numerous sampling steps for each noise scale, thereby taking a long time for the generation and having difficulty in generating large-scale graphs. Furthermore, Niu et al. (2020) focuses on the generation of adjacency without the generation of node features, resulting in suboptimal learning of the distributions of node-attributed graphs such as molecular graphs. A naive extension of the work of Niu et al. (2020) that generates the node features and the adjacency either simultaneously or alternately will still be suboptimal, since it cannot capture the complex dependency between the nodes and edges. Therefore, we propose a novel score-based generative framework for graphs that can interdependently generate the nodes and edges. Specifically, we propose a novel diffusion process for graphs through a system of SDEs that smoothly transforms the data distribution to known prior and vice versa, which overcomes the limitation of the previous discrete-step perturbation procedure.

3 Graph Diffusion via the System of SDEs

In this section, we introduce our novel continuous-time score-based generative framework for modeling graphs using the system of SDEs. We first explain our proposed graph diffusion process via a system of SDEs in Section 3.1, then derive new objectives for estimating the gradients of the joint log-density with respect to each component in Section 3.2. Finally, we present an effective method for solving the system of reverse-time SDEs in Section 3.3.

3.1 Graph Diffusion Process

The goal of graph generation is to synthesize graphs that closely follow the distribution of the observed set of graphs. To bypass the difficulty of directly representing the distribution, we introduce a continuous-time score-based generative framework for the graph generation. Specifically, we propose a novel graph diffusion process via the system of SDEs that transforms the graphs to noise and vice versa, while modeling the dependency between nodes and edges. We begin by explaining the proposed diffusion process for graphs.

A graph $\bm{G}$ with $N$ nodes is defined by its node features $\bm{X} \in \mathbb{R}^{N \times F}$ and the weighted adjacency matrix $\bm{A} \in \mathbb{R}^{N \times N}$ as $\bm{G} = (\bm{X}, \bm{A}) \in \mathbb{R}^{N \times F} \times \mathbb{R}^{N \times N} \coloneqq \mathcal{G}$, where $F$ is the dimension of the node features. To model the dependency between $\bm{X}$ and $\bm{A}$, we propose a forward diffusion process of graphs that transforms both the node features and the adjacency matrices to a simple noise distribution. Formally, the diffusion process can be represented as the trajectory of random variables $\{\bm{G}_t = (\bm{X}_t, \bm{A}_t)\}_{t \in [0,T]}$ in a fixed time horizon $[0,T]$, where $\bm{G}_0$ is a graph from the data distribution $p_{data}$. The diffusion process can be modeled by the following Itô SDE:

$$\mathrm{d}\bm{G}_t = \mathbf{f}_t(\bm{G}_t)\,\mathrm{d}t + \mathbf{g}_t(\bm{G}_t)\,\mathrm{d}\mathbf{w}, \qquad \bm{G}_0 \sim p_{data}, \tag{1}$$

where $\mathbf{f}_t(\cdot)\colon \mathcal{G} \to \mathcal{G}$ is the linear drift coefficient, $\mathbf{g}_t(\cdot)\colon \mathcal{G} \to \mathcal{G} \times \mathcal{G}$ is the diffusion coefficient, and $\mathbf{w}$ is the standard Wiener process (the $t$-subscript denotes a function of time: $F_t(\cdot) \coloneqq F(\cdot, t)$). Intuitively, the forward diffusion process of Eq. (1) smoothly transforms both $\bm{X}_0$ and $\bm{A}_0$ by adding infinitesimal noise $\mathrm{d}\mathbf{w}$ at each infinitesimal time step $\mathrm{d}t$. The coefficients of the SDE, $\mathbf{f}_t$ and $\mathbf{g}_t$, are chosen such that at the terminal time horizon $T$, the diffused sample $\bm{G}_T$ approximately follows a prior distribution with a tractable form from which samples can be generated efficiently, for example a Gaussian distribution. For ease of presentation, we choose $\mathbf{g}_t(\bm{G}_t)$ to be a scalar function $g_t$. Note that ours is the first work that proposes a diffusion process for generating a whole graph consisting of nodes and edges with attributes, in that the work of Niu et al. (2020) (1) utilizes the finite-step perturbation of multiple noise scales, and (2) only focuses on the perturbation of the adjacency matrices while using fixed node features sampled from the training data.

In order to generate graphs that follow the data distribution, we start from samples of the prior distribution and traverse the diffusion process of Eq. (1) backward in time. Notably, the reverse of the diffusion process in time is also a diffusion process described by the following reverse-time SDE (Anderson, 1982; Song et al., 2021b):

$$\mathrm{d}\bm{G}_t = \left[\mathbf{f}_t(\bm{G}_t) - g_t^2\,\nabla_{\bm{G}_t}\log p_t(\bm{G}_t)\right]\mathrm{d}\bar{t} + g_t\,\mathrm{d}\bar{\mathbf{w}}, \tag{2}$$

where $p_t$ denotes the marginal distribution under the forward diffusion process at time $t$, $\bar{\mathbf{w}}$ is a reverse-time standard Wiener process, and $\mathrm{d}\bar{t}$ is an infinitesimal negative time step. However, solving Eq. (2) directly requires the estimation of the high-dimensional score $\nabla_{\bm{G}_t}\log p_t(\bm{G}_t) \in \mathbb{R}^{N \times F} \times \mathbb{R}^{N \times N}$, which is expensive to compute. To bypass this computation, we propose a novel reverse-time diffusion process equivalent to Eq. (2), modeled by the following system of SDEs:

$$\begin{cases}
\mathrm{d}\bm{X}_t = \left[\mathbf{f}_{1,t}(\bm{X}_t) - g_{1,t}^2\,\nabla_{\bm{X}_t}\log p_t(\bm{X}_t, \bm{A}_t)\right]\mathrm{d}\bar{t} + g_{1,t}\,\mathrm{d}\bar{\mathbf{w}}_1 \\
\mathrm{d}\bm{A}_t = \left[\mathbf{f}_{2,t}(\bm{A}_t) - g_{2,t}^2\,\nabla_{\bm{A}_t}\log p_t(\bm{X}_t, \bm{A}_t)\right]\mathrm{d}\bar{t} + g_{2,t}\,\mathrm{d}\bar{\mathbf{w}}_2
\end{cases} \tag{3}$$

where $\mathbf{f}_{1,t}$ and $\mathbf{f}_{2,t}$ are linear drift coefficients satisfying $\mathbf{f}_t(\bm{X}, \bm{A}) = \left(\mathbf{f}_{1,t}(\bm{X}), \mathbf{f}_{2,t}(\bm{A})\right)$, $g_{1,t}$ and $g_{2,t}$ are scalar diffusion coefficients, and $\bar{\mathbf{w}}_1$, $\bar{\mathbf{w}}_2$ are reverse-time standard Wiener processes. We refer to these forward and reverse diffusion processes of graphs as Graph Diffusion via the System of SDEs (GDSS). Notably, each SDE in Eq. (3) describes the diffusion process of one component, $\bm{X}$ or $\bm{A}$, respectively, which presents a new perspective: the diffusion of a graph can be interpreted as the diffusion of its components, interrelated through time. In practice, we can choose different types of SDEs for each component-wise diffusion that best suit the generation process.

The key property of GDSS is that the diffusion processes in the system are dependent on each other, related by the gradients of the joint log-density $\nabla_{\bm{X}_t}\log p_t(\bm{X}_t, \bm{A}_t)$ and $\nabla_{\bm{A}_t}\log p_t(\bm{X}_t, \bm{A}_t)$, which we refer to as the partial score functions. By leveraging the partial scores to model the dependency between the components through time, GDSS is able to represent the diffusion process of a whole graph, consisting of nodes and edges. To demonstrate the importance of modeling the dependency, we present two variants of our proposed GDSS and compare their generative performance.

The first variant is the continuous-time version of EDP-GNN (Niu et al., 2020). By ignoring the diffusion process of $\bm{X}$ in Eq. (3) with $\mathbf{f}_{1,t} = g_{1,t} = 0$ and choosing the prior distribution of $\bm{X}$ as the data distribution, we obtain a diffusion process of $\bm{A}$ that generalizes the discrete-step noise perturbation procedure of EDP-GNN. Therefore, EDP-GNN can be considered a special case of GDSS without the diffusion process of the node features, which further replaces the diffusion with a discrete-step perturbation using a finite number of noise scales. We present another variant of GDSS that generates $\bm{X}$ and $\bm{A}$ sequentially instead of generating them simultaneously. By neglecting part of the dependency through the assumptions $\nabla_{\bm{X}_t}\log p_t(\bm{X}_t, \bm{A}_t) \approx \nabla_{\bm{X}_t}\log p_t(\bm{X}_t)$ and $\nabla_{\bm{A}_t}\log p_t(\bm{X}_t, \bm{A}_t) \approx \nabla_{\bm{A}_t}\log p_t(\bm{X}_0, \bm{A}_t)$, we can derive the following SDEs for the diffusion process of this variant:

$$\begin{aligned}
\mathrm{d}\bm{X}_t &= \left[\mathbf{f}_{1,t}(\bm{X}_t) - g_{1,t}^2\,\nabla_{\bm{X}_t}\log p_t(\bm{X}_t)\right]\mathrm{d}\bar{t} + g_{1,t}\,\mathrm{d}\bar{\mathbf{w}}_1, \\
\mathrm{d}\bm{A}_t &= \left[\mathbf{f}_{2,t}(\bm{A}_t) - g_{2,t}^2\,\nabla_{\bm{A}_t}\log p_t(\bm{X}_0, \bm{A}_t)\right]\mathrm{d}\bar{t} + g_{2,t}\,\mathrm{d}\bar{\mathbf{w}}_2,
\end{aligned} \tag{4}$$

which are sequential in the sense that the reverse diffusion process of $\bm{A}$ is determined by $\bm{X}_0$, the result of the reverse diffusion of $\bm{X}$. Thus simulating Eq. (4) can be interpreted as generating the node features $\bm{X}$ first, then generating the adjacency $\bm{A}$ sequentially, which we refer to as GDSS-seq.

The reverse diffusion processes of these two variants, EDP-GNN and GDSS-seq, are visualized in Figure 1 as the red trajectory in the joint $(\bm{X}, \bm{A})$-space, where the trajectory is constrained to the hyperplane defined by $\bm{X}_t = \bm{X}_0$ and therefore does not fully reflect the dependency. On the other hand, GDSS, represented as the green trajectory, is able to diffuse freely from the noise to the data distribution by modeling the joint distribution through the system of SDEs, thereby successfully generating samples from the data distribution.

Figure 2: A toy experiment on modeling the dependency. GDSS successfully models the correlation, whereas others fail. See Appendix C.1 for more details of the experiment.

To empirically verify this observation, we conduct a simple experiment where the data distribution is a bivariate Gaussian mixture. The generated samples of each process are shown in Figure 2. While GDSS successfully represents the correlation of the two variables, GDSS-seq fails to capture their covariance and generates samples that deviate from the data distribution. We observe that EDP-GNN shows a similar result. We further extensively validate the effectiveness of our GDSS in modeling the dependency in Section 4.

Note that once the partial scores in Eq. (3) are known for all $t$, the system of reverse-time SDEs can be used as a generative model by simulating the system backward in time, which we further explain in Section 3.3. In order to estimate the partial scores with neural networks, we introduce novel training objectives for GDSS in the following subsection.
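To make this backward simulation concrete, below is a minimal sketch of a single reverse-time Euler-Maruyama step for the system in Eq. (3), assuming VP-SDE coefficients $\mathbf{f}_{i,t}(\bm{Z}) = -\tfrac{1}{2}\beta(t)\bm{Z}$ and $g_{i,t} = \sqrt{\beta(t)}$ for both components; `score_x` and `score_a` are hypothetical stand-ins for the trained networks $\bm{s}_{\theta,t}$ and $\bm{s}_{\phi,t}$. Section 3.3 introduces a more efficient solver.

```python
import torch

def reverse_em_step(X, A, t, dt, beta, score_x, score_a):
    """One reverse-time Euler-Maruyama step of the system in Eq. (3),
    from time t to t - dt (a sketch, assuming VP-SDE coefficients)."""
    b = beta(t)
    # Both partial scores are evaluated at the joint state (X_t, A_t);
    # this is exactly what couples the two diffusion processes.
    sx = score_x(X, A, t)  # approximates grad_X log p_t(X_t, A_t)
    sa = score_a(X, A, t)  # approximates grad_A log p_t(X_t, A_t)

    drift_x = -0.5 * b * X - b * sx  # f_{1,t}(X) - g_{1,t}^2 * s_theta
    drift_a = -0.5 * b * A - b * sa  # f_{2,t}(A) - g_{2,t}^2 * s_phi

    # Stepping backward in time: subtract the drift, add fresh noise.
    X_next = X - drift_x * dt + (b * dt) ** 0.5 * torch.randn_like(X)
    A_next = A - drift_a * dt + (b * dt) ** 0.5 * torch.randn_like(A)
    return X_next, A_next
```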

3.2 Estimating the Partial Score Functions

Training Objectives

The partial score functions can be estimated by training the time-dependent score-based models $\bm{s}_{\theta,t}$ and $\bm{s}_{\phi,t}$ so that $\bm{s}_{\theta,t}(\bm{G}_t) \approx \nabla_{\bm{X}_t}\log p_t(\bm{G}_t)$ and $\bm{s}_{\phi,t}(\bm{G}_t) \approx \nabla_{\bm{A}_t}\log p_t(\bm{G}_t)$. However, the objectives introduced in previous works for estimating the score function are not directly applicable, since the partial score functions are defined as gradients with respect to each component, not the gradient with respect to the whole data as in the standard score function. Thus we derive new objectives for estimating the partial scores.

Intuitively, the score-based models should be trained to minimize the distance to the corresponding ground-truth partial scores. In order to minimize the Euclidean distance, we introduce new objectives that generalize the score matching (Hyvärinen, 2005; Song et al., 2021b) to the estimation of the partial scores for the given graph dataset, as follows:

$$\begin{aligned}
&\min_\theta\, \mathbb{E}_t\Big\{\lambda_1(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\theta,t}(\bm{G}_t) - \nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\|_2^2\Big\} \\
&\min_\phi\, \mathbb{E}_t\Big\{\lambda_2(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\phi,t}(\bm{G}_t) - \nabla_{\bm{A}_t}\log p_t(\bm{G}_t)\big\|_2^2\Big\},
\end{aligned} \tag{5}$$

where $\lambda_1(t)$ and $\lambda_2(t)$ are positive weighting functions and $t$ is uniformly sampled from $[0,T]$. The expectations are taken over the samples $\bm{G}_0 \sim p_{data}$ and $\bm{G}_t \sim p_{0t}(\bm{G}_t|\bm{G}_0)$, where $p_{0t}(\bm{G}_t|\bm{G}_0)$ denotes the transition distribution from $p_0$ to $p_t$ induced by the forward diffusion process.

Unfortunately, we cannot train directly with Eq. (5) since the ground-truth partial scores are not analytically accessible in general. Therefore we derive tractable objectives equivalent to Eq. (5), by leveraging the idea of denoising score matching (Vincent, 2011; Song et al., 2021b) to the partial scores, as follows (see Appendix A.1 for the derivation):

$$\begin{aligned}
&\min_\theta\, \mathbb{E}_t\Big\{\lambda_1(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\theta,t}(\bm{G}_t) - \nabla_{\bm{X}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0)\big\|_2^2\Big\} \\
&\min_\phi\, \mathbb{E}_t\Big\{\lambda_2(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\phi,t}(\bm{G}_t) - \nabla_{\bm{A}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0)\big\|_2^2\Big\}.
\end{aligned}$$

Since the drift coefficient of the forward diffusion process in Eq. (1) is linear, the transition distribution $p_{0t}(\bm{G}_t|\bm{G}_0)$ can be separated in terms of $\bm{X}_t$ and $\bm{A}_t$ as follows:

$$p_{0t}(\bm{G}_t|\bm{G}_0) = p_{0t}(\bm{X}_t|\bm{X}_0)\; p_{0t}(\bm{A}_t|\bm{A}_0). \tag{6}$$

Notably, we can easily sample from the transition distributions of each component, $p_{0t}(\bm{X}_t|\bm{X}_0)$ and $p_{0t}(\bm{A}_t|\bm{A}_0)$, as they are Gaussian distributions whose mean and variance are tractably determined by the coefficients of the forward diffusion process (Särkkä & Solin, 2019). From Eq. (6), we propose new training objectives which are equivalent to Eq. (5) (see Appendix A.2 for the detailed derivation):

$$\begin{aligned}
&\min_\theta\, \mathbb{E}_t\Big\{\lambda_1(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\theta,t}(\bm{G}_t) - \nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0)\big\|_2^2\Big\} \\
&\min_\phi\, \mathbb{E}_t\Big\{\lambda_2(t)\, \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_{\phi,t}(\bm{G}_t) - \nabla_{\bm{A}_t}\log p_{0t}(\bm{A}_t|\bm{A}_0)\big\|_2^2\Big\}
\end{aligned} \tag{7}$$

The expectations in Eq. (7) can be efficiently computed using a Monte Carlo estimate with the samples $(t, \bm{G}_0, \bm{G}_t)$. Note that estimating the partial scores is not equivalent to estimating $\nabla_{\bm{X}_t}\log p_t(\bm{X}_t)$ or $\nabla_{\bm{A}_t}\log p_t(\bm{A}_t)$, the main objective of previous score-based generative models, since estimating the partial scores requires capturing the dependency between $\bm{X}_t$ and $\bm{A}_t$ determined by the joint probability through time. As we can effectively estimate the partial scores by training the time-dependent score-based models with the objectives of Eq. (7), what remains is to find models that can learn the partial scores of the underlying distribution of graphs. Thus we propose new architectures for the score-based models in the next paragraph.
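Before moving on, the sketch below makes this Monte Carlo estimate concrete for the VP-SDE case: the Gaussian transition of Eq. (6) yields the closed-form kernel score $\nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0) = -\bm{\epsilon}_x/\sigma_t$ (and likewise for $\bm{A}_t$), so Eq. (7) reduces to a denoising regression. The linear noise schedule, the $\lambda_i(t) = \sigma_t^2$ weighting, and all function names are our own assumptions, not the paper's exact configuration.

```python
import torch

def vp_marginal_stats(t, beta_0=0.1, beta_1=20.0):
    """Mean coefficient and std of the VP-SDE transition kernel with a
    linear schedule beta(t) = beta_0 + t * (beta_1 - beta_0), t in [0, 1]."""
    log_coeff = -0.25 * t ** 2 * (beta_1 - beta_0) - 0.5 * t * beta_0
    return torch.exp(log_coeff), torch.sqrt(1.0 - torch.exp(2.0 * log_coeff))

def perturb(X0, A0, t):
    """Sample (X_t, A_t) from the factorized transition of Eq. (6) and
    return the closed-form kernel scores used as regression targets."""
    m, s = vp_marginal_stats(t)
    eps_x, eps_a = torch.randn_like(X0), torch.randn_like(A0)
    return m * X0 + s * eps_x, m * A0 + s * eps_a, -eps_x / s, -eps_a / s

def gdss_loss(score_x, score_a, X0, A0, eps=1e-5):
    """Monte Carlo estimate of the two objectives in Eq. (7) for a batch.
    theta and phi have disjoint parameters, so summing the two losses and
    minimizing jointly does not change either minimizer."""
    t = torch.rand(X0.shape[0], 1, 1) * (1.0 - eps) + eps  # t ~ U(eps, 1)
    Xt, At, target_x, target_a = perturb(X0, A0, t)
    _, sigma = vp_marginal_stats(t)
    loss_x = ((score_x(Xt, At, t) - target_x) ** 2).sum(dim=(1, 2))
    loss_a = ((score_a(Xt, At, t) - target_a) ** 2).sum(dim=(1, 2))
    return (sigma.view(-1) ** 2 * (loss_x + loss_a)).mean()  # lambda = sigma^2
```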

Permutation-equivariant Score-based Model

Now we propose new architectures for the time-dependent score-based models that can capture the dependencies of $\bm{X}_t$ and $\bm{A}_t$ through time, based on graph neural networks (GNNs). First, we present the score-based model $\bm{s}_{\phi,t}$ to estimate $\nabla_{\bm{A}_t}\log p_t(\bm{X}_t, \bm{A}_t)$, which has the same dimensionality as $\bm{A}_t$. We utilize the graph multi-head attention (Baek et al., 2021) to distinguish important relations between nodes, and further leverage the higher-order adjacency matrices to represent the long-range dependencies as follows:

$$\bm{s}_{\phi,t}(\bm{G}_t) = \mathrm{MLP}\left(\left[\left\{\mathrm{GMH}\left(\bm{H}_i, \bm{A}_t^p\right)\right\}_{i=0,\,p=1}^{K,\,P}\right]\right), \tag{8}$$

where $\bm{A}_t^p$ are the higher-order adjacency matrices, $\bm{H}_{i+1} = \mathrm{GNN}(\bm{H}_i, \bm{A}_t)$ with $\bm{H}_0 = \bm{X}_t$ given, $[\cdot]$ denotes the concatenation operation, GMH denotes the graph multi-head attention block, and $K$ denotes the number of GMH layers. We also present the score-based model $\bm{s}_{\theta,t}$ to estimate $\nabla_{\bm{X}_t}\log p_t(\bm{X}_t, \bm{A}_t)$, which has the same dimensionality as $\bm{X}_t$, where we use multiple layers of GNNs to learn the partial scores from the node representations as follows:

$$\bm{s}_{\theta,t}(\bm{G}_t) = \mathrm{MLP}\left(\left[\left\{\bm{H}_i\right\}_{i=0}^{L}\right]\right), \tag{9}$$

where $\bm{H}_{i+1} = \mathrm{GNN}(\bm{H}_i, \bm{A}_t)$ with $\bm{H}_0 = \bm{X}_t$ given, and $L$ denotes the number of GNN layers. Here GMH layers can be used instead of simple GNN layers with additional computation costs, which we analyze further in Section 4.3. The architectures of the score-based models are illustrated in Figure 5 of the Appendix. Moreover, following Song & Ermon (2020), we incorporate the time information into the score-based models by scaling the output of the models with the standard deviation of the transition distribution at time $t$.
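The following is a deliberately simplified stand-in for the node-score network of Eq. (9): GMH blocks and higher-order adjacencies are replaced by plain row-normalized message passing, and the time conditioning is realized by dividing the output by $\sigma_t$, reusing the hypothetical `vp_marginal_stats` helper from the training sketch above. It is a sketch of the idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: H' = tanh((D^-1 A) H W)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, H, A):
        A_hat = A / A.sum(-1, keepdim=True).clamp(min=1.0)  # row-normalize
        return torch.tanh(self.lin(A_hat @ H))

class NodeScoreNet(nn.Module):
    """Simplified s_theta of Eq. (9): concatenate all intermediate node
    representations {H_i}, apply a shared MLP, then scale by 1/sigma_t."""
    def __init__(self, d_feat, d_hid=32, n_layers=3):
        super().__init__()
        dims = [d_feat] + [d_hid] * n_layers
        self.gnns = nn.ModuleList(
            [SimpleGNNLayer(dims[i], dims[i + 1]) for i in range(n_layers)])
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), d_hid), nn.ELU(), nn.Linear(d_hid, d_feat))

    def forward(self, X, A, t):
        Hs = [X]
        for gnn in self.gnns:
            Hs.append(gnn(Hs[-1], A))
        _, sigma = vp_marginal_stats(t)        # time conditioning
        return self.mlp(torch.cat(Hs, dim=-1)) / sigma
```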

Note that since the message-passing operations of GNNs and the attention function used in GMH are permutation-equivariant (Keriven & Peyré, 2019), the proposed score-based models are also equivariant, and thereby, from the result of Niu et al. (2020), the log-likelihood implicitly defined by the models is guaranteed to be permutation-invariant.
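Under the simplified sketch above, this equivariance is easy to check numerically: permuting the nodes of the input permutes the estimated partial score in exactly the same way.

```python
# Sanity check of permutation equivariance for the sketched network.
N, F = 8, 4
X, A = torch.randn(1, N, F), torch.rand(1, N, N)
A = 0.5 * (A + A.transpose(1, 2))                   # symmetrize adjacency
P = torch.eye(N)[torch.randperm(N)].unsqueeze(0)    # random permutation matrix
net, t = NodeScoreNet(F), torch.full((1, 1, 1), 0.5)
out_of_perm = net(P @ X, P @ A @ P.transpose(1, 2), t)
perm_of_out = P @ net(X, A, t)
assert torch.allclose(out_of_perm, perm_of_out, atol=1e-5)
```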

3.3 Solving the System of Reverse-time SDEs 

In order to use the reverse-time diffusion process as a generative model, it requires simulating the system of reverse-time SDEs in Eq. (3), which can be approximated using the trained score-based models $\bm{s}_{\theta,t}$ and $\bm{s}_{\phi,t}$ as follows:

$$\begin{cases}
\mathrm{d}\bm{X}_t = \underbrace{\mathbf{f}_{1,t}(\bm{X}_t)\,\mathrm{d}\bar{t} + g_{1,t}\,\mathrm{d}\bar{\mathbf{w}}_1}_{F}\;\underbrace{-\,g_{1,t}^2\,\bm{s}_{\theta,t}(\bm{X}_t, \bm{A}_t)\,\mathrm{d}\bar{t}}_{S} \\
\mathrm{d}\bm{A}_t = \underbrace{\mathbf{f}_{2,t}(\bm{A}_t)\,\mathrm{d}\bar{t} + g_{2,t}\,\mathrm{d}\bar{\mathbf{w}}_2}_{F}\;\underbrace{-\,g_{2,t}^2\,\bm{s}_{\phi,t}(\bm{X}_t, \bm{A}_t)\,\mathrm{d}\bar{t}}_{S}
\end{cases} \tag{10}$$

However, solving the system of two diffusion processes that are interdependently tied by the partial scores brings about another difficulty. Thus we propose a novel integrator, Symmetric Splitting for System of SDEs (S4) to simulate the system of reverse-time SDEs, that is efficient yet accurate, inspired by the Symmetric Splitting CLD Sampler (SSCS) (Dockhorn et al., 2021) and the Predictor-Corrector Sampler (PC sampler) (Song et al., 2021b).

Specifically, at each discretized time step $t$, the S4 solver consists of three steps: the score computation, the correction, and the prediction. First, S4 computes the estimates of the partial scores with respect to the predicted $\bm{G}_t$, using the score-based models $\bm{s}_{\theta,t}$ and $\bm{s}_{\phi,t}$; the computed partial scores are later used for both the correction and the prediction steps. After the score computation, we perform the correction step by leveraging a score-based MCMC method, for example Langevin MCMC (Parisi, 1981), in order to obtain a calibrated sample $\bm{G}'_t$ from $\bm{G}_t$. Here we exploit the precomputed partial scores for the score-based MCMC approach. Then what remains is to predict the solution for the next time step $t - \delta t$, going backward in time.

The prediction of the state at time $t'$ follows the marginal distribution $p_{t'}$ described by the Fokker-Planck equation induced by Eq. (10), where the Fokker-Planck operators $\hat{\mathcal{L}}^*_F$ and $\hat{\mathcal{L}}^*_S$ correspond to the $F$-term and the $S$-term, respectively (details of the Fokker-Planck operators are given in Appendix A.3). Inspired by Dockhorn et al. (2021), we formalize an intractable solution to Eq. (10) with the classical propagator $e^{t(\hat{\mathcal{L}}^*_F + \hat{\mathcal{L}}^*_S)}$, which sheds light on finding an efficient prediction method. From the result of the symmetric Trotter theorem (Trotter, 1959; Strang, 1968), the propagation of the calibrated state $\bm{G}'_t$ from time $t$ to $t - \delta t$ following the dynamics of Eq. (10) can be approximated by applying $e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\, e^{\delta t\,\hat{\mathcal{L}}^*_S}\, e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}$ to $\bm{G}'_t$. Observing the operators individually, the action of the first $e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}$ describes the dynamics of the $F$-term in Eq. (10) from time $t$ to $t - \delta t/2$, which is equal to sampling from the transition distribution of the forward diffusion in Eq. (1) as follows (see Appendix A.4 and A.5 for the derivation of the action and the transition distribution):

$$e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\,\bm{G} = \tilde{\bm{G}} \sim p_{t,\,t-\delta t/2}(\tilde{\bm{G}}\,|\,\bm{G}). \tag{11}$$

On the other hand, the action of $e^{\delta t\,\hat{\mathcal{L}}^*_S}$ is not analytically accessible, so we approximate it with a simple Euler method (EM) that solves the ODE corresponding to the $S$-term in Eq. (10). Here we use the precomputed partial scores again, which is justified since the first $e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}$ acts on a sufficiently small half-step $\delta t/2$. Lastly, the action of the remaining $e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}$ corresponds to sampling from the transition distribution from time $t - \delta t/2$ to $t - \delta t$, which results in the approximated solution $\bm{G}_{t-\delta t}$. We provide the pseudo-code for the S4 solver in Algorithm 1 of the Appendix. To obtain a more accurate solution at additional computational cost, one might consider using a higher-order integrator such as the Runge-Kutta method to approximate the action of the operator $e^{\delta t\,\hat{\mathcal{L}}^*_S}$, and further leverage HMC (Neal, 2012) for the correction step instead of Langevin MCMC.
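Putting the three steps together, a schematic of one S4 step is sketched below. The Langevin step size `eps`, the VP noise schedule `beta`, and the helper `half_step` (which samples the forward transition kernel over a half step, i.e., the action of $e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}$ in Eq. (11)) are all our own placeholders, not the paper's exact implementation.

```python
import torch

def s4_step(X, A, t, dt, eps, beta, score_x, score_a, half_step):
    """One S4 step from time t to t - dt (schematic sketch)."""
    # (1) Score computation: one forward pass per network, reused below.
    sx, sa = score_x(X, A, t), score_a(X, A, t)

    # (2) Correction: a single Langevin MCMC step with precomputed scores.
    X = X + eps * sx + (2 * eps) ** 0.5 * torch.randn_like(X)
    A = A + eps * sa + (2 * eps) ** 0.5 * torch.randn_like(A)

    # (3) Prediction by symmetric splitting:
    #     half-step forward transition (the F-term), ...
    X, A = half_step(X, t, dt / 2), half_step(A, t, dt / 2)
    #     ... a full Euler step on the score term (the S-term),
    #     reusing the precomputed partial scores, ...
    g2 = beta(t)  # g_t^2 for the VP-SDE (our assumption)
    X, A = X + g2 * sx * dt, A + g2 * sa * dt
    #     ... and the remaining half-step forward transition.
    X, A = half_step(X, t - dt / 2, dt / 2), half_step(A, t - dt / 2, dt / 2)
    return X, A
```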

Note that although the S4 and PC samplers both carry out prediction and correction steps, the S4 solver is far more efficient in terms of computation: compared to the PC sampler, S4 requires half the number of forward passes through the score-based models, which dominate the computational cost of solving the SDEs. Moreover, the proposed S4 solver can be used to solve a general system of SDEs, including systems with mixed types of SDEs such as the Variance Exploding (VE) SDE and the Variance Preserving (VP) SDE (Song et al., 2021b), whereas SSCS is limited to solving a specific type of SDE, namely CLD.

Table 1: Generation results on the generic graph datasets. We report the MMD distances between the test datasets and generated graphs. Best results are highlighted in bold (smaller is better). The results of the baselines for the Ego-small and Community-small datasets are taken from Niu et al. (2020) and Luo et al. (2021). Hyphen (-) denotes out-of-resources (runs taking more than 10 days) or not applicable due to memory issues. Additional table markers denote our own implementation and indicate unreproducible results, respectively. Due to the space limitation, we provide the standard deviations in Appendix D.1.
Ego-small (Real, $4 \le |V| \le 18$) and Community-small (Synthetic, $12 \le |V| \le 20$):

| Type | Method | Ego: Deg. | Ego: Clus. | Ego: Orbit | Ego: Avg. | Comm.: Deg. | Comm.: Clus. | Comm.: Orbit | Comm.: Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Autoreg. | DeepGMG | 0.040 | 0.100 | 0.020 | 0.053 | 0.220 | 0.950 | 0.400 | 0.523 |
| Autoreg. | GraphRNN | 0.090 | 0.220 | 0.003 | 0.104 | 0.080 | 0.120 | 0.040 | 0.080 |
| Autoreg. | GraphAF | 0.03 | 0.11 | **0.001** | 0.047 | 0.18 | 0.20 | 0.02 | 0.133 |
| Autoreg. | GraphDF | 0.04 | 0.13 | 0.01 | 0.060 | 0.06 | 0.12 | 0.03 | 0.070 |
| One-shot | GraphVAE | 0.130 | 0.170 | 0.050 | 0.117 | 0.350 | 0.980 | 0.540 | 0.623 |
| One-shot | GNF | 0.030 | 0.100 | **0.001** | 0.044 | 0.200 | 0.200 | 0.110 | 0.170 |
| One-shot | EDP-GNN | 0.052 | 0.093 | 0.007 | 0.051 | 0.053 | 0.144 | 0.026 | 0.074 |
| One-shot | GDSS-seq (Ours) | 0.032 | 0.027 | 0.011 | 0.023 | 0.090 | 0.123 | **0.007** | 0.073 |
| One-shot | GDSS (Ours) | **0.021** | **0.024** | 0.007 | **0.017** | **0.045** | **0.086** | **0.007** | **0.046** |

Enzymes (Real, $10 \le |V| \le 125$) and Grid (Synthetic, $100 \le |V| \le 400$):

| Type | Method | Enz.: Deg. | Enz.: Clus. | Enz.: Orbit | Enz.: Avg. | Grid: Deg. | Grid: Clus. | Grid: Orbit | Grid: Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Autoreg. | DeepGMG | - | - | - | - | - | - | - | - |
| Autoreg. | GraphRNN | **0.017** | 0.062 | 0.046 | 0.042 | **0.064** | 0.043 | **0.021** | **0.043** |
| Autoreg. | GraphAF | 1.669 | 1.283 | 0.266 | 1.073 | - | - | - | - |
| Autoreg. | GraphDF | 1.503 | 1.061 | 0.202 | 0.922 | - | - | - | - |
| One-shot | GraphVAE | 1.369 | 0.629 | 0.191 | 0.730 | 1.619 | **0.0** | 0.919 | 0.846 |
| One-shot | GNF | - | - | - | - | - | - | - | - |
| One-shot | EDP-GNN | 0.023 | 0.268 | 0.082 | 0.124 | 0.455 | 0.238 | 0.328 | 0.340 |
| One-shot | GDSS-seq (Ours) | 0.099 | 0.225 | 0.010 | 0.111 | 0.171 | 0.011 | 0.223 | 0.135 |
| One-shot | GDSS (Ours) | 0.026 | **0.061** | **0.009** | **0.032** | 0.111 | 0.005 | 0.070 | 0.062 |
| Solver | Comm.-small: Deg. | Clus. | Orbit | Avg. | Time (s) | Enzymes: Deg. | Clus. | Orbit | Avg. | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| EM | 0.055 | 0.133 | 0.017 | 0.068 | 29.64 | 0.060 | 0.581 | 0.120 | 0.254 | 153.58 |
| Reverse | 0.058 | 0.125 | 0.016 | 0.066 | 29.75 | 0.057 | 0.550 | 0.112 | 0.240 | 155.06 |
| EM + Langevin | 0.045 | 0.086 | 0.007 | 0.046 | 59.93 | 0.028 | 0.062 | 0.010 | 0.033 | 308.42 |
| Rev. + Langevin | 0.045 | 0.086 | 0.007 | 0.046 | 59.40 | 0.028 | 0.064 | 0.009 | 0.034 | 310.35 |
| S4 (Ours) | 0.042 | 0.101 | 0.007 | 0.050 | 30.93 | 0.026 | 0.061 | 0.009 | 0.032 | 157.57 |
Figure 3: (Left) Complexity of the score-based models measured by the Frobenius norm of the Jacobian of the model. We compare GDSS (green) against GDSS-seq (red), where the solid and dotted lines denote the models estimating the partial scores with respect to $\bm{X}$ and $\bm{A}$, respectively. (Right) Comparison between fixed-step-size SDE solvers. We measure the time for the generation of 128 graphs.

4 Experiments

We experimentally validate the performance of our method on the generation of generic graphs as well as molecular graphs.

4.1 Generic Graph Generation 

To verify that GDSS is able to generate graphs that follow the underlying data distribution, we evaluate our method on generic graph generation tasks with various datasets.

Experimental Setup

We first validate GDSS by evaluating the quality of the generated samples on four generic graph datasets, including synthetic and real-world graphs of varying sizes: (1) Ego-small, 200 small ego graphs drawn from the larger Citeseer network dataset (Sen et al., 2008), (2) Community-small, 100 randomly generated community graphs, (3) Enzymes, 587 protein graphs representing the protein tertiary structures of the enzymes from the BRENDA database (Schomburg et al., 2004), and (4) Grid, 100 standard 2D grid graphs. For a fair comparison, we follow the experimental and evaluation setting of You et al. (2018b) with the same train/test split. We use the maximum mean discrepancy (MMD) to compare the distributions of graph statistics between the same number of generated and test graphs. Following You et al. (2018b), we measure the distributions of degree, clustering coefficient, and the number of occurrences of orbits with 4 nodes. Note that we use the Gaussian Earth Mover's Distance (EMD) kernel to compute the MMDs instead of the total variation (TV) distance used in Liao et al. (2019), since the TV distance leads to an indefinite kernel and undefined behavior (O'Bray et al., 2021). Please see Appendix C.2 for more details.
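For reference, a minimal sketch of the MMD computation with the Gaussian EMD kernel is given below, applied to 1-D graph statistics such as normalized degree histograms over a common support; the helper names and the bandwidth `sigma` are our own choices, not the exact evaluation code.

```python
import numpy as np

def gaussian_emd_kernel(h1, h2, sigma=1.0):
    """k(h1, h2) = exp(-EMD(h1, h2)^2 / (2 sigma^2)). For normalized 1-D
    histograms on a common support, EMD is the L1 distance between CDFs."""
    emd = np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()
    return np.exp(-emd ** 2 / (2 * sigma ** 2))

def mmd_squared(set1, set2, kernel=gaussian_emd_kernel):
    """Biased MMD^2 estimate between two sets of histograms."""
    k_xx = np.mean([kernel(x, y) for x in set1 for y in set1])
    k_yy = np.mean([kernel(x, y) for x in set2 for y in set2])
    k_xy = np.mean([kernel(x, y) for x in set1 for y in set2])
    return k_xx + k_yy - 2.0 * k_xy
```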

Table 2: Generation results on the QM9 and ZINC250k datasets. Results are the means of 3 different runs, and the best results are highlighted in bold. Values denoted by * are taken from the respective original papers. Other results are obtained by running open-source codes. Val. w/o corr. denotes the Validity w/o correction metric, and values that do not exceed 50% are underlined. Due to the space limitation, we provide the results of validity, uniqueness, and novelty as well as the standard deviations in Appendix D.2.
QM9 ZINC250k
Method Val. w/o corr. (%)\uparrow NSPDK\downarrow FCD\downarrow Time (s)\downarrow Val. w/o corr. (%)\uparrow NSPDK\downarrow FCD\downarrow Time (s)\downarrow
Autoreg. GraphAF (Shi et al., 2020) 67* 0.020 5.268 2.52e3 68* 0.044 16.289 5.80e3
Autoreg. GraphAF+FC 74.43 0.021 5.625 2.55e3 68.47 0.044 16.023 6.02e3
Autoreg. GraphDF (Luo et al., 2021) 82.67* 0.063 10.816 5.35e4 89.03* 0.176 34.202 6.03e4
Autoreg. GraphDF+FC 93.88 0.064 10.928 4.91e4 90.61 0.177 33.546 5.54e4
One-shot MoFlow (Zang & Wang, 2020) 91.36 0.017 4.467 4.60 63.11 0.046 20.931 2.45e1
One-shot EDP-GNN (Niu et al., 2020) 47.52 0.005 2.680 4.40e3 82.97 0.049 16.737 9.09e3
One-shot GraphEBM (Liu et al., 2021b) 8.22 0.030 6.143 3.71e1 5.29 0.212 35.471 5.46e1
One-shot GDSS-seq (Ours) 94.47 0.010 4.004 1.13e2 92.39 0.030 16.847 2.02e3
One-shot GDSS (Ours) 95.72 0.003 2.900 1.14e2 97.01 0.019 14.656 2.02e3
Figure 4: Visualization of the generated molecules with maximum Tanimoto similarity to the molecule from the dataset. The top row shows QM9 molecules while the bottom row shows ZINC250k molecules. For each generated molecule, we display the similarity value at the bottom. The pairwise Tanimoto similarity is calculated based on the standard Morgan fingerprints with radius 2 and 1024 bits.
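
The similarity values in Figure 4 can be reproduced with standard RDKit calls; the following sketch assumes the molecules are given as SMILES strings.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def max_tanimoto(gen_smiles, train_smiles):
    # Standard Morgan fingerprints with radius 2 and 1024 bits, as in Figure 4.
    def fp(s):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=1024)
    train_fps = [fp(s) for s in train_smiles]
    gen_fp = fp(gen_smiles)
    # Maximum pairwise Tanimoto similarity to the training set.
    return max(DataStructs.TanimotoSimilarity(gen_fp, t) for t in train_fps)
```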

Implementation Details and Baselines

We compare our proposed method against the following deep generative models. GraphVAE (Simonovsky & Komodakis, 2018) is a one-shot VAE-based model. DeepGMG (Li et al., 2018a) and GraphRNN (You et al., 2018b) are autoregressive RNN-based models. GNF (Liu et al., 2019) is a one-shot flow-based model. GraphAF (Shi et al., 2020) is an autoregressive flow-based model. EDP-GNN (Niu et al., 2020) is a one-shot score-based model. GraphDF (Luo et al., 2021) is an autoregressive flow-based model that utilizes discrete latent variables. For GDSS, we consider three types of SDEs introduced by Song et al. (2021b), namely VESDE, VPSDE, and sub-VP SDE, for the diffusion processes of each component, and either use the PC sampler or the S4 solver to solve the system of SDEs. We provide further implementation details of the baselines and our GDSS in Appendix C.2.

Results

Table 1 shows that the proposed GDSS significantly outperforms all the baseline one-shot generative models including EDP-GNN, and also outperforms the autoregressive baselines on most of the datasets, except Grid. Moreover, GDSS shows competitive performance to the state-of-the-art autoregressive model GraphRNN on generating large graphs, i.e. Grid dataset, for which GDSS is the only method that achieves similar performance. We observe that EDP-GNN, a previous score-based generation model for graphs, completely fails in generating large graphs. The MMD results demonstrate that GDSS can effectively capture the local characteristics of the graphs, which is possible due to modeling the dependency between nodes and edges. We visualize the generated graphs of GDSS in Appendix E.1.

Complexity of Learning Partial Scores

Learning the partial score \nabla_{\bm{A}_t}\log p_t(\bm{X}_t,\bm{A}_t) regarding both \bm{X}_t and \bm{A}_t may seem difficult, compared to learning \nabla_{\bm{A}_t}\log p_t(\bm{X}_0,\bm{A}_t) that concerns only \bm{A}_t. We empirically demonstrate this is not the case, by measuring the complexity of the score-based models that estimate the partial scores. The complexity of the models can be measured via the squared Frobenius norm of their Jacobians \mathcal{J}_F(t), and we compare the models of GDSS and GDSS-seq trained on the Ego-small dataset. As shown in Figure 3, the complexity of GDSS estimating \nabla_{\bm{A}_t}\log p_t(\bm{X}_t,\bm{A}_t) is significantly smaller than the complexity of GDSS-seq estimating \nabla_{\bm{A}_t}\log p_t(\bm{X}_0,\bm{A}_t), and similarly for \bm{X}_t. The result stresses the importance of modeling the node-edge dependency, since whether to model the dependency is the only difference between GDSS and GDSS-seq. Moreover, the generation result of GDSS compared to GDSS-seq in Table 1 verifies that the reduced complexity enables the effective generation of larger graphs.
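
The squared Frobenius norm of the Jacobian can be estimated without materializing the full Jacobian. Below is a minimal PyTorch sketch using a Hutchinson-style probe, which relies on the identity \mathbb{E}_v\|J^\top v\|_2^2 = \mathrm{tr}(JJ^\top) = \|J\|_F^2 for standard normal v; `score_fn` is a placeholder for a trained score-based model, not the released implementation.

```python
import torch

def jacobian_frobenius_sq(score_fn, x, n_probes=16):
    # Hutchinson-style estimate of ||J||_F^2 for J = d score_fn(x) / dx.
    x = x.detach().requires_grad_(True)
    out = score_fn(x)
    total = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(out)                      # random probe vector
        (g,) = torch.autograd.grad(out, x, grad_outputs=v, retain_graph=True)
        total += g.pow(2).sum().item()                 # ||J^T v||_2^2
    return total / n_probes
```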

4.2 Molecule Generation

To show that GDSS is able to capture the complex dependency between nodes and edges, we further evaluate our method for molecule generation tasks.

Experimental Setup

We use two molecular datasets, QM9 (Ramakrishnan et al., 2014) and ZINC250k (Irwin et al., 2012), whose statistics are provided in Table 6 in the Appendix. Following previous works (Shi et al., 2020; Luo et al., 2021), the molecules are kekulized by the RDKit library (Landrum et al., 2016) with hydrogen atoms removed. We evaluate the quality of the 10,000 generated molecules with the following metrics. Fréchet ChemNet Distance (FCD) (Preuer et al., 2018) evaluates the distance between the training and generated sets using the activations of the penultimate layer of ChemNet. Neighborhood subgraph pairwise distance kernel (NSPDK) MMD (Costa & De Grave, 2010) is the MMD between the generated molecules and the test molecules, taking into account both the node and edge features. Note that FCD and NSPDK MMD are salient metrics that assess the ability to learn the distribution of the training molecules, measuring how close the generated molecules lie to the distribution: FCD measures this in chemical space, while NSPDK MMD measures it in the space of graph structures. Validity w/o correction is the fraction of valid molecules without valency correction or edge resampling. This differs from the metric used in Shi et al. (2020) and Luo et al. (2021): following Zang & Wang (2020), we allow atoms to have formal charge when checking their valency, which is more reasonable since formal charges exist in the training molecules. Time measures the time for generating 10,000 molecules in the form of RDKit molecules. We provide further details about the experimental settings, including the hyperparameter search, in Appendix C.3.
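
As a reference point, checking validity without correction amounts to asking RDKit to sanitize each generated molecule as-is. The sketch below is our own simplification; the actual evaluation additionally assigns formal charges following Zang & Wang (2020) before the valency check.

```python
from rdkit import Chem

def validity_wo_correction(mols):
    # Fraction of generated RDKit molecules that pass sanitization
    # (including the valency check) without any valency correction
    # or edge resampling.
    valid = 0
    for mol in mols:
        try:
            Chem.SanitizeMol(mol)
            valid += 1
        except Exception:  # sanitization failure, e.g., a valency violation
            pass
    return valid / len(mols)
```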

Baselines

We compare our GDSS against the following baselines. GraphAF (Shi et al., 2020) is an autoregressive flow-based model. MoFlow (Zang & Wang, 2020) is a one-shot flow-based model. GraphDF (Luo et al., 2021) is an autoregressive flow-based model using discrete latent variables. GraphEBM (Liu et al., 2021b) is a one-shot energy-based model that generates molecules by minimizing energies with Langevin dynamics. We also construct modified versions of GraphAF and GraphDF that consider formal charge (GraphAF+FC and GraphDF+FC) for a fair comparison with the baselines and ours. For GDSS, the choice of the diffusion process and the solver is identical to that of the generic graph generation tasks. We provide further details of the baselines and GDSS in Appendix C.3.

Table 3: Generation results of EDP-GNN using GMH.
Community-small Enzymes
Method Deg. Clus. Orbit Deg. Clus. Orbit
EDP-GNN 0.053 0.144 0.026 0.023 0.268 0.082
EDP-GNN w/ GMH 0.033 0.130 0.035 0.047 0.328 0.051
GDSS (Ours) 0.045 0.086 0.007 0.026 0.061 0.009
Table 4: Generation results of the variants of GDSS.
ZINC250k
Method Val. w/o corr. (%) NSPDK FCD Time (s)
GDSS-discrete 53.21 0.045 22.925 6.07e3
GDSS w/ GMH in \bm{s}_{\theta,t} 94.39 0.015 12.388 2.44e3
GDSS 97.01 0.019 14.656 2.02e3

Results

As shown in Table 2, GDSS achieves the highest validity when the post-hoc valency correction is disabled, demonstrating that GDSS is able to proficiently learn the chemical valency rule that requires capturing the node-edge relationship. Moreover, GDSS significantly outperforms all the baselines in NSPDK MMD and most of the baselines in FCD, showing that the generated molecules of GDSS lie close to the data distribution both in the space of graphs and the chemical space. The superior performance of GDSS on molecule generation tasks verifies the effectiveness of our method for learning the underlying distribution of graphs with multiple node and edge types. We further visualize the generated molecules in Figure 4 and Figure 10 of Appendix E.2, which demonstrate that GDSS is capable of generating molecules that share a large substructure with the molecules in the training set, whereas the generated molecules of the baselines share a smaller structural portion or even completely differ from the training molecules.

Time Efficiency

To validate the practicality of GDSS, we compare the inference time for generating molecules with the baselines. As shown in Table 2, GDSS not only outperforms the autoregressive models in terms of generation quality, but also in terms of time efficiency, showing a 450\times speedup on the QM9 dataset compared to GraphDF. Moreover, GDSS and GDSS-seq require significantly smaller generation time compared to EDP-GNN, showing that modeling the transformation of graphs to noise and vice versa as a continuous-time diffusion process is far more efficient than the discrete-step noise perturbation used in EDP-GNN.

4.3 Ablation Studies

We provide an extensive analysis of the proposed GDSS framework from three different perspectives: (1) the necessity of modeling the dependency between \bm{X} and \bm{A}, (2) the effectiveness of S4 compared to other solvers, and (3) further comparison with the variants of EDP-GNN and GDSS.

Necessity of Dependency Modeling

To validate that modeling the node-edge dependency is crucial for graph generation, we compare our proposed methods GDSS and GDSS-seq, since the only difference is that the latter only models the dependency of \bm{A} on \bm{X} (Eq. (3) and Eq. (4)). From the results in Table 1 and Table 2, we observe that GDSS consistently outperforms GDSS-seq in all metrics, which proves that accurately learning the distributions of graphs requires modeling the dependency. Moreover, for molecule generation, the node-edge dependency can be directly measured in terms of validity, and the results verify the effectiveness of GDSS modeling the dependency via the system of SDEs.

Significance of S4 Solver

To validate the effectiveness of the proposed S4 solver, we compare its performance against the non-adaptive stepsize solvers, namely EM and Reverse sampler which are predictor-only methods, and the PC samplers using Langevin MCMC. As shown in the table of Figure 3, S4 significantly outperforms the predictor-only methods, and further outperforms the PC samplers with half the computation time, due to fewer evaluations of the score-based models. We provide more results in Appendix D.3.

Variants of EDP-GNN and GDSS

First, to comprehensively compare EDP-GNN with GDSS, we evaluate the performance of EDP-GNN with GMH layers instead of simple GNN layers. Table 3 shows that using GMH does not necessarily increase the generation quality, and EDP-GNN is still significantly outperformed by our GDSS. Moreover, to verify that the continuous-time diffusion process of GDSS is essential, we compare the performance of the GDSS variants. Table 4 shows that GDSS-discrete, which is our GDSS using discrete-step perturbation as in EDP-GNN instead of the diffusion process, performs poorly on molecule generation tasks with an increased generation time, which reaffirms the significance of the proposed diffusion process for graphs. Furthermore, using GMH instead of GNN for the score-based model \bm{s}_{\theta,t} shows comparable results with GDSS.

5 Conclusion

We presented a novel score-based generative framework for learning the underlying distribution of the graphs, which overcomes the limitations of previous graph generative methods. Specifically, we proposed a novel graph diffusion process via the system of SDEs (GDSS) that transforms both the node features and adjacency to noise and vice-versa, modeling the dependency between them. Further, we derived new training objectives to estimate the gradients of the joint log-density with respect to each component, and presented a novel integrator to efficiently solve the system of SDEs describing the reverse diffusion process. We validated GDSS on the generation of diverse synthetic and real-world graphs including molecules, on which ours outperforms existing generative methods. We pointed out that modeling the dependency between nodes and edges is crucial for learning the distribution of graphs and shed new light on the effectiveness of score-based generative methods for graphs.

Acknowledgements

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub and No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), and the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921). We thank Geon Park for providing the visualization of diffusion in Figure 1, and Jinheon Baek for the suggestions on the experiments.

References

  • Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Baek et al. (2021) Baek, J., Kang, M., and Hwang, S. J. Accurate learning of graph representations with graph multiset pooling. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Brockschmidt et al. (2019) Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
  • Cai et al. (2020) Cai, R., Yang, G., Averbuch-Elor, H., Hao, Z., Belongie, S. J., Snavely, N., and Hariharan, B. Learning gradient fields for shape generation. In ECCV, 2020.
  • Chen et al. (2021) Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In ICLR, 2021.
  • Costa & De Grave (2010) Costa, F. and De Grave, K. Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 26th International Conference on Machine Learning, pp.  255–262. Omnipress; Madison, WI, USA, 2010.
  • Dai et al. (2020) Dai, H., Nazi, A., Li, Y., Dai, B., and Schuurmans, D. Scalable deep generative modeling for sparse graphs. In International Conference on Machine Learning, pp. 2302–2312. PMLR, 2020.
  • De Cao & Kipf (2018) De Cao, N. and Kipf, T. Molgan: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. arXiv:2105.05233, 2021.
  • Dockhorn et al. (2021) Dockhorn, T., Vahdat, A., and Kreis, K. Score-based generative modeling with critically-damped langevin diffusion. arXiv:2112.07068, 2021.
  • Grover et al. (2019) Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2434–2444. PMLR, 2019.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • Hyvärinen (2005) Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–709, 2005.
  • Irwin et al. (2012) Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012.
  • Jeong et al. (2021) Jeong, M., Kim, H., Cheon, S. J., Choi, B. J., and Kim, N. S. Diff-tts: A denoising diffusion model for text-to-speech. arXiv:2104.01409, 2021.
  • Jin et al. (2018) Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp. 2323–2332. PMLR, 2018.
  • Jin et al. (2020) Jin, W., Barzilay, R., and Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, pp. 4839–4848. PMLR, 2020.
  • Keriven & Peyré (2019) Keriven, N. and Peyré, G. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp.  7090–7099, 2019.
  • Kong et al. (2021) Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. In ICLR, 2021.
  • Landrum et al. (2016) Landrum, G. et al. RDKit: Open-source cheminformatics software, 2016. URL http://www.rdkit.org, https://github.com/rdkit/rdkit.
  • Lee et al. (2021) Lee, H., Hyung, E., and Hwang, S. J. Rapid neural architecture search by learning to generate graphs from datasets. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 2021.
  • Li et al. (2018a) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. W. Learning deep generative models of graphs. arXiv:1803.03324, 2018a.
  • Li et al. (2018b) Li, Y., Zhang, L., and Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminformatics, 10(1):33:1–33:24, 2018b.
  • Li et al. (2018c) Li, Y., Zhang, L., and Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminformatics, 10(1):33:1–33:24, 2018c.
  • Liao et al. (2019) Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. Advances in Neural Information Processing Systems, 32:4255–4265, 2019.
  • Liu et al. (2019) Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. In NeurIPS, 2019.
  • Liu et al. (2021a) Liu, M., Luo, Y., Wang, L., Xie, Y., Yuan, H., Gui, S., Yu, H., Xu, Z., Zhang, J., Liu, Y., Yan, K., Liu, H., Fu, C., Oztekin, B. M., Zhang, X., and Ji, S. DIG: A turnkey library for diving into graph deep learning research. Journal of Machine Learning Research, 22(240):1–9, 2021a.
  • Liu et al. (2021b) Liu, M., Yan, K., Oztekin, B., and Ji, S. Graphebm: Molecular graph generation with energy-based models. In Energy Based Models Workshop-ICLR 2021, 2021b.
  • Luo & Hu (2021) Luo, S. and Hu, W. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2021.
  • Luo et al. (2021) Luo, Y., Yan, K., and Ji, S. Graphdf: A discrete flow model for molecular graph generation. International Conference on Machine Learning, 2021.
  • Ma et al. (2018) Ma, T., Chen, J., and Xiao, C. Constrained generation of semantically valid graphs via regularizing variational autoencoders. Advances in Neural Information Processing Systems, 31:7113–7124, 2018.
  • Madhawa et al. (2019) Madhawa, K., Ishiguro, K., Nakago, K., and Abe, M. Graphnvp: An invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600, 2019.
  • Mittal et al. (2021) Mittal, G., Engel, J. H., Hawthorne, C., and Simon, I. Symbolic music generation with diffusion models. In ISMIR, 2021.
  • Neal (2012) Neal, R. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 06 2012.
  • Niu et al. (2020) Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In AISTATS, 2020.
  • O’Bray et al. (2021) O’Bray, L., Horn, M., Rieck, B., and Borgwardt, K. M. Evaluation metrics for graph generative models: Problems, pitfalls, and practical solutions. arXiv:2106.01098, 2021.
  • Parisi (1981) Parisi, G. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
  • Popova et al. (2019) Popova, M., Shvets, M., Oliva, J., and Isayev, O. Molecularrnn: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372, 2019.
  • Preuer et al. (2018) Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. Fréchet chemnet distance: a metric for generative models for molecules in drug discovery. Journal of chemical information and modeling, 58(9):1736–1741, 2018.
  • Ramakrishnan et al. (2014) Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1(1):1–7, 2014.
  • Schomburg et al. (2004) Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Res., 32(Database-Issue):431–433, 2004.
  • Sen et al. (2008) Sen, P., Namata, G. M., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
  • Shi et al. (2020) Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020.
  • Simonovsky & Komodakis (2018) Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In ICANN, 2018.
  • Song et al. (2021a) Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In ICLR, 2021a.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
  • Song & Ermon (2020) Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In NeurIPS, 2020.
  • Song et al. (2021b) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
  • Strang (1968) Strang, G. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506–517, 1968.
  • Särkkä & Solin (2019) Särkkä, S. and Solin, A. Applied Stochastic Differential Equations. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2019.
  • Trotter (1959) Trotter, H. F. On the product of semi-groups of operators. In Proceedings of the American Mathematical Society, 1959.
  • Vahdat et al. (2021) Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. arXiv:2106.05931, 2021.
  • Vincent (2011) Vincent, P. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 07 2011.
  • Wang et al. (2018) Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., and Guo, M. Graphgan: Graph representation learning with generative adversarial nets. In AAAI, 2018.
  • Xie et al. (2019) Xie, S., Kirillov, A., Girshick, R. B., and He, K. Exploring randomly wired neural networks for image recognition. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 1284–1293. IEEE, 2019.
  • You et al. (2018a) You, J., Liu, B., Ying, R., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp.  6412–6422, 2018a.
  • You et al. (2018b) You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, pp. 5708–5717. PMLR, 2018b.
  • Zang & Wang (2020) Zang, C. and Wang, F. Moflow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  617–626, 2020.

Appendix

Organization

The appendix is organized as follows: We first present the derivations excluded from the main paper due to space limitation in Section A, and explain the details of the proposed score-based graph generation framework in Section B. Then we provide the experimental details including the hyperparameters of the toy experiment, the generic graph generation, and the molecule generation in Section C. Finally, we present additional experimental results and the visualizations of the generated graphs in Section D.

Appendix A Derivations

In this section, we present the detailed derivations of the proposed training objectives described in Section 3.2 and the derivations of our novel S4 solver explained in Section 3.3.

A.1 Deriving the Denoising Score Matching Objectives

The original score matching objective can be written as follows:

\mathbb{E}_{\bm{G}_t}\big\|\bm{s}_\theta(\bm{G}_t,t)-\nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\|^2_2 \;=\; \mathbb{E}_{\bm{G}_t}\big\|\bm{s}_\theta(\bm{G}_t,t)\big\|^2_2 \;-\; 2\,\mathbb{E}_{\bm{G}_t}\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\rangle \;+\; C_1, \quad (12)

where C_1 is a constant that does not depend on \theta. Further, the denoising score matching objective can be written as follows:

\mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_\theta(\bm{G}_t,t)-\nabla_{\bm{X}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\|^2_2 \;=\; \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_\theta(\bm{G}_t,t)\big\|^2_2 \;-\; 2\,\mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\rangle \;+\; C_2, \quad (13)

where C_2 is also a constant that does not depend on \theta. From the following equivalence,

\mathbb{E}_{\bm{G}_t}\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\rangle
= \int_{\bm{G}_t} p(\bm{G}_t)\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\rangle\,\mathrm{d}\bm{G}_t
= \int_{\bm{G}_t}\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t} p_t(\bm{G}_t)\big\rangle\,\mathrm{d}\bm{G}_t
= \int_{\bm{G}_t}\Big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\!\int_{\bm{G}_0} p(\bm{G}_0)\,p_t(\bm{G}_t|\bm{G}_0)\,\mathrm{d}\bm{G}_0\Big\rangle\,\mathrm{d}\bm{G}_t
= \int_{\bm{G}_t}\Big\langle\bm{s}_\theta,\int_{\bm{G}_0} p(\bm{G}_0)\,\nabla_{\bm{X}_t} p_t(\bm{G}_t|\bm{G}_0)\,\mathrm{d}\bm{G}_0\Big\rangle\,\mathrm{d}\bm{G}_t
= \int_{\bm{G}_t}\!\int_{\bm{G}_0} p(\bm{G}_0)\,p(\bm{G}_t|\bm{G}_0)\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\rangle\,\mathrm{d}\bm{G}_0\,\mathrm{d}\bm{G}_t
= \mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\langle\bm{s}_\theta,\nabla_{\bm{X}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\rangle,

we can conclude that the two objectives are equivalent with respect to \theta:

\mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_\theta(\bm{G}_t,t)-\nabla_{\bm{X}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\|^2_2 \;=\; \mathbb{E}_{\bm{G}_t}\big\|\bm{s}_\theta(\bm{G}_t,t)-\nabla_{\bm{X}_t}\log p_t(\bm{G}_t)\big\|^2_2 + C_2 - C_1. \quad (14)

Similarly, computing the gradient with respect to \bm{A}_t, we can show that the following two objectives are also equivalent with respect to \phi:

\mathbb{E}_{\bm{G}_0}\mathbb{E}_{\bm{G}_t|\bm{G}_0}\big\|\bm{s}_\phi(\bm{G}_t,t)-\nabla_{\bm{A}_t}\log p_t(\bm{G}_t|\bm{G}_0)\big\|^2_2 \;=\; \mathbb{E}_{\bm{G}_t}\big\|\bm{s}_\phi(\bm{G}_t,t)-\nabla_{\bm{A}_t}\log p_t(\bm{G}_t)\big\|^2_2 + C_4 - C_3, \quad (15)

where C_3 and C_4 are constants that do not depend on \phi.

A.2 Deriving New Objectives for GDSS

It is enough to show that \nabla_{\bm{X}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0) is equal to \nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0). Using the chain rule, we can derive that \nabla_{\bm{X}_t}\log p_{0t}(\bm{A}_t|\bm{A}_0) = 0:

\frac{\partial\log p_{0t}(\bm{A}_t|\bm{A}_0)}{\partial(\bm{X}_t)_{ij}} \;=\; \text{Tr}\bigg[\nabla_{\bm{A}_t}\log p_{0t}(\bm{A}_t|\bm{A}_0)\,\underbrace{\frac{\partial\bm{A}_t}{\partial(\bm{X}_t)_{ij}}}_{=0}\bigg] \;=\; 0. \quad (16)

Therefore, we can conclude that \nabla_{\bm{X}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0) is equal to \nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0):

\nabla_{\bm{X}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0) \;=\; \nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0) + \underbrace{\nabla_{\bm{X}_t}\log p_{0t}(\bm{A}_t|\bm{A}_0)}_{=0} \;=\; \nabla_{\bm{X}_t}\log p_{0t}(\bm{X}_t|\bm{X}_0). \quad (17)

Similarly, computing the gradient with respect to \bm{A}_t, we can also show that \nabla_{\bm{A}_t}\log p_{0t}(\bm{G}_t|\bm{G}_0) is equal to \nabla_{\bm{A}_t}\log p_{0t}(\bm{A}_t|\bm{A}_0).

A.3 The Action of the Fokker-Planck Operators and the Classical Propagator

Fokker-Planck Operators

Recall the system of reverse-time SDEs of Eq. (10):

\begin{cases} \mathrm{d}\bm{X}_t = \underbrace{\mathbf{f}_{1,t}(\bm{X}_t)\,\mathrm{d}\bar{t} + g_{1,t}\,\mathrm{d}\bar{\mathbf{w}}_1}_{F\text{-term}}\;\underbrace{-\,g^2_{1,t}\,\bm{s}_{\theta,t}(\bm{X}_t,\bm{A}_t)\,\mathrm{d}\bar{t}}_{S\text{-term}} \\ \mathrm{d}\bm{A}_t = \mathbf{f}_{2,t}(\bm{A}_t)\,\mathrm{d}\bar{t} + g_{2,t}\,\mathrm{d}\bar{\mathbf{w}}_2 \;-\; g^2_{2,t}\,\bm{s}_{\phi,t}(\bm{X}_t,\bm{A}_t)\,\mathrm{d}\bar{t} \end{cases} \quad (18)

where the F-terms collect the drift and diffusion parts and the S-terms collect the score parts.

Denoting the marginal joint distribution of Eq. (18) at time t as \tilde{p}_t(\bm{G}_t), the evolution of \tilde{p}_t through time t can be described by a partial differential equation, namely the Fokker-Planck equation, as follows:

\frac{\partial\tilde{p}_t(\bm{G}_t)}{\partial t} \;=\; -\nabla_{\bm{G}_t}\cdot\Big(\mathbf{f}_t(\bm{G}_t)\tilde{p}_t(\bm{G}_t) - \frac{1}{2}g_t^2\,\tilde{p}_t(\bm{G}_t)\nabla_{\bm{G}_t}\log\tilde{p}_t(\bm{G}_t) - g_t^2\,\bm{s}_t(\bm{G}_t)\tilde{p}_t(\bm{G}_t)\Big), \quad (19)

where \bm{s}_t(\bm{G}_t) = (\bm{s}_{\theta,t}(\bm{G}_t), \bm{s}_{\phi,t}(\bm{G}_t)). Then, the Fokker-Planck equation can be represented using the Fokker-Planck operators as follows:

\frac{\partial\tilde{p}_t(\bm{G}_t)}{\partial t} \;=\; (\hat{\mathcal{L}}^*_F + \hat{\mathcal{L}}^*_S)\,\tilde{p}_t(\bm{G}_t), \quad (20)

where the action of the Fokker-Planck operators on a function \bm{A}(\bm{G}_t) is defined as:

\hat{\mathcal{L}}^*_F\,\bm{A}(\bm{G}_t) \coloneqq -\nabla_{\bm{G}_t}\cdot\Big(\mathbf{f}_t(\bm{G}_t)\bm{A}(\bm{G}_t) - \frac{1}{2}g_t^2\,\bm{A}(\bm{G}_t)\nabla_{\bm{G}_t}\log\bm{A}(\bm{G}_t)\Big) \quad (21)
\hat{\mathcal{L}}^*_S\,\bm{A}(\bm{G}_t) \coloneqq -\nabla_{\bm{G}_t}\cdot\Big(-g_t^2\,\bm{s}_t(\bm{G}_t)\bm{A}(\bm{G}_t)\Big). \quad (22)

Classical Propagator

From the Fokker-Planck equation of Eq. (20), we can derive an intractable solution \bar{\bm{G}}_t \coloneqq \bm{G}_{T-t} to the system of reverse-time SDEs in Eq. (18) as follows:

\bar{\bm{G}}_t \;=\; e^{t(\hat{\mathcal{L}}^*_F + \hat{\mathcal{L}}^*_S)}\,\bar{\bm{G}}_0, \quad (23)

which is called the classical propagator. The action of the operator e^{t(\hat{\mathcal{L}}^*_F + \hat{\mathcal{L}}^*_S)} propagates the initial state \bar{\bm{G}}_0 to time t following the dynamics determined by the actions of the Fokker-Planck operators \hat{\mathcal{L}}^*_F and \hat{\mathcal{L}}^*_S, described in Eq. (21) and Eq. (22).

A.4 Symmetric Splitting for the System of SDEs

Let us look at each operator one by one. First, the action of the operator \hat{\mathcal{L}}^*_F on the marginal distribution \tilde{p}_t(\bm{G}_t) corresponds to the diffusion process described by the F-term in Eq. (18) as follows:

\begin{cases} \mathrm{d}\bm{X}_t = \mathbf{f}_{1,t}(\bm{X}_t)\,\mathrm{d}\bar{t} + g_{1,t}\,\mathrm{d}\bar{\mathbf{w}}_1 \\ \mathrm{d}\bm{A}_t = \mathbf{f}_{2,t}(\bm{A}_t)\,\mathrm{d}\bar{t} + g_{2,t}\,\mathrm{d}\bar{\mathbf{w}}_2 \end{cases} \quad (24)

Notice that the diffusion process described by Eq. (24) is similar to the forward diffusion process of Eq. (1), with the slight difference that Eq. (24) is a system of reverse-time SDEs. Therefore, the action of the operator e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F} can be represented by the transition distribution of the forward diffusion process p_{st}(\cdot|\cdot) as follows:

e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\,\bm{G} \;=\; \tilde{\bm{G}} \sim p_{t,\,t-\delta t/2}(\tilde{\bm{G}}|\bm{G}). \quad (25)

We provide the explicit form of the transition distribution in Section A.5 of the Appendix. Furthermore, the operator \hat{\mathcal{L}}^*_S corresponds to the evolution of \bm{X}_t and \bm{A}_t described by the S-term in Eq. (18), which is a system of reverse-time ODEs:

\begin{cases} \mathrm{d}\bm{X}_t = -g^2_{1,t}\,\bm{s}_{\theta,t}(\bm{X}_t,\bm{A}_t)\,\mathrm{d}\bar{t} \\ \mathrm{d}\bm{A}_t = -g^2_{2,t}\,\bm{s}_{\phi,t}(\bm{X}_t,\bm{A}_t)\,\mathrm{d}\bar{t} \end{cases} \quad (26)

Hence, the action of the operator e^{\delta t\hat{\mathcal{L}}^*_S} can be approximated with the simple Euler method for a positive time step \delta t:

e^{\delta t\hat{\mathcal{L}}^*_S}\,\bm{X}_t \approx \bm{X}_t + g^2_{1,t}\,\bm{s}_{\theta,t}(\bm{X}_t,\bm{A}_t)\,\delta t \qquad e^{\delta t\hat{\mathcal{L}}^*_S}\,\bm{A}_t \approx \bm{A}_t + g^2_{2,t}\,\bm{s}_{\phi,t}(\bm{X}_t,\bm{A}_t)\,\delta t, \quad (27)

and we refer to this approximated action as e^{\delta t\mathcal{L}^{Euler}_S}. Using the symmetric Trotter theorem (Trotter, 1959), we can approximate the intractable propagator e^{t(\hat{\mathcal{L}}^*_F+\hat{\mathcal{L}}^*_S)} as follows (Dockhorn et al., 2021):

e^{t(\hat{\mathcal{L}}^*_F+\hat{\mathcal{L}}^*_S)} \approx \Big[e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\,e^{\delta t\hat{\mathcal{L}}^*_S}\,e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\Big]^M + O(M\delta t^3)
= \Big[e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\,e^{\delta t\mathcal{L}^{Euler}_S}\,e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\Big]^M + M\,O(\delta t^2) \quad (28)
= \Big[e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\,e^{\delta t\mathcal{L}^{Euler}_S}\,e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}\Big]^M + O(\delta t), \quad (29)

for a sufficiently large number of steps M and a time step \delta t = t/M. Note that from Eq. (28) and Eq. (29), we can see that the prediction step of S4 has local error \mathcal{O}(\delta t^2) and global error \mathcal{O}(\delta t). From the action of the Fokker-Planck operators and the result of Eq. (29), we can derive the prediction step of the S4 solver described in Section 3.3. We further provide the pseudo-code for the proposed S4 solver in Algorithm 1.

Algorithm 1 Symmetric Splitting for the System of SDEs (S4)

Input: score-based models \bm{s}_{\theta,t} and \bm{s}_{\phi,t}, number of sampling steps M, step size \delta t, transition distributions p_{st}(\cdot|\cdot) of the forward diffusion in Eq. (1), Langevin MCMC step size \alpha, scaling coefficient \epsilon_s
Output: \bm{X}_0, \bm{A}_0: the solution to Eq. (10)

1:  t = T
2:  Sample from the prior distribution: \bm{X}_M, \bm{A}_M \sim p_T
3:  for m = M-1 to 0 do
4:    \bm{S}_X \leftarrow \bm{s}_{\theta,t}(\bm{X}_{m+1}, \bm{A}_{m+1});   \bm{S}_A \leftarrow \bm{s}_{\phi,t}(\bm{X}_{m+1}, \bm{A}_{m+1})   \rhd score computation step
5:    \bm{X}_{m+1} \leftarrow \bm{X}_{m+1} + \frac{\alpha}{2}\bm{S}_X + \epsilon_s\sqrt{\alpha}\,\bm{z}_X,   \bm{z}_X \sim \mathcal{N}(\bm{0}, \bm{I}_X)   \rhd correction step: \bm{X}
6:    \bm{A}_{m+1} \leftarrow \bm{A}_{m+1} + \frac{\alpha}{2}\bm{S}_A + \epsilon_s\sqrt{\alpha}\,\bm{z}_A,   \bm{z}_A \sim \mathcal{N}(\bm{0}, \bm{I}_A)   \rhd correction step: \bm{A}
7:    t' \leftarrow t - \delta t/2
8:    \tilde{\bm{X}}_m \sim p_{t,t'}(\tilde{\bm{X}}_m|\bm{X}_{m+1});   \tilde{\bm{A}}_m \sim p_{t,t'}(\tilde{\bm{A}}_m|\bm{A}_{m+1})   \rhd prediction step: action of e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}
9:    \tilde{\bm{X}}_m \leftarrow \tilde{\bm{X}}_m + g^2_{1,t}\bm{S}_X\delta t;   \tilde{\bm{A}}_m \leftarrow \tilde{\bm{A}}_m + g^2_{2,t}\bm{S}_A\delta t   \rhd prediction step: action of e^{\delta t\hat{\mathcal{L}}^*_S}
10:   t \leftarrow t - \delta t
11:   \bm{X}_m \sim p_{t',t}(\bm{X}_m|\tilde{\bm{X}}_m);   \bm{A}_m \sim p_{t',t}(\bm{A}_m|\tilde{\bm{A}}_m)   \rhd prediction step: action of e^{\frac{\delta t}{2}\hat{\mathcal{L}}^*_F}
12: end for
13: Return: \bm{X}_0, \bm{A}_0
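
For concreteness, a compact PyTorch-style sketch of the loop above is given below. Here `half_step` is a placeholder for the forward transition sampler of Section A.5, `g1_sq`/`g2_sq` return the squared diffusion coefficients g^2_{1,t} and g^2_{2,t}, and the Langevin step size `alpha` is fixed for simplicity rather than derived from the signal-to-noise ratio as in the actual implementation.

```python
import torch

def s4_sample(score_x, score_a, half_step, g1_sq, g2_sq, X, A,
              M=1000, T=1.0, alpha=1e-4, eps_s=0.7):
    # Sketch of Algorithm 1: per iteration, score computation, a Langevin
    # correction on X and A, a half prediction step under the forward
    # transition (action of e^{dt/2 L_F}), an Euler step for the score
    # term (action of e^{dt L_S}), and a final half prediction step.
    dt = T / M
    t = T
    for _ in range(M):
        S_x = score_x(X, A, t)
        S_a = score_a(X, A, t)
        X = X + alpha / 2 * S_x + eps_s * alpha ** 0.5 * torch.randn_like(X)
        A = A + alpha / 2 * S_a + eps_s * alpha ** 0.5 * torch.randn_like(A)
        X = half_step(X, t, dt / 2)   # e.g., the VPSDE/VESDE transition below
        A = half_step(A, t, dt / 2)
        X = X + g1_sq(t) * S_x * dt
        A = A + g2_sq(t) * S_a * dt
        t = t - dt
        X = half_step(X, t, dt / 2)
        A = half_step(A, t, dt / 2)
    return X, A
```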

A.5 Derivation of the transition distribution

We provide the explicit form of the transition distribution for two types of SDEs, namely VPSDE and VESDE (Song et al., 2021b). We consider the transition distribution from time t to t-\delta t for a sufficiently small time step \delta t with \bm{x}_t given, treating the input as a discrete state, i.e., a normal distribution with zero variance.

VPSDE

The process of the VPSDE is given by the following SDE:

\mathrm{d}\bm{x} \;=\; -\frac{1}{2}\beta_t\,\bm{x}\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\mathbf{w}, \quad (30)

where \beta_t = \beta_{min} + t(\beta_{max}-\beta_{min}) for the hyperparameters \beta_{min} and \beta_{max}, and t \in [0,1]. Since Eq. (30) has a linear drift coefficient, the transition distribution of the process is Gaussian, and the mean and covariance can be derived using the results of Eq. (5.50) and (5.51) of Särkkä & Solin (2019) as follows:

p_{t,\,t-\delta t}(\bm{x}_{t-\delta t}|\bm{x}_t) \;=\; \mathcal{N}\big(\bm{x}_{t-\delta t}\;|\;\mu_t\bm{x}_t,\;\bm{\Sigma}_t\big), \quad (31)

where \mu_t = e^{C_t} and \bm{\Sigma}_t = \bm{I} - \bm{I}e^{-2C_t} for

C_t \;=\; \frac{1}{2}\int^{t}_{t-\delta t}\beta_s\,\mathrm{d}s \;=\; \frac{\delta t}{4}\Big(2\beta_{min} + (2t-\delta t)(\beta_{max}-\beta_{min})\Big). \quad (32)
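
A minimal NumPy sketch of sampling from the VPSDE transition of Eq. (31)-(32), usable as the `half_step` placeholder in the S4 sketch above (with the half step size passed as `dt`):

```python
import numpy as np

def vpsde_transition(x_t, t, dt, beta_min=0.1, beta_max=1.0,
                     rng=np.random.default_rng()):
    # Sample x_{t-dt} ~ N(exp(C_t) x_t, (1 - exp(-2 C_t)) I), where
    # C_t = (1/2) * integral of beta_s over [t-dt, t], Eq. (31)-(32).
    c_t = dt / 4.0 * (2.0 * beta_min + (2.0 * t - dt) * (beta_max - beta_min))
    mean = np.exp(c_t) * x_t
    std = np.sqrt(1.0 - np.exp(-2.0 * c_t))
    return mean + std * rng.standard_normal(x_t.shape)
```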

VESDE

The process of the VESDE is given by the following SDE:

\mathrm{d}\bm{x} \;=\; \sigma_{min}\Big(\frac{\sigma_{max}}{\sigma_{min}}\Big)^{t}\sqrt{2\log\frac{\sigma_{max}}{\sigma_{min}}}\;\mathrm{d}\mathbf{w}, \quad (33)

for the hyperparameters \sigma_{min} and \sigma_{max}, and t \in (0,1]. Since Eq. (33) has no drift term, the transition distribution of the process is Gaussian, and the mean and covariance can be derived using the results of Eq. (5.50) and (5.51) of Särkkä & Solin (2019) as follows:

p_{t,\,t-\delta t}(\bm{x}_{t-\delta t}|\bm{x}_t) \;=\; \mathcal{N}\big(\bm{x}_{t-\delta t}\;|\;\bm{x}_t,\;\bm{\Sigma}_t\big), \quad (34)

where \bm{\Sigma}_t = \Sigma_t\bm{I} is given by:

\Sigma_t \;=\; \sigma_{min}^2\Big(\frac{\sigma_{max}}{\sigma_{min}}\Big)^{2t} - \sigma_{min}^2\Big(\frac{\sigma_{max}}{\sigma_{min}}\Big)^{2t-2\delta t}. \quad (35)
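
The analogous sketch for the VESDE transition of Eq. (34)-(35):

```python
import numpy as np

def vesde_transition(x_t, t, dt, sigma_min=0.1, sigma_max=1.0,
                     rng=np.random.default_rng()):
    # Sample x_{t-dt} ~ N(x_t, Sigma_t I) with Sigma_t from Eq. (35).
    ratio = sigma_max / sigma_min
    var = sigma_min ** 2 * (ratio ** (2 * t) - ratio ** (2 * t - 2 * dt))
    return x_t + np.sqrt(var) * rng.standard_normal(x_t.shape)
```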

Appendix B Details for Score-based Graph Generation

In this section, we describe the architectures of our proposed score-based models, and further provide the details of the graph generation procedure through the reverse-time diffusion process.

Figure 5: The architecture of the score-based models of GDSS. (Left) The score-based model s_\phi estimating \nabla_{\bm{A}_t}\log p_t(\bm{X}_t,\bm{A}_t) is composed of GMH blocks and MLP layers. (Right) The score-based model s_\theta estimating \nabla_{\bm{X}_t}\log p_t(\bm{X}_t,\bm{A}_t) is composed of GCN layers and MLP layers. Both models take \bm{X}_t and \bm{A}_t as input and estimate the partial scores with respect to \bm{A}_t and \bm{X}_t, respectively.

B.1 Score-based Model Architecture

We illustrate the architecture of the proposed score-based models s_{\theta,t} and s_{\phi,t} in Figure 5, which are described in Section 3.2.

B.2 Generating Samples from the Reverse Diffusion Process

We first sample N, the number of nodes to be generated, from the empirical distribution of the number of nodes in the training dataset, as done in Li et al. (2018b) and Niu et al. (2020). Then we sample noise of batch size B from the prior distribution, where \bm{X}_T is of dimension N \times F \times B and \bm{A}_T is of dimension N \times N \times B, and simulate the reverse-time system of SDEs in Eq. (10) to obtain the solution \bm{X}_0 and \bm{A}_0. Lastly, we quantize \bm{X}_0 and \bm{A}_0 with an operation depending on the generation task. We provide further details of the generation procedure in Section C, including the hyperparameters.
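
A rough NumPy outline of this procedure is given below; `solve_reverse_sde` and the node-count list are placeholders, the symmetrization of the adjacency noise is our own assumption for illustration, and the 0-1 quantization shown is the one used for generic graphs in Appendix C.2.

```python
import numpy as np

def generate_graphs(num_nodes_train, solve_reverse_sde, B=128, F=16,
                    rng=np.random.default_rng()):
    # 1. Sample N from the empirical distribution of training node counts.
    N = rng.choice(num_nodes_train)
    # 2. Sample X_T (N x F x B) and A_T (N x N x B) from the Gaussian prior;
    #    the adjacency noise is symmetrized over the node dimensions.
    X_T = rng.standard_normal((N, F, B))
    A_T = rng.standard_normal((N, N, B))
    A_T = (A_T + A_T.transpose(1, 0, 2)) / 2
    # 3. Simulate the reverse-time system of SDEs in Eq. (10).
    X_0, A_0 = solve_reverse_sde(X_T, A_T)
    # 4. Quantize: 0-1 adjacency via the indicator 1_{x > 0.5}.
    A_0 = (A_0 > 0.5).astype(np.float64)
    return X_0, A_0
```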

Appendix C Experimental Details

In this section, we explain the details of the experiments including the toy experiments shown in Figure 2, the generic graph generation tasks, and the molecule generation tasks. We describe the implementation details of GDSS and the baselines, and further provide the hyperparameters used in the experiments in Table 5.

C.1 Toy Experiment

Here, we provide the details for the toy experiment presented in Section 3.1. We construct the data distribution as a bivariate Gaussian mixture with the following means and covariances:

p_{data}(\mathbf{x}) \;=\; \mathcal{N}(\mathbf{x}\;|\;\mu_1,\Sigma_1) + \mathcal{N}(\mathbf{x}\;|\;\mu_2,\Sigma_2), \quad (36)
\mu_1 = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -0.5 \\ -0.5 \end{pmatrix}, \quad \Sigma_1 = \Sigma_2 = 0.1^2\begin{pmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{pmatrix}.

For each diffusion method, namely GDSS, GDSS-seq, and independent diffusion, we train two models, where each model estimates the partial score or the score with respect to its variable. We fix the number of linear layers in the model to 20 with residual paths, and set the hidden dimension to 512. We use VPSDE for the diffusion process of each variable with \beta_{min}=0.01 and \beta_{max}=0.05. We train the models for 5000 epochs with batch size 2048 sampled from the data distribution. We generate 2^{13} samples for each diffusion method, shown in Figure 2.
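
The data distribution of Eq. (36) can be sampled with a few lines (equal-weight components, as written above):

```python
import numpy as np

def sample_toy_data(n, rng=np.random.default_rng(0)):
    # Bivariate Gaussian mixture of Eq. (36).
    mu = np.array([[0.5, 0.5], [-0.5, -0.5]])
    cov = 0.1 ** 2 * np.array([[1.0, 0.9], [0.9, 1.0]])
    comp = rng.integers(0, 2, size=n)                    # pick a component
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return mu[comp] + noise
```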

Table 5: Hyperparameters of GDSS used in the generic graph generation tasks and the molecule generation tasks. We provide the hyperparameters of the score-based models (\bm{s}_\theta and \bm{s}_\phi), the diffusion processes (SDE for \bm{X} and \bm{A}), the SDE solver, and the training.
Hyperparameter Ego-small Community-small Enzymes Grid QM9 ZINC250k
\bm{s}_\theta Number of GCN layers 2 3 5 5 2 2
Hidden dimension 32 32 32 32 16 16
\bm{s}_\phi Number of attention heads 4 4 4 4 4 4
Number of initial channels 2 2 2 2 2 2
Number of hidden channels 8 8 8 8 8 8
Number of final channels 4 4 4 4 4 4
Number of GCN layers 5 5 7 7 3 6
Hidden dimension 32 32 32 32 16 16
SDE for \bm{X} Type VP VP VP VP VE VP
Number of sampling steps 1000 1000 1000 1000 1000 1000
\beta_{min} 0.1 0.1 0.1 0.1 0.1 0.1
\beta_{max} 1.0 1.0 1.0 1.0 1.0 1.0
SDE for \bm{A} Type VP VP VE VP VE VE
Number of sampling steps 1000 1000 1000 1000 1000 1000
\beta_{min} 0.1 0.1 0.2 0.2 0.1 0.2
\beta_{max} 1.0 1.0 1.0 0.8 1.0 1.0
Solver Type EM EM + Langevin S4 Rev. + Langevin Rev. + Langevin Rev. + Langevin
SNR - 0.05 0.15 0.1 0.2 0.2
Scale coefficient - 0.7 0.7 0.7 0.7 0.9
Train Optimizer Adam Adam Adam Adam Adam Adam
Learning rate 1\times 10^{-2} 1\times 10^{-2} 1\times 10^{-2} 1\times 10^{-2} 5\times 10^{-3} 5\times 10^{-3}
Weight decay 1\times 10^{-4} 1\times 10^{-4} 1\times 10^{-4} 1\times 10^{-4} 1\times 10^{-4} 1\times 10^{-4}
Batch size 128 128 64 8 1024 1024
Number of epochs 5000 5000 5000 5000 300 500
EMA - - 0.999 0.999 - -

C.2 Generic Graph Generation

The information and the statistics of the graph datasets, namely Ego-small, Community-small, Enzymes and Grid, are shown in Section 4.1 and Table 1. We carefully selected the datasets to have varying sizes and characteristics, for example synthetic graphs, real-world graphs, social graphs or biochemical graphs.

Implementation Details

For a fair evaluation of the generic graph generation task, we follow the standard setting of existing works (You et al., 2018b; Liu et al., 2019; Niu et al., 2020), from the node features to the data splitting. Especially, for the Ego-small and Community-small datasets, we report the means of 15 runs: 3 different runs for each of 5 independently trained models. For the Enzymes and Grid datasets, since the baselines including GraphVAE and EDP-GNN take more than 3 days for a single training, we report the means of 3 different runs. For the baselines, we use the hyperparameters given by the original work, and further search for the best performance if none exists. For GDSS, we initialize the node features as the one-hot encoding of the degrees. We perform a grid search to choose the best signal-to-noise ratio (SNR) in \{0.05, 0.1, 0.15, 0.2\} and the scale coefficient in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}. We select the best MMD with the lowest average of three graph statistics: degree, clustering coefficient, and orbit. Further, we observed that applying the exponential moving average (EMA) (Song & Ermon, 2020) for the larger graph datasets, namely Enzymes and Grid, improves the performance and lowers the variance. After generating the samples by simulating the reverse diffusion process, we quantize the entries of the adjacency matrices with the operator 1_{x>0.5} to obtain the 0-1 adjacency matrix. We empirically found that the entries of the resulting samples after the simulation of the diffusion process do not deviate much from the integer values 0 and 1. We report the hyperparameters used in the experiment in Table 5.

Table 6: Statistics of the QM9 and ZINC250k datasets used in the molecule generation tasks.
Dataset Number of graphs Number of nodes Number of node types Number of edge types
QM9 133,885 1 \leq |V| \leq 9 4 3
ZINC250k 249,455 6 \leq |V| \leq 38 9 3

C.3 Molecule Generation

The statistics of the molecular datasets, namely, QM9 and ZINC250k datasets, are summarized in Table 6.

Implementation Details of GDSS and GDSS-seq

Each molecule is preprocessed into a graph with the node features X \in \{0,1\}^{N\times F} and the adjacency matrix A \in \{0,1,2,3\}^{N\times N}, where N is the maximum number of atoms in a molecule of the dataset, and F is the number of possible atom types. The entries of A indicate the bond types, i.e., single, double, or triple bonds. Following the standard procedure (Shi et al., 2020; Luo et al., 2021), the molecules are kekulized by the RDKit library (Landrum et al., 2016) and hydrogen atoms are removed. As explained in Section 4.2, we make use of the valency correction proposed by Zang & Wang (2020). We perform a grid search to choose the best signal-to-noise ratio (SNR) in \{0.1, 0.2\} and the scale coefficient in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}. Since a low novelty value leads to low FCD and NSPDK MMD values even though it is undesirable and meaningless, we choose the hyperparameters that exhibit the best FCD value among those whose novelty exceeds 85%. After generating the samples by simulating the reverse diffusion process, we quantize the entries of the adjacency matrices to \{0,1,2,3\} by clipping the values: (-\infty, 0.5) to 0, [0.5, 1.5) to 1, [1.5, 2.5) to 2, and [2.5, +\infty) to 3. We empirically observed that the entries of the resulting samples after the simulation of the diffusion process do not deviate much from the integer values 0, 1, 2, and 3. We report the hyperparameters used in the experiment in Table 5.
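
The bin edges above are equivalent to rounding half-up to the nearest bond order and clipping into \{0,1,2,3\}; a minimal sketch:

```python
import numpy as np

def quantize_bonds(A):
    # (-inf, 0.5) -> 0, [0.5, 1.5) -> 1, [1.5, 2.5) -> 2, [2.5, inf) -> 3.
    return np.clip(np.floor(A + 0.5), 0, 3)
```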

Implementation Details of Baselines

We utilize the DIG (Liu et al., 2021a) library to generate molecules with GraphDF and GraphEBM. To conduct the experiments with GraphAF, we use the DIG library for the QM9 dataset and the official code (https://github.com/DeepGraphLearning/GraphAF) for the ZINC250k dataset. We use the official code of MoFlow (https://github.com/calvin-zcx/moflow). We follow the experimental settings reported in the respective original papers and codebases. For EDP-GNN, we use the same preprocessing and postprocessing procedures as in GDSS and GDSS-seq, except that we divide the adjacency matrices by 3 to ensure that the entries are in the range [0, 1] before feeding them into the model. We conduct a grid search to choose the best Langevin step size in {0.01, 0.005, 0.001} and noise scale in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, and apply the same tuning criterion as for GDSS and GDSS-seq.

C.4 Computing Resources

For all the experiments, we utilize PyTorch (Paszke et al., 2019) to implement GDSS and train the score models on TITAN XP, TITAN RTX, GeForce RTX 2080 Ti, and GeForce RTX 3090 GPUs. For the generic graph generation tasks, the time comparison between the SDE solvers in Figure 3 was measured on a single GeForce RTX 2080 Ti GPU with 40 CPU cores. For the molecule generation tasks, the inference time of each model was measured on a single TITAN RTX GPU with 20 CPU cores.

Appendix D Additional Experimental Results

In this section, we provide additional experimental results.

Table 7: Generation results of GDSS on the Ego-small and Community-small datasets. We report the MMD distances between the test datasets and the generated graphs, with standard deviations.
                   Ego-small (Real, 4 ≤ |V| ≤ 18)              Community-small (Synthetic, 12 ≤ |V| ≤ 20)
                   Deg.           Clus.          Orbit          Deg.           Clus.          Orbit
GDSS-seq (Ours)    0.032 ± 0.006  0.027 ± 0.005  0.011 ± 0.007  0.090 ± 0.021  0.123 ± 0.200  0.007 ± 0.003
GDSS (Ours)        0.021 ± 0.008  0.024 ± 0.007  0.007 ± 0.005  0.045 ± 0.028  0.086 ± 0.022  0.007 ± 0.004
Table 8: Generation results on the Enzymes and Grid datasets. We report the MMD distances between the test datasets and the generated graphs, with standard deviations. Best results are highlighted in bold (smaller the better). Hyphen (-) denotes out-of-resources, i.e., a run that takes more than 10 days or is not applicable due to memory issues. † denotes our own implementation.
                   Enzymes (Real, 10 ≤ |V| ≤ 125)               Grid (Synthetic, 100 ≤ |V| ≤ 400)
                   Deg.           Clus.          Orbit          Deg.           Clus.          Orbit
GraphRNN           0.017 ± 0.007  0.062 ± 0.020  0.046 ± 0.031  0.064 ± 0.017  0.043 ± 0.022  0.021 ± 0.007
GraphAF            1.669 ± 0.024  1.283 ± 0.019  0.266 ± 0.007  -              -              -
GraphDF            1.503 ± 0.011  1.061 ± 0.011  0.202 ± 0.002  -              -              -
GraphVAE           1.369 ± 0.020  0.629 ± 0.005  0.191 ± 0.020  1.619 ± 0.007  0.0 ± 0.000    0.919 ± 0.002
EDP-GNN            0.023 ± 0.012  0.268 ± 0.164  0.082 ± 0.078  0.455 ± 0.319  0.238 ± 0.380  0.328 ± 0.278
GDSS-seq (Ours)    0.099 ± 0.083  0.225 ± 0.051  0.010 ± 0.007  0.171 ± 0.134  0.011 ± 0.001  0.223 ± 0.070
GDSS (Ours)        0.026 ± 0.008  0.061 ± 0.010  0.009 ± 0.005  0.111 ± 0.012  0.005 ± 0.000  0.070 ± 0.044

D.1 Generic Graph Generation

We report the standard deviation of the generation results of Table 1 in Table 7 and Table 8.

Table 9: Generation results in MMD, measured using a larger number (1024) of generated samples.
             Ego-small                  Community-small
             Deg.    Clus.   Orbit      Deg.    Clus.   Orbit      Avg.
GraphRNN     0.040   0.050   0.060      0.030   0.010   0.010      0.033
GNF          0.010   0.030   0.001      0.120   0.150   0.020      0.055
EDP-GNN      0.010   0.025   0.003      0.006   0.127   0.018      0.031
GDSS (Ours)  0.023   0.020   0.005      0.029   0.068   0.004      0.030

For a fair evaluation of the generative methods, following You et al. (2018b), we measure MMD between each test dataset and a set of generated graphs containing the same number of graphs as the test dataset. To compare more extensively with the baselines, following Liu et al. (2019) and Niu et al. (2020), we also provide the results of MMD measured between the test datasets and sets of 1024 generated graphs in Table 9. We observe that GDSS still outperforms the baselines on average when a larger number of samples (1024) is used to measure the MMD, and significantly outperforms EDP-GNN.
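For illustration, a simplified sketch of the MMD computation between two sets of graphs is given below. Note that the actual evaluation follows You et al. (2018b) and uses a Gaussian EMD kernel over the graph statistics; for brevity, this sketch substitutes a plain RBF kernel over normalized degree histograms:

```python
# A simplified, hedged sketch of MMD^2 between a test set and a generated set,
# computed over degree histograms with an RBF kernel (the actual evaluation
# uses a Gaussian EMD kernel, following You et al., 2018b).
import numpy as np
import networkx as nx

def degree_histogram(g: nx.Graph, max_degree: int) -> np.ndarray:
    """Normalized degree histogram of a graph, truncated at max_degree."""
    hist = np.bincount([d for _, d in g.degree()],
                       minlength=max_degree + 1)[:max_degree + 1]
    return hist / max(hist.sum(), 1)

def mmd_squared(xs: np.ndarray, ys: np.ndarray, sigma: float = 1.0) -> float:
    """MMD^2 between two sample sets xs (n, d) and ys (m, d)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(xs, xs).mean() + k(ys, ys).mean() - 2 * k(xs, ys).mean()

# Usage: stack per-graph histograms, then compare the two sets.
# xs = np.stack([degree_histogram(g, 20) for g in test_graphs])
# ys = np.stack([degree_histogram(g, 20) for g in generated_graphs])
# print(mmd_squared(xs, ys))
```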

D.2 Molecule Generation

We additionally report the validity, uniqueness, and novelty of the generated molecules as well as the standard deviation of the results in Table 10 and Table 11. Validity is the fraction of the generated molecules that do not violate the chemical valency rule. Uniqueness is the fraction of the valid molecules that are unique. Novelty is the fraction of the valid molecules that are not included in the training set.
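To make these metric definitions concrete, below is a hedged RDKit sketch; the function and variable names are ours, not from the official evaluation code:

```python
# A hedged sketch of validity, uniqueness, and novelty, assuming `generated`
# is a list of RDKit Mol objects (possibly None) and `train_smiles` is a set
# of canonical SMILES strings of the training molecules.
from rdkit import Chem

def vun_metrics(generated, train_smiles):
    valid = []
    for mol in generated:
        if mol is None:
            continue
        try:
            Chem.SanitizeMol(mol)                 # raises if valency rules are violated
            valid.append(Chem.MolToSmiles(mol))   # canonical SMILES as molecule identity
        except Exception:
            continue
    validity = len(valid) / len(generated)
    uniqueness = len(set(valid)) / max(len(valid), 1)
    novelty = sum(s not in train_smiles for s in valid) / max(len(valid), 1)
    return validity, uniqueness, novelty
```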

Table 10: Generation results on the QM9 dataset. Results are the means and the standard deviations of 3 runs. Values denoted by * are taken from the respective original papers. Other results are obtained by running open-source codes. Best results are highlighted in bold.
         Method                        Val. w/o corr. (%) ↑  NSPDK MMD ↓    FCD ↓           Validity (%) ↑  Uniqueness (%) ↑  Novelty (%) ↑
Autoreg. GraphAF (Shi et al., 2020)    67*                   0.020 ± 0.003  5.268 ± 0.403   100.00*         94.51*            88.83*
         GraphAF+FC                    74.43 ± 2.55          0.021 ± 0.003  5.625 ± 0.259   100.00 ± 0.00   88.64 ± 2.37      86.59 ± 1.95
         GraphDF (Luo et al., 2021)    82.67*                0.063 ± 0.001  10.816 ± 0.020  100.00*         97.62*            98.10*
         GraphDF+FC                    93.88 ± 4.76          0.064 ± 0.000  10.928 ± 0.038  100.00 ± 0.00   98.58 ± 0.25      98.54 ± 0.48
One-shot MoFlow (Zang & Wang, 2020)    91.36 ± 1.23          0.017 ± 0.003  4.467 ± 0.595   100.00 ± 0.00   98.65 ± 0.57      94.72 ± 0.77
         EDP-GNN (Niu et al., 2020)    47.52 ± 3.60          0.005 ± 0.001  2.680 ± 0.221   100.00 ± 0.00   99.25 ± 0.05      86.58 ± 1.85
         GraphEBM (Liu et al., 2021b)  8.22 ± 2.24           0.030 ± 0.004  6.143 ± 0.411   100.00 ± 0.00*  97.90 ± 0.14*     97.01 ± 0.17*
         GDSS-seq (Ours)               94.47 ± 1.03          0.010 ± 0.001  4.004 ± 0.166   100.00 ± 0.00   94.62 ± 1.40      85.48 ± 1.01
         GDSS (Ours)                   95.72 ± 1.94          0.003 ± 0.000  2.900 ± 0.282   100.00 ± 0.00   98.46 ± 0.61      86.27 ± 2.29
Table 11: Generation results on the ZINC250k dataset. Results are the means and the standard deviations of 3 runs. Values denoted by * are taken from the respective original papers. Other results are obtained by running open-source codes. Best results are marked as bold.
         Method                        Val. w/o corr. (%) ↑  NSPDK MMD ↓    FCD ↓           Validity (%) ↑  Uniqueness (%) ↑  Novelty (%) ↑
Autoreg. GraphAF (Shi et al., 2020)    68*                   0.044 ± 0.006  16.289 ± 0.482  100.00*         99.10*            100.00*
         GraphAF+FC                    68.47 ± 0.99          0.044 ± 0.005  16.023 ± 0.451  100.00 ± 0.00   98.64 ± 0.69      99.99 ± 0.01
         GraphDF (Luo et al., 2021)    89.03*                0.176 ± 0.001  34.202 ± 0.160  100.00*         99.16*            100.00*
         GraphDF+FC                    90.61 ± 4.30          0.177 ± 0.001  33.546 ± 0.150  100.00 ± 0.00   99.63 ± 0.01      100.00 ± 0.00
One-shot MoFlow (Zang & Wang, 2020)    63.11 ± 5.17          0.046 ± 0.002  20.931 ± 0.184  100.00 ± 0.00   99.99 ± 0.01      100.00 ± 0.00
         EDP-GNN (Niu et al., 2020)    82.97 ± 2.73          0.049 ± 0.006  16.737 ± 1.300  100.00 ± 0.00   99.79 ± 0.08      100.00 ± 0.00
         GraphEBM (Liu et al., 2021b)  5.29 ± 3.83           0.212 ± 0.075  35.471 ± 5.331  99.96 ± 0.02*   98.79 ± 0.15*     100.00 ± 0.00*
         GDSS-seq (Ours)               92.39 ± 2.72          0.030 ± 0.003  16.847 ± 0.097  100.00 ± 0.00   99.94 ± 0.02      100.00 ± 0.00
         GDSS (Ours)                   97.01 ± 0.77          0.019 ± 0.001  14.656 ± 0.680  100.00 ± 0.00   99.64 ± 0.13      100.00 ± 0.00

D.3 Ablation Studies

Table 12: Comparison between fixed step size SDE solvers. We additionally provide the results of the proposed S4 solver on other datasets not included in the table of Figure 3. Best results are highlighted in bold (smaller the better).
Generic graph generation:
                   Ego-small                              Grid
Solver             Deg.   Clus.  Orbit  Time (s)          Deg.   Clus.  Orbit  Time (s)
EM                 0.021  0.024  0.007  40.31             0.278  0.008  0.089  235.36
Reverse            0.032  0.046  0.010  41.55             0.278  0.008  0.089  249.86
EM + Langevin      0.032  0.040  0.009  78.17             0.111  0.005  0.070  483.47
Rev. + Langevin    0.032  0.046  0.021  77.82             0.111  0.005  0.070  500.01
S4 (Ours)          0.032  0.044  0.009  41.25             0.125  0.008  0.076  256.24

Molecule generation:
                   QM9                                                ZINC250k
Solver             Val. w/o corr. (%)  NSPDK  FCD    Time (s)         Val. w/o corr. (%)  NSPDK  FCD     Time (s)
EM                 67.44               0.016  4.809  63.87            15.92               0.086  26.049  1019.89
Reverse            69.32               0.016  4.823  65.23            46.02               0.052  21.486  1021.09
EM + Langevin      93.98               0.008  3.889  111.49           94.47               0.025  15.292  2020.87
Rev. + Langevin    95.72               0.003  2.900  114.57           97.01               0.019  14.656  2020.06
S4 (Ours)          95.13               0.003  2.777  63.78            95.52               0.021  14.537  1021.21

In Table 12, we provide the full results of the table in Figure 3, comparing the fixed step size SDE solvers on the remaining datasets. As shown in Table 12, S4 significantly outperforms the predictor-only methods, and further achieves results competitive with the PC samplers at half the computation time.
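To clarify what the compared solvers do, the following is a minimal PyTorch sketch of one predictor step (Euler-Maruyama) and one Langevin corrector step for a generic reverse-time SDE. Here `drift`, `diffusion`, and `score_fn` are placeholders for the forward-SDE coefficients and the trained score model, and the sketch updates a single tensor, whereas GDSS applies such updates jointly to X and A; this is our simplification, not the actual implementation:

```python
# A hedged sketch of predictor and corrector steps for a reverse-time SDE
# dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dw̄, integrated backwards.
import torch

def em_predictor_step(x, t, dt, drift, diffusion, score_fn):
    """One Euler-Maruyama step from time t to t - dt (dt is a Python float)."""
    g = diffusion(t)
    rev_drift = drift(x, t) - (g ** 2) * score_fn(x, t)
    z = torch.randn_like(x)
    return x - rev_drift * dt + g * (dt ** 0.5) * z

def langevin_corrector_step(x, t, score_fn, snr):
    """One Langevin MCMC step with step size set by the signal-to-noise ratio."""
    score = score_fn(x, t)
    noise = torch.randn_like(x)
    step = 2 * (snr * noise.norm() / score.norm()) ** 2
    return x + step * score + (2 * step) ** 0.5 * noise
```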

Appendix E Visualization

In this section, we additionally provide visualizations of the generated graphs for the generic graph generation and molecule generation tasks.

E.1 Generic Graph Generation

We visualize graphs from the training datasets and the graphs generated by GDSS for each dataset in Figures 6-9. The visualized graphs are randomly selected from the training datasets and the generated graph sets. We additionally report the number of edges e and the number of nodes n of each graph.

Figure 6: Visualization of the graphs from the Ego-small dataset and the generated graphs of GDSS.
Figure 7: Visualization of the graphs from the Community-small dataset and the generated graphs of GDSS.
Figure 8: Visualization of the graphs from the ENZYMES dataset and the generated graphs of GDSS.
Figure 9: Visualization of the graphs from the Grid dataset and the generated graphs of GDSS.

E.2 Molecule Generation

We visualize the generated molecules that are maximally similar to given training molecules in Figure 10. The similarity measure is the Tanimoto similarity based on Morgan fingerprints, computed with the RDKit library (Landrum et al., 2016) using radius 2 and 1024 bits. As shown in the figure, GDSS is able to generate molecules that are structurally close to the training molecules, while the baselines generate molecules that deviate from the training distribution.
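This similarity computation can be reproduced with a few lines of RDKit; the following sketch assumes the molecules are given as SMILES strings:

```python
# Tanimoto similarity over Morgan fingerprints (radius 2, 1024 bits),
# matching the measure used in Figure 10; standard RDKit API.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    fp_a = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles_a), 2, nBits=1024)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles_b), 2, nBits=1024)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)
```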

Figure 10: Visualization of the generated molecules with the maximum Tanimoto similarity to molecules from the training set. For each generated molecule, we display the similarity value at the bottom.