Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?

Sitao Luan1,2, Chenqing Hua1, Qincheng Lu1, Jiaqi Zhu1, Mingde Zhao1,2, Shuyuan Zhang1,2,

Xiao-Wen Chang1, Doina Precup1,2,3

{sitao.luan@mail, chenqing.hua@mail, qincheng.lu@mail, jiaqi.zhu@mail, mingde.zhao@mail,
shuyuan.zhang@mail, chang@cs, dprecup@cs}.mcgill.ca
1McGill University; 2Mila; 3DeepMind

Abstract

Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by using graph structures based on the relational inductive bias (homophily assumption). Though GNNs are believed to outperform NNs in real-world tasks, the performance advantages of GNNs over graph-agnostic NNs do not seem generally satisfactory. Heterophily has been considered a main cause, and numerous works have been put forward to address it. In this paper, we first show that not all cases of heterophily are harmful for GNNs with aggregation operation (in general, harmful heterophily refers to a heterophilous structure that makes a graph-aware model underperform its corresponding graph-agnostic model). Then, we propose new metrics based on a similarity matrix which considers the influence of both graph structure and input features on GNNs. The metrics demonstrate advantages over the commonly used homophily metrics in tests on synthetic graphs. From the metrics and the observations, we find that some cases of harmful heterophily can be addressed by a diversification operation. With this fact and knowledge of filterbanks, we propose the Adaptive Channel Mixing (ACM) framework to adaptively exploit aggregation, diversification and identity channels in each GNN layer to address harmful heterophily. We validate the ACM-augmented baselines on 10 real-world node classification tasks. They consistently achieve significant performance gains and exceed the state-of-the-art GNNs on most of the tasks without incurring a significant computational burden.

1 Introduction

Deep Neural Networks (NNs) [21] have revolutionized many machine learning areas, including image recognition [20], speech recognition [13] and natural language processing [2]. One major strength is their capacity and effectiveness in learning latent representations from Euclidean data. Recently, the focus has been put on their applications to non-Euclidean data [5], e.g., relational data or graphs. Combining graph signal processing and convolutional neural networks [22], numerous Graph Neural Networks (GNNs) [34, 9, 15, 35, 18, 28] have been proposed, which empirically outperform traditional neural networks on graph-based machine learning tasks, e.g., node classification, graph classification, link prediction and graph generation. GNNs are built on the homophily assumption [31], i.e., connected nodes tend to share similar attributes with each other [14], which offers additional information besides node features. Such relational inductive bias [3] is believed to be a key factor leading to GNNs' superior performance over NNs' in many tasks.

Nevertheless, growing evidence shows that GNNs do not always gain advantages over traditional NNs when dealing with relational data. In some cases, even simple Multi-Layer Perceptrons (MLPs) can outperform GNNs by a large margin [40, 26, 7]. An important reason for the performance degradation is believed to be the heterophily problem, i.e., connected nodes tend to have different labels, which makes the homophily assumption fail. The heterophily challenge has received attention recently, and an increasing number of models have been put forward to address this problem [40, 26, 7, 39, 38].

Contributions

In this paper, we first demonstrate that not all heterophilous graphs are harmful for aggregation-based GNNs and that the existing metrics of homophily are insufficient to decide whether the aggregation operation will make nodes less distinguishable. By constructing a similarity matrix from a backpropagation analysis, we derive new metrics that depict how much GNNs are influenced by the graph structure and node features. We show the advantage of our metrics over the existing metrics by comparing their ability to characterize the performance of two baseline GNNs on synthetic graphs with different levels of homophily. From the similarity matrix, we find that the diversification operation is able to address some harmful heterophily cases, based on which we propose the Adaptive Channel Mixing (ACM) GNN framework. Experiments on synthetic datasets, ablation studies and real-world datasets consistently show that baseline GNNs augmented by the ACM framework obtain significant performance boosts on node classification tasks on heterophilous graphs.

The rest of this paper is organized as follows: In Section 2, we introduce the notation and background knowledge. In Section 3, we conduct a node-wise analysis of heterophily, derive new homophily metrics based on a similarity matrix, and conduct experiments to show their advantages over the existing homophily metrics. In Section 4, we demonstrate the capability of the diversification operation in addressing some cases of harmful heterophily and propose the ACM-GNN framework to adaptively utilize the information from different filterbank channels to address the heterophily problem. In Section 5, we discuss the related works and clarify the differences from our method. In Section 6, we provide empirical evaluations of the ACM framework, including an ablation study and tests on 10 real-world node classification tasks.

2 Preliminaries

We will introduce the related notation and background knowledge in this section. We use bold fonts for vectors (e.g., $\bm{v}$). Suppose we have an undirected connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E},A)$, where $\mathcal{V}$ is the node set with $|\mathcal{V}|=N$; $\mathcal{E}$ is the edge set without self-loops; $A\in\mathbb{R}^{N\times N}$ is the symmetric adjacency matrix with $A_{i,j}=1$ iff $e_{ij}\in\mathcal{E}$, and $A_{i,j}=0$ otherwise. We use $D$ to denote the diagonal degree matrix of $\mathcal{G}$, i.e., $D_{i,i}=d_i=\sum_j A_{i,j}$, and $\mathcal{N}_i$ to denote the neighborhood set of node $i$, i.e., $\mathcal{N}_i=\{j: e_{ij}\in\mathcal{E}\}$. A graph signal is a vector $\bm{x}\in\mathbb{R}^N$ defined on $\mathcal{V}$, where $\bm{x}_i$ is defined on node $i$. We also have a feature matrix $X\in\mathbb{R}^{N\times F}$, whose columns are graph signals and whose $i$-th row $X_{i,:}$ is the feature vector of node $i$. We use $Z\in\mathbb{R}^{N\times C}$ to denote the label encoding matrix, whose $i$-th row $Z_{i,:}$ is the one-hot encoding of the label of node $i$.

2.1 Graph Laplacian, Affinity Matrix and Their Variants

The (combinatorial) graph Laplacian is defined as $L=D-A$, which is Symmetric Positive Semi-Definite (SPSD) [8]. Its eigendecomposition gives $L=U\Lambda U^T$, where the columns $\bm{u}_i$ of $U\in\mathbb{R}^{N\times N}$ are orthonormal eigenvectors, namely the graph Fourier basis, $\Lambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_N)$ with $\lambda_1\leq\cdots\leq\lambda_N$, and these eigenvalues are also called frequencies. The graph Fourier transform of the graph signal $\bm{x}$ is defined as $\bm{x}_{\mathcal{F}}=U^{-1}\bm{x}=U^T\bm{x}=[\bm{u}_1^T\bm{x},\ldots,\bm{u}_N^T\bm{x}]^T$, where $\bm{u}_i^T\bm{x}$ is the component of $\bm{x}$ in the direction of $\bm{u}_i$.

In addition to $L$, some variants are also commonly used, e.g., the symmetric normalized Laplacian $L_{\text{sym}}=D^{-1/2}LD^{-1/2}=I-D^{-1/2}AD^{-1/2}$ and the random walk normalized Laplacian $L_{\text{rw}}=D^{-1}L=I-D^{-1}A$. The affinity (transition) matrices can be derived from the Laplacians, e.g., $A_{\text{rw}}=I-L_{\text{rw}}=D^{-1}A$ and $A_{\text{sym}}=I-L_{\text{sym}}=D^{-1/2}AD^{-1/2}$, and are considered to be low-pass filters [30]. Their eigenvalues satisfy $\lambda_i(A_{\text{rw}})=\lambda_i(A_{\text{sym}})=1-\lambda_i(L_{\text{sym}})=1-\lambda_i(L_{\text{rw}})\in(-1,1]$. Applying the renormalization trick [18] to the affinity and Laplacian matrices respectively leads to $\hat{A}_{\text{sym}}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ and $\hat{L}_{\text{sym}}=I-\hat{A}_{\text{sym}}$, where $\tilde{A}\equiv A+I$ and $\tilde{D}\equiv D+I$. The renormalized affinity matrix essentially adds a self-loop to each node in the graph and is widely used in the Graph Convolutional Network (GCN) [18] as follows,

$$Y=\text{softmax}(\hat{A}_{\text{sym}}\,\text{ReLU}(\hat{A}_{\text{sym}}XW_0)\,W_1)\qquad(1)$$

where $W_0\in\mathbb{R}^{F\times F_1}$ and $W_1\in\mathbb{R}^{F_1\times O}$ are learnable parameter matrices. GCN can be trained by minimizing the following cross entropy loss

$$\mathcal{L}=-\mathrm{trace}(Z^T\log Y)\qquad(2)$$

where $\log(\cdot)$ is a component-wise logarithm operation. The random walk renormalized matrix $\hat{A}_{\text{rw}}=\tilde{D}^{-1}\tilde{A}$, which shares the same eigenvalues as $\hat{A}_{\text{sym}}$, can also be applied in GCN. The corresponding Laplacian is defined as $\hat{L}_{\text{rw}}=I-\hat{A}_{\text{rw}}$. $\hat{A}_{\text{rw}}$ is essentially a random walk matrix and behaves as a mean aggregator, which is applied in spatial-based GNNs [15, 14]. To bridge the spectral and spatial methods, we use $\hat{A}_{\text{rw}}$ in this paper.
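To make these operators concrete, the following minimal NumPy sketch (our own illustration under the simplifying assumption of a dense adjacency matrix; not code from the paper) builds $\hat{A}_{\text{sym}}$, $\hat{A}_{\text{rw}}$ and their high-pass counterparts $\hat{L}_{\text{sym}}$, $\hat{L}_{\text{rw}}$:

    import numpy as np

    def renormalized_filters(A):
        """Build renormalized low-pass (affinity) and high-pass (Laplacian) filters.
        A: dense symmetric adjacency matrix without self-loops."""
        N = A.shape[0]
        A_tilde = A + np.eye(N)                        # A~ = A + I (add self-loops)
        d_tilde = A_tilde.sum(axis=1)                  # diagonal of D~
        D_inv = np.diag(1.0 / d_tilde)                 # D~^{-1}
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))   # D~^{-1/2}
        A_hat_sym = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric renormalized affinity
        A_hat_rw = D_inv @ A_tilde                     # random-walk renormalized affinity
        L_hat_sym = np.eye(N) - A_hat_sym              # high-pass counterpart
        L_hat_rw = np.eye(N) - A_hat_rw                # high-pass counterpart
        return A_hat_sym, A_hat_rw, L_hat_sym, L_hat_rw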

2.2 Metrics of Homophily

The metrics of homophily are defined by considering different relations between node labels and the graph structure defined by the adjacency matrix. There are three commonly used homophily metrics: edge homophily [1, 40], node homophily [33], and class homophily [24] (the authors in [24] did not name this homophily metric; we call it class homophily based on its definition), defined as follows:

$$H_{\text{edge}}(\mathcal{G})=\frac{\big|\{e_{uv}\mid e_{uv}\in\mathcal{E},\,Z_{u,:}=Z_{v,:}\}\big|}{|\mathcal{E}|},\qquad H_{\text{node}}(\mathcal{G})=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\frac{\big|\{u\mid u\in\mathcal{N}_v,\,Z_{u,:}=Z_{v,:}\}\big|}{d_v},\qquad(3)$$

$$H_{\text{class}}(\mathcal{G})=\frac{1}{C-1}\sum_{k=1}^{C}\left[h_k-\frac{\big|\{v\mid Z_{v,k}=1\}\big|}{N}\right]_{+},\qquad h_k=\frac{\sum_{v\in\mathcal{V}}\big|\{u\mid Z_{v,k}=1,\,u\in\mathcal{N}_v,\,Z_{u,:}=Z_{v,:}\}\big|}{\sum_{v\in\{v\,|\,Z_{v,k}=1\}}d_v}$$

where $[a]_+=\max(a,0)$ and $h_k$ is the class-wise homophily metric [24]. They are all in the range $[0,1]$; a value close to $1$ corresponds to strong homophily while a value close to $0$ indicates strong heterophily. $H_{\text{edge}}(\mathcal{G})$ measures the proportion of edges that connect two nodes in the same class; $H_{\text{node}}(\mathcal{G})$ evaluates the average proportion of edge-label consistency over all nodes; $H_{\text{class}}(\mathcal{G})$ tries to avoid the sensitivity to imbalanced classes, which can make $H_{\text{edge}}$ misleadingly large. The above definitions are all based on graph-label consistency and imply that any inconsistency will harm the performance of GNNs. With this in mind, we will show a counterexample to illustrate the insufficiency of the above metrics and propose new metrics in the following section.
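For reference, the edge and node homophily metrics in (3) can be computed with a few lines of NumPy; the sketch below assumes a dense adjacency matrix A and an integer label vector labels, and is only an illustration of the definitions, not the paper's code:

    import numpy as np

    def edge_homophily(A, labels):
        """Fraction of edges whose two endpoints share the same label."""
        src, dst = np.nonzero(np.triu(A, k=1))      # each undirected edge counted once
        return float(np.mean(labels[src] == labels[dst]))

    def node_homophily(A, labels):
        """Average over nodes of the fraction of same-label neighbors."""
        ratios = []
        for v in range(A.shape[0]):
            neighbors = np.nonzero(A[v])[0]
            if neighbors.size == 0:
                continue                             # skip isolated nodes
            ratios.append(np.mean(labels[neighbors] == labels[v]))
        return float(np.mean(ratios))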

3 Analysis of Heterophily

3.1 Motivation and Aggregation Homophily

Figure 1: Example of harmless heterophily

Heterophily is believed to be harmful for message-passing based GNNs [40, 33, 7] because, intuitively, the features of nodes in different classes will be falsely mixed, which makes nodes indistinguishable [40]. Nevertheless, this is not always the case; e.g., the bipartite graph shown in Figure 1 is highly heterophilous according to the homophily metrics in (3), but after mean aggregation, the nodes in classes 1 and 2 only exchange colors and are still distinguishable. The authors in [7] also point out the insufficiency of $H_{\text{node}}$ through examples showing that different graph topologies with the same $H_{\text{node}}$ can carry different label information.

To analyze to what extent the graph structure can affect the output of a GNN, we first simplify the GCN by removing its nonlinearity, as in [36]. Let $\hat{A}\in\mathbb{R}^{N\times N}$ denote a general aggregation operator. Then, equation (1) can be simplified as,

$$Y=\text{softmax}(\hat{A}XW)=\text{softmax}(Y')\qquad(4)$$

After each gradient descent step $\Delta W=\gamma\frac{d\mathcal{L}}{dW}$, where $\gamma$ is the learning rate, the update of $Y'$ will be (see Appendix B for the derivation),

$$\Delta Y'=\hat{A}X\Delta W=\gamma\hat{A}X\frac{d\mathcal{L}}{dW}\propto\hat{A}X\frac{d\mathcal{L}}{dW}=\hat{A}XX^T\hat{A}^T(Z-Y)=S(\hat{A},X)(Z-Y)\qquad(5)$$

where $S(\hat{A},X)\equiv\hat{A}X(\hat{A}X)^T$ is a post-aggregation node similarity matrix and $Z-Y$ is the prediction error matrix. The update direction of node $i$ is essentially a weighted sum of the prediction errors, i.e., $\Delta(Y')_{i,:}=\sum_{j\in\mathcal{V}}\left[S(\hat{A},X)\right]_{i,j}(Z-Y)_{j,:}$.

To study the effect of heterophily, we first define the aggregation similarity score as follows.

Definition 1 (Aggregation similarity score).

$$S_{\text{agg}}\left(S(\hat{A},X)\right)=\frac{\left|\left\{v\,\big|\,\mathrm{Mean}_u\big(\{S(\hat{A},X)_{v,u}\,|\,Z_{u,:}=Z_{v,:}\}\big)\geq\mathrm{Mean}_u\big(\{S(\hat{A},X)_{v,u}\,|\,Z_{u,:}\neq Z_{v,:}\}\big)\right\}\right|}{|\mathcal{V}|}\qquad(6)$$

where $\mathrm{Mean}_u(\{\cdot\})$ takes the average over $u$ of a given multiset of values or variables.



$S_{\text{agg}}(S(\hat{A},X))$ measures the proportion of nodes $v\in\mathcal{V}$ that put relatively larger similarity weights on nodes in the same class than on nodes in other classes after aggregation. It is easy to see that $S_{\text{agg}}(S(\hat{A},X))\in[0,1]$. In practice, however, we observe that in most datasets we have $S_{\text{agg}}(S(\hat{A},X))\geq 0.5$. Based on this observation, we rescale (6) to the following modified aggregation similarity for practical usage,

$$S^{M}_{\text{agg}}\left(S(\hat{A},X)\right)=\left[2S_{\text{agg}}\left(S(\hat{A},X)\right)-1\right]_{+}\qquad(7)$$

In order to measure the consistency between labels and graph structure without considering node features, and to make a fair comparison with the existing homophily metrics in (3), we define the graph ($\mathcal{G}$) aggregation ($\hat{A}$) homophily and its modified version as

$$H_{\text{agg}}(\mathcal{G})=S_{\text{agg}}\left(S(\hat{A},Z)\right),\qquad H_{\text{agg}}^{M}(\mathcal{G})=S_{\text{agg}}^{M}\left(S(\hat{A},Z)\right)\qquad(8)$$

In practice, we only check $H_{\text{agg}}(\mathcal{G})$ when $H_{\text{agg}}^{M}(\mathcal{G})=0$. As Figure 1 shows, when $\hat{A}=\hat{A}_{\text{rw}}$, we have $H_{\text{agg}}(\mathcal{G})=H_{\text{agg}}^{M}(\mathcal{G})=1$. Thus, the new metric reflects the fact that nodes in classes 1 and 2 are still highly distinguishable after aggregation, while the other metrics mentioned before fail to capture this information and misleadingly give the value 0. This shows the advantage of $H_{\text{agg}}(\mathcal{G})$ and $H_{\text{agg}}^{M}(\mathcal{G})$, which additionally consider information from the aggregation operator $\hat{A}$ and the similarity matrix.
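A minimal sketch of the aggregation homophily computation (our own illustration, assuming a dense adjacency matrix and a one-hot label matrix Z; whether a node counts itself among its same-class neighbors is a choice made in this sketch, not prescribed by the paper):

    import numpy as np

    def aggregation_homophily(A, Z):
        """H_agg(G) = S_agg(S(A_hat_rw, Z)) from equations (6) and (8)."""
        N = A.shape[0]
        A_tilde = A + np.eye(N)
        A_hat_rw = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # D~^{-1} A~
        AZ = A_hat_rw @ Z
        S = AZ @ AZ.T                                             # S(A_hat, Z)
        same_class = (Z @ Z.T).astype(bool)                       # True where labels agree
        count = 0
        for v in range(N):
            same = S[v, same_class[v]]
            diff = S[v, ~same_class[v]]
            # node v counts if its mean similarity to same-class nodes is at least
            # its mean similarity to nodes from other classes
            if diff.size == 0 or same.mean() >= diff.mean():
                count += 1
        return count / N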

To comprehensively compare $H_{\text{agg}}^{M}(\mathcal{G})$ with the metrics in (3) in terms of how they reveal the influence of graph structure on GNN performance, we generate synthetic graphs and evaluate SGC [36] and GCN [18] on them in the next subsection.

3.2 Evaluation and Comparison on Synthetic Graphs

Data Generation & Experimental Setup

We first generate 280 graphs in total, with 28 edge homophily levels varied from 0.005 to 0.95, each corresponding to 10 graphs. Every generated graph has 5 classes with 400 nodes in each class. For each node, we randomly generate 2 intra-class edges and $[\frac{2}{h}-2]$ inter-class edges (see the details of the data generation process in Appendix C). The features of nodes in each class are sampled from node features in the corresponding class of the base dataset. Nodes are randomly split into 60%/20%/20% for train/validation/test. We train 1-hop SGC (SGC-1) [36] and GCN [18] on the synthetic data (see Appendix A.1 for the hyperparameter search range). For each value of $H_{\text{edge}}(\mathcal{G})$, we take the average test accuracy and standard deviation over the 10 generated graphs. For each generated graph, we also calculate its $H_{\text{node}}(\mathcal{G})$, $H_{\text{class}}(\mathcal{G})$ and $H_{\text{agg}}^{M}(\mathcal{G})$. Model performance with respect to the different homophily values is shown in Figure 2.
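One way to realize such a generation process is sketched below (our reading of the setup; the paper's exact procedure is described in its Appendix C, and the base-dataset feature sampling is omitted here):

    import numpy as np

    def synthetic_graph(h, num_classes=5, per_class=400, intra_per_node=2, seed=0):
        """Generate an adjacency matrix whose edge homophily is roughly h."""
        rng = np.random.default_rng(seed)
        N = num_classes * per_class
        labels = np.repeat(np.arange(num_classes), per_class)
        A = np.zeros((N, N), dtype=np.int8)
        inter_per_node = max(int(round(2 / h - 2)), 0)   # [2/h - 2] inter-class edges
        for v in range(N):
            same = np.nonzero(labels == labels[v])[0]
            diff = np.nonzero(labels != labels[v])[0]
            for u in rng.choice(same[same != v], intra_per_node, replace=False):
                A[v, u] = A[u, v] = 1                    # 2 intra-class edges
            for u in rng.choice(diff, inter_per_node, replace=False):
                A[v, u] = A[u, v] = 1                    # inter-class edges
        return A, labels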

Figure 2: Comparison of baseline performance under different homophily metrics. Panels: (a) $H_{\text{edge}}(\mathcal{G})$, (b) $H_{\text{node}}(\mathcal{G})$, (c) $H_{\text{class}}(\mathcal{G})$, (d) $H_{\text{agg}}^{M}(\mathcal{G})$.

Comparison of Homophily Metrics

The performance of SGC-1 and GCN is expected to increase monotonically with a proper and informative homophily metric. However, Figure 2(a)(b)(c) shows that the performance curves under $H_{\text{edge}}(\mathcal{G})$, $H_{\text{node}}(\mathcal{G})$ and $H_{\text{class}}(\mathcal{G})$ are $U$-shaped (a similar $J$-shaped curve is found in [40], though using a different data generation process; that work does not mention the insufficiency of edge homophily), while Figure 2(d) reveals a nearly monotonic curve with a little perturbation around 1. This indicates that $H_{\text{agg}}^{M}(\mathcal{G})$ describes how the graph structure affects the performance of SGC-1 and GCN more appropriately and adequately than the existing metrics.

In addition, we notice in Figure 2(a) that both SGC-1 and GCN achieve their worst performance on all datasets when $H_{\text{edge}}(\mathcal{G})$ is somewhere between 0.1 and 0.2. This interesting phenomenon can be explained by the following theorem based on the similarity matrix, which further verifies the usefulness of $H_{\text{agg}}^{M}(\mathcal{G})$.

Theorem 1. (See Appendix D for the proof.) Suppose there are $C$ classes in the graph $\mathcal{G}$ and $\mathcal{G}$ is a $d$-regular graph (each node has $d$ neighbors). Given $d$, the edges of each node are i.i.d. generated, such that each edge of any node has probability $h$ of connecting to a node in the same class and probability $1-h$ of connecting to a node in a different class. Let the aggregation operator be $\hat{A}=\hat{A}_{\text{rw}}$. Then, for nodes $v$, $u_1$ and $u_2$, where $Z_{u_1,:}=Z_{v,:}$ and $Z_{u_2,:}\neq Z_{v,:}$, we have

$$g(h)\equiv\mathbb{E}\left(S(\hat{A},Z)_{v,u_1}\right)-\mathbb{E}\left(S(\hat{A},Z)_{v,u_2}\right)=\left(\frac{(C-1)(hd+1)-(1-h)d}{(C-1)(d+1)}\right)^{2}\qquad(9)$$

and the minimum of $g(h)$ is reached at

$$h=\frac{d+1-C}{Cd}=\frac{d_{\text{intra}}/h+1-C}{C(d_{\text{intra}}/h)}\;\Rightarrow\; h=\frac{d_{\text{intra}}}{Cd_{\text{intra}}+C-1}$$

where $d_{\text{intra}}=dh$ is the expected number of neighbors of a node that have the same label as the node.

The value of $g(h)$ in (9) is the expected difference between the similarity values of nodes in the same class as $v$ and nodes in other classes. $g(h)$ is strongly related to the definition of aggregation homophily, and its minimum potentially implies the worst value of $H_{\text{agg}}(\mathcal{G})$. In the synthetic experiments, we have $d_{\text{intra}}=2$, $C=5$, and the minimum of $g(h)$ is reached at $h=1/7\approx 0.14$, which corresponds to the lowest point of the performance curves in Figure 2(a). In other words, the $h$ where SGC-1 and GCN perform worst is where $g(h)$ attains its smallest value, not the point with the smallest edge homophily value $h=0$. This again reveals the advantage of $H_{\text{agg}}(\mathcal{G})$ over $H_{\text{edge}}(\mathcal{G})$ obtained by using the similarity matrix.
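As a quick check of this number (a worked computation added here for convenience, not part of the original text): with $d_{\text{intra}}=2$ and $C=5$,

$$h=\frac{d_{\text{intra}}}{Cd_{\text{intra}}+C-1}=\frac{2}{5\cdot 2+5-1}=\frac{2}{14}=\frac{1}{7}\approx 0.14,$$

which is exactly where the curves in Figure 2(a) bottom out.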

Figure 3: Example of how the HP filter addresses harmful heterophily

4 Adaptive Channel Mixing (ACM) Framework

Besides the new homophily metrics, in this section we also show how the diversification operation (high-pass filter) is potentially capable of addressing some cases of harmful heterophily, based on the similarity matrix proposed in equation (5). From the analysis, we argue that low-pass and high-pass filters should be combined for feature extraction, which leads us to the filterbank method in subsection 4.2. We generalize the filterbank method and propose the ACM framework in subsection 4.3.

4.1 How Diversification Operation Helps with Harmful Heterophily

We first consider the example shown in Figure 3. From $S(\hat{A},X)$, nodes 1 and 3 assign relatively large positive weights to nodes in class 2 after aggregation, which makes them hard to distinguish from nodes in class 2. Despite this, we can still distinguish nodes 1, 3 from nodes 4, 5, 6, 7 by considering their neighborhood differences: nodes 1, 3 are different from most of their neighbors, while nodes 4, 5, 6, 7 are similar to most of their neighbors. This indicates that, in some cases, although some nodes become similar after aggregation, they are still distinguishable via their surrounding dissimilarities. This leads us to use the diversification operation, i.e., the high-pass (HP) filter $I-\hat{A}$ [10] (introduced in the next subsection), to extract the information of neighborhood differences and address harmful heterophily. As $S(I-\hat{A},X)$ in Figure 3 shows, nodes 1, 3 assign negative weights to nodes 4, 5, 6, 7 after the diversification operation, i.e., nodes 1, 3 treat nodes 4, 5, 6, 7 as negative samples and will move away from them during backpropagation. Based on this example, we first propose diversification distinguishability as follows to measure the proportion of nodes for which the diversification operation is potentially helpful.

Definition 2 (Diversification Distinguishability (DD) based on $S(I-\hat{A},X)$).

Given $S(I-\hat{A},X)$, a node $v$ is diversification distinguishable if the following two conditions are satisfied at the same time,

$$\textbf{1. }\ \mathrm{Mean}_u\left(\{S(I-\hat{A},X)_{v,u}\,|\,u\in\mathcal{V}\land Z_{u,:}=Z_{v,:}\}\right)\geq 0;\qquad \textbf{2. }\ \mathrm{Mean}_u\left(\{S(I-\hat{A},X)_{v,u}\,|\,u\in\mathcal{V}\land Z_{u,:}\neq Z_{v,:}\}\right)\leq 0\qquad(10)$$

Then, the graph diversification distinguishability value is defined as

$$\mathrm{DD}_{\hat{A},X}(\mathcal{G})=\frac{1}{|\mathcal{V}|}\Big|\{v\,|\,v\text{ is diversification distinguishable}\}\Big|\qquad(11)$$

We can see that $\mathrm{DD}_{\hat{A},X}(\mathcal{G})\in[0,1]$. The effectiveness of the diversification operation can be proved for binary classification problems under certain conditions based on Definition 2, leading us to:

Theorem 2. (See Appendix E for the proof.) Suppose $X=Z$ and $\hat{A}=\hat{A}_{\text{rw}}$. Then, for a binary classification problem, i.e., $C=2$, all nodes are diversification distinguishable, i.e., $\mathrm{DD}_{\hat{A},Z}(\mathcal{G})=1$.

Theorem 2 theoretically demonstrates the importance of the diversification operation for extracting the high-frequency information of graph signals [10]. Combined with the aggregation operation, which is a low-pass filter [10, 30], we obtain a filterbank that uses both aggregation and diversification operations to distinctively extract the low- and high-frequency information from graph signals. We introduce filterbanks in the next subsection.
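For illustration, the DD value in (11) can be computed analogously to the aggregation homophily sketch above (again assuming a dense adjacency matrix, a feature matrix X and one-hot labels Z; this is not the authors' code):

    import numpy as np

    def diversification_distinguishability(A, X, Z):
        """DD_{A_hat, X}(G) from equations (10) and (11), with A_hat = A_hat_rw."""
        N = A.shape[0]
        A_tilde = A + np.eye(N)
        A_hat = A_tilde / A_tilde.sum(axis=1, keepdims=True)
        HX = (np.eye(N) - A_hat) @ X          # high-pass (diversification) filtered features
        S = HX @ HX.T                         # S(I - A_hat, X)
        same_class = (Z @ Z.T).astype(bool)
        count = 0
        for v in range(N):
            same_mean = S[v, same_class[v]].mean()
            diff = S[v, ~same_class[v]]
            diff_mean = diff.mean() if diff.size else 0.0
            if same_mean >= 0 and diff_mean <= 0:   # conditions 1 and 2 in (10)
                count += 1
        return count / N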

4.2 Filterbank in Spectral and Spatial Forms

Filterbank

For the graph signal $\bm{x}$ defined on $\mathcal{G}$, a 2-channel linear (analysis) filterbank [10] (in graph signal processing, an additional synthesis filter [10] is required to form the 2-channel filterbank, but the synthesis filter is not needed in our framework, so we do not introduce it here) includes a pair of low-pass (LP) and high-pass (HP) filters $H_{\text{LP}}, H_{\text{HP}}$, where $H_{\text{LP}}$ and $H_{\text{HP}}$ retain the low-frequency and high-frequency content of $\bm{x}$, respectively.

Most existing GNNs adopt a uni-channel filtering architecture [18, 35, 15] with either an $H_{\text{LP}}$ or an $H_{\text{HP}}$ channel, which only partially preserves the input information. Unlike the uni-channel architecture, filterbanks with $H_{\text{LP}}+H_{\text{HP}}=I$ do not lose any information of the input signal, i.e., they have the perfect reconstruction property [10].

Generally, the Laplacian matrices ($L_{\text{sym}}$, $L_{\text{rw}}$, $\hat{L}_{\text{sym}}$, $\hat{L}_{\text{rw}}$) can be regarded as HP filters [10] and the affinity matrices ($A_{\text{sym}}$, $A_{\text{rw}}$, $\hat{A}_{\text{sym}}$, $\hat{A}_{\text{rw}}$) can be treated as LP filters [30, 14]. Moreover, MLPs can be considered as owning a special identity filterbank with matrix $I$ that satisfies $H_{\text{LP}}+H_{\text{HP}}=I+0=I$.

Filterbank in Spatial Form

Filterbank methods can also be extended to spatial GNNs. Formally, at the node level, left multiplying $\bm{x}$ by $H_{\text{LP}}$ and $H_{\text{HP}}$ performs aggregation and diversification operations, respectively. For example, suppose $H_{\text{LP}}=\hat{A}$ and $H_{\text{HP}}=I-\hat{A}$; then for node $i$ we have

$$(H_{\text{LP}}\bm{x})_i=\sum_{j\in\{\mathcal{N}_i\cup i\}}\hat{A}_{i,j}\bm{x}_j,\qquad (H_{\text{HP}}\bm{x})_i=\bm{x}_i-\sum_{j\in\{\mathcal{N}_i\cup i\}}\hat{A}_{i,j}\bm{x}_j\qquad(12)$$

where $\hat{A}_{i,j}$ is the connection weight between the two nodes. To leverage the HP and identity channels in GNNs, we propose the Adaptive Channel Mixing (ACM) architecture in the following subsection.

4.3 Adaptive Channel Mixing (ACM) GNN Framework

The ACM framework can be applied to many baseline GNNs; in this subsection, we use GCN as an example and introduce the ACM framework in matrix form. We use $H_{\text{LP}}$ and $H_{\text{HP}}$ to represent general LP and HP filters. The ACM framework includes 3 steps as follows,

Step 1. Feature Extraction for Each Channel:   (13)

Option 1: $H_L^l=\text{ReLU}(H_{\text{LP}}H^{l-1}W_L^{l-1})$, $H_H^l=\text{ReLU}(H_{\text{HP}}H^{l-1}W_H^{l-1})$, $H_I^l=\text{ReLU}(IH^{l-1}W_I^{l-1})$;
Option 2: $H_L^l=H_{\text{LP}}\,\text{ReLU}(H^{l-1}W_L^{l-1})$, $H_H^l=H_{\text{HP}}\,\text{ReLU}(H^{l-1}W_H^{l-1})$, $H_I^l=I\,\text{ReLU}(H^{l-1}W_I^{l-1})$;
with $W_L^{l-1}, W_H^{l-1}, W_I^{l-1}\in\mathbb{R}^{F_{l-1}\times F_l}$;

Step 2. Feature-based Weight Learning:

$\tilde{\alpha}_L^l=\sigma(H_L^l\tilde{W}_L^l)$, $\tilde{\alpha}_H^l=\sigma(H_H^l\tilde{W}_H^l)$, $\tilde{\alpha}_I^l=\sigma(H_I^l\tilde{W}_I^l)$, with $\tilde{W}_L^{l-1}, \tilde{W}_H^{l-1}, \tilde{W}_I^{l-1}\in\mathbb{R}^{F_l\times 1}$;
$[\alpha_L^l,\alpha_H^l,\alpha_I^l]=\text{Softmax}([\tilde{\alpha}_L^l,\tilde{\alpha}_H^l,\tilde{\alpha}_I^l]W_{\text{Mix}}^l/T)$, with $W_{\text{Mix}}^l\in\mathbb{R}^{3\times 3}$ and $T\in\mathbb{R}$ the temperature;

Step 3. Node-wise Channel Mixing:

$H^l=\text{diag}(\alpha_L^l)H_L^l+\text{diag}(\alpha_H^l)H_H^l+\text{diag}(\alpha_I^l)H_I^l$.

The framework with Option 1 in Step 1 is the ACM framework and with Option 2 is the ACMII framework. ACM(II)-GCN first implements distinct feature extraction for each of the 3 channels. After being processed by a set of filterbanks, 3 filtered components $H_L^l$, $H_H^l$, $H_I^l$ are obtained. Different nodes may have different needs for the information in the 3 channels; e.g., in Figure 3, nodes 1, 3 demand high-frequency information while node 2 only needs low-frequency information. To adaptively exploit information from different channels, ACM(II)-GCN learns row-wise (node-wise) feature-conditioned weights to combine the 3 channels. ACM(II) can be easily plugged into spatial GNNs by replacing $H_{\text{LP}}$ and $H_{\text{HP}}$ with the aggregation and diversification operations shown in (12). See Appendix F for a detailed discussion of model comparison on synthetic datasets.
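To make the three steps concrete, here is a minimal PyTorch-style sketch of a single ACM layer using Option 1 of Step 1 (our own illustrative reconstruction from (13), not the authors' released implementation; the class name ACMLayer and the choice of sigmoid for $\sigma$ are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ACMLayer(nn.Module):
        """One ACM layer: LP, HP and identity channels with node-wise adaptive mixing."""
        def __init__(self, in_dim, out_dim, temperature=3.0):
            super().__init__()
            self.W_L = nn.Linear(in_dim, out_dim)   # low-pass channel weights
            self.W_H = nn.Linear(in_dim, out_dim)   # high-pass channel weights
            self.W_I = nn.Linear(in_dim, out_dim)   # identity channel weights
            self.w_tilde = nn.ModuleList([nn.Linear(out_dim, 1) for _ in range(3)])
            self.W_mix = nn.Linear(3, 3, bias=False)
            self.T = temperature

        def forward(self, H, A_lp, A_hp):
            # Step 1 (Option 1): per-channel feature extraction
            H_L = F.relu(A_lp @ self.W_L(H))
            H_H = F.relu(A_hp @ self.W_H(H))
            H_I = F.relu(self.W_I(H))
            # Step 2: feature-conditioned, node-wise channel weights
            alpha_tilde = torch.cat(
                [torch.sigmoid(w(h)) for w, h in zip(self.w_tilde, (H_L, H_H, H_I))], dim=1)
            alpha = torch.softmax(self.W_mix(alpha_tilde) / self.T, dim=1)   # N x 3
            # Step 3: node-wise channel mixing
            return alpha[:, 0:1] * H_L + alpha[:, 1:2] * H_H + alpha[:, 2:3] * H_I

Here A_lp and A_hp would be the chosen LP/HP filters, e.g., $\hat{A}_{\text{rw}}$ and $I-\hat{A}_{\text{rw}}$, passed in as dense or sparse tensors.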

Complexity

The number of learnable parameters in layer $l$ of ACM(II)-GCN is $3F_{l-1}(F_l+1)+9$, while it is $F_{l-1}F_l$ in GCN. The computation of Steps 1-3 takes $NF_l(8+6F_{l-1})+2F_l(\text{nnz}(H_{\text{LP}})+\text{nnz}(H_{\text{HP}}))+18N$ flops, while a GCN layer takes $2NF_{l-1}F_l+2F_l\,\text{nnz}(H_{\text{LP}})$ flops, where $\text{nnz}(\cdot)$ is the number of non-zero elements. A detailed comparison of running time is conducted in Section 6.1.

Limitations

The diversification operation does not work well in all harmful heterophily cases. For example, consider an imbalanced dataset where several small clusters with distinctive labels are densely connected to a large cluster. In this case, the surrounding differences of nodes in the small clusters are similar, i.e., the neighborhood differences mainly come from their connections to the same large cluster, which possibly makes the diversification operation fail to discriminate between them. See a more detailed demonstration and discussion in Appendix G.

5 Prior Work

GNNs on Addressing Heterophily

We discuss the relevant work of GNNs on addressing the heterophily challenge in this part. [1] acknowledges the difficulty of learning on graphs with weak homophily and proposes MixHop to extract features from multi-hop neighborhoods to obtain more information. Geom-GCN [33] precomputes unsupervised node embeddings and uses the graph structure defined by geometric relationships in the embedding space to define a bi-level aggregation process. [16] proposes measurements based on feature smoothness and label smoothness that are potentially helpful for guiding GNNs in dealing with heterophilous graphs. H2GCN [40] combines 3 key designs to address heterophily: (1) ego- and neighbor-embedding separation; (2) higher-order neighborhoods; (3) combination of intermediate representations. CPGNN [39] models label correlations by the compatibility matrix, which is beneficial for heterophily settings, and propagates a prior belief estimation into GNNs through the compatibility matrix. FBGNN [29] first proposes to use a filterbank to address the heterophily problem, but it does not fully explain the insights behind HP filters and does not contain the identity channel or the node-wise channel mixing mechanism. FAGCN [4] learns edge-level aggregation weights as GAT [35] does, but allows the weights to be negative, which enables the network to capture the high-frequency components in graph signals. GPRGNN [7] uses learnable weights that can be both positive and negative for feature propagation, which allows GPRGNN to adapt to the heterophilous structure of the graph and to handle both the high- and low-frequency parts of the graph signals.

GNNs with Filterbanks

Previously, geometric scattering networks [12, 32] applied filterbanks to address the over-smoothing problem [23]. The scattering construction captures different channels of variation from node features or labels. In geometric learning and graph signal processing, band-pass filtering operations extract geometric information beyond smooth signals, so filterbanks are believed to alleviate over-smoothing in GNNs. In the ACM framework, we instead aim to design a framework that, with the help of filterbanks, adaptively utilizes different channels to address the challenge of learning on heterophilous graphs. We thus deal with a different problem from [12, 32].

6 Experiments on Real-World Datasets

In this section, we evaluate the ACM(II) framework on real-world datasets. We first conduct ablation studies in subsection 6.1 to validate the different components. Then, we compare with state-of-the-art (SOTA) models in subsection 6.2.

6.1 Ablation Study & Efficiency

Table 1: Ablation study on 9 real-world datasets [33]. A cell with ✓ means the component is applied to the baseline model. The best test results are highlighted.
Ablation Study on Different Components in ACM-SGC and ACM-GCN (%)
Baseline Models | Components (LP, HP, Identity, Mixing) | Cornell | Wisconsin | Texas | Film | Chameleon | Squirrel | Cora | CiteSeer | PubMed | Rank (each dataset cell reports Acc ± Std)
SGC-1 w/ | ✓ | 70.98 ± 8.39 | 70.38 ± 2.85 | 83.28 ± 5.43 | 25.26 ± 1.18 | 64.86 ± 1.81 | 47.62 ± 1.27 | 85.12 ± 1.64 | 79.66 ± 0.75 | 85.5 ± 0.76 | 12.89
| ✓ ✓ ✓ | 83.28 ± 5.81 | 91.88 ± 1.61 | 90.98 ± 2.46 | 36.76 ± 1.01 | 65.27 ± 1.9 | 47.27 ± 1.37 | 86.8 ± 1.08 | 80.98 ± 1.68 | 87.21 ± 0.42 | 10.44
| ✓ ✓ ✓ | 93.93 ± 3.6 | 95.25 ± 1.84 | 93.93 ± 2.54 | 38.38 ± 1.13 | 63.83 ± 2.07 | 46.79 ± 0.75 | 86.73 ± 1.28 | 80.57 ± 0.99 | 87.8 ± 0.58 | 9.44
| ✓ ✓ ✓ | 88.2 ± 4.39 | 93.5 ± 2.95 | 92.95 ± 2.94 | 37.19 ± 0.87 | 62.82 ± 1.84 | 44.94 ± 0.93 | 85.22 ± 1.35 | 80.75 ± 1.68 | 88.11 ± 0.21 | 11.00
| ✓ ✓ ✓ ✓ | 93.77 ± 1.91 | 93.25 ± 2.92 | 93.61 ± 1.55 | 39.33 ± 1.25 | 63.68 ± 1.62 | 46.4 ± 1.13 | 86.63 ± 1.13 | 80.96 ± 0.93 | 87.75 ± 0.88 | 10.00
ACM-GCN w/ | ✓ | 82.46 ± 3.11 | 75.5 ± 2.92 | 83.11 ± 3.2 | 35.51 ± 0.99 | 64.18 ± 2.62 | 44.76 ± 1.39 | 87.78 ± 0.96 | 81.39 ± 1.23 | 88.9 ± 0.32 | 11.44
| ✓ ✓ ✓ | 82.13 ± 2.59 | 86.62 ± 4.61 | 89.19 ± 3.04 | 38.06 ± 1.35 | 69.21 ± 1.68 | 57.2 ± 1.01 | 88.93 ± 1.55 | 81.96 ± 0.91 | 90.01 ± 0.8 | 7.22
| ✓ ✓ ✓ | 94.26 ± 2.23 | 96.13 ± 2.2 | 94.1 ± 2.95 | 41.51 ± 0.99 | 67.44 ± 2.14 | 53.97 ± 1.39 | 88.95 ± 0.9 | 81.72 ± 1.22 | 90.88 ± 0.55 | 4.44
| ✓ ✓ ✓ | 91.64 ± 2 | 95.37 ± 3.31 | 95.25 ± 2.37 | 40.47 ± 1.49 | 68.93 ± 2.04 | 54.78 ± 1.27 | 89.13 ± 1.77 | 81.96 ± 2.03 | 91.01 ± 0.7 | 3.11
| ✓ ✓ ✓ ✓ | 94.75 ± 2.62 | 96.75 ± 1.6 | 95.08 ± 3.2 | 41.62 ± 1.15 | 69.04 ± 1.74 | 58.02 ± 1.86 | 88.95 ± 1.3 | 81.80 ± 1.26 | 90.69 ± 0.53 | 2.78
ACMII-GCN w/ | ✓ ✓ ✓ | 82.46 ± 3.03 | 91.00 ± 1.75 | 90.33 ± 2.69 | 38.39 ± 0.75 | 67.59 ± 2.14 | 53.67 ± 1.71 | 89.13 ± 1.14 | 81.75 ± 0.85 | 89.87 ± 0.39 | 7.44
| ✓ ✓ ✓ | 94.26 ± 2.57 | 96.00 ± 2.15 | 94.26 ± 2.96 | 40.96 ± 1.2 | 66.35 ± 1.76 | 50.78 ± 2.07 | 89.06 ± 1.07 | 81.86 ± 1.22 | 90.71 ± 0.67 | 4.67
| ✓ ✓ ✓ | 91.48 ± 1.43 | 96.25 ± 2.09 | 93.77 ± 2.91 | 40.27 ± 1.07 | 66.52 ± 2.65 | 52.9 ± 1.64 | 88.83 ± 1.16 | 81.54 ± 0.95 | 90.6 ± 0.47 | 6.67
| ✓ ✓ ✓ ✓ | 95.9 ± 1.83 | 96.62 ± 2.44 | 95.25 ± 3.15 | 41.84 ± 1.15 | 68.38 ± 1.36 | 54.53 ± 2.09 | 89.00 ± 0.72 | 81.79 ± 0.95 | 90.74 ± 0.5 | 2.78
Comparison of Average Running Time Per Epoch (ms)
SGC-1 w/ | ✓ | 2.53 | 2.83 | 2.5 | 3.18 | 3.48 | 4.65 | 3.47 | 3.43 | 4.04
| ✓ ✓ ✓ | 4.01 | 4.57 | 4.24 | 4.55 | 4.76 | 5.09 | 5.39 | 4.69 | 4.75
| ✓ ✓ ✓ | 3.88 | 4.01 | 4.04 | 4.43 | 4.06 | 4.5 | 4.38 | 3.82 | 4.16
| ✓ ✓ ✓ | 3.31 | 3.49 | 3.18 | 3.7 | 3.53 | 4.83 | 3.92 | 3.87 | 4.24
| ✓ ✓ ✓ ✓ | 5.53 | 5.96 | 5.43 | 5.21 | 5.41 | 6.96 | 6 | 5.9 | 6.04
ACM-GCN w/ | ✓ | 3.67 | 3.74 | 3.59 | 4.86 | 4.96 | 6.41 | 4.24 | 4.18 | 5.08
| ✓ ✓ ✓ | 6.63 | 8.06 | 7.89 | 8.11 | 7.8 | 9.39 | 7.82 | 7.38 | 8.74
| ✓ ✓ ✓ | 5.73 | 5.91 | 5.93 | 6.86 | 6.35 | 7.15 | 7.34 | 6.65 | 6.8
| ✓ ✓ ✓ | 5.16 | 5.25 | 5.2 | 5.93 | 5.64 | 8.02 | 5.73 | 5.65 | 6.16
| ✓ ✓ ✓ ✓ | 8.25 | 8.11 | 7.89 | 7.97 | 8.41 | 11.9 | 8.84 | 8.38 | 8.63
ACMII-GCN w/ | ✓ ✓ ✓ | 6.62 | 7.35 | 7.39 | 7.62 | 7.33 | 9.69 | 7.49 | 7.58 | 7.97
| ✓ ✓ ✓ | 6.3 | 6.05 | 6.26 | 6.87 | 6.44 | 6.5 | 6.14 | 7.21 | 6.6
| ✓ ✓ ✓ | 5.24 | 5.27 | 5.46 | 5.72 | 5.65 | 7.87 | 5.48 | 5.65 | 6.33
| ✓ ✓ ✓ ✓ | 7.59 | 8.28 | 8.06 | 8.85 | 8 | 10 | 8.27 | 8.5 | 8.68

We investigate the effectiveness and efficiency of adding the HP and identity channels and the adaptive mixing mechanism in the ACM(II) framework through an ablation study. Specifically, we apply the above components to SGC-1 and GCN separately, run 10 times on each dataset with the 60%/20%/20% random train/validation/test splits used in [7], and report the average test accuracy as well as the standard deviation. We also record the average running time per epoch (in milliseconds) to compare efficiency. We set the temperature T in the adaptive mixing mechanism to 3. (See Appendix A for the hyperparameter searching range.)
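To make the ablated components concrete, below is a minimal sketch of one ACM-style layer with aggregation (LP), diversification (HP) and identity channels mixed per node by softmax weights with temperature T. This is our own illustration under the assumptions noted in the comments (the class name ACMLayer and the exact way the mixing scores are computed are ours), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACMLayer(nn.Module):
    """Minimal sketch of an adaptive channel mixing layer (illustrative, not the official code)."""
    def __init__(self, in_dim, out_dim, temperature=3.0):
        super().__init__()
        self.W_lp = nn.Linear(in_dim, out_dim)   # aggregation (low-pass) channel
        self.W_hp = nn.Linear(in_dim, out_dim)   # diversification (high-pass) channel
        self.W_id = nn.Linear(in_dim, out_dim)   # identity channel
        self.att = nn.Linear(3 * out_dim, 3)     # per-node mixing scores
        self.T = temperature                     # softmax temperature

    def forward(self, X, A_lp, A_hp):
        # A_lp: normalized adjacency (LP filter); A_hp: its complementary HP filter
        H_lp = F.relu(A_lp @ self.W_lp(X))
        H_hp = F.relu(A_hp @ self.W_hp(X))
        H_id = F.relu(self.W_id(X))
        # adaptive, node-wise mixing weights with temperature T
        alpha = torch.softmax(self.att(torch.cat([H_lp, H_hp, H_id], dim=1)) / self.T, dim=1)
        return alpha[:, 0:1] * H_lp + alpha[:, 1:2] * H_hp + alpha[:, 2:3] * H_id
```

In this sketch, dropping the HP or identity branch, or replacing alpha by fixed equal weights, would correspond to the ablated variants compared in the table above.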

From the results we can see that on most datasets the additional HP and identity channels are helpful, even on strong homophily datasets such as Cora, CiteSeer and PubMed. The adaptive mixing mechanism also shows its advantage over directly adding the three channels together, which illustrates the necessity of learning to customize the channel usage adaptively for different nodes. As for efficiency, the running time under the ACM(II) framework is approximately double that of the original model.

6.2 Comparison with State-of-the-art Models

Table 2: Experimental results: average test accuracy ± standard deviation on 10 real-world benchmark datasets. The best results are highlighted. Results marked "*" are reported from [7, 24] and the Geom-GCN results are from [33]. NA means the reported results are not available and OOM means out of memory.
Cornell Wisconsin Texas Film Chameleon Squirrel Deezer-Europe Cora CiteSeer PubMed
#nodes 183 251 183 7,600 2,277 5,201 28,281 2,708 3,327 19,717
#edges 295 499 309 33,544 36,101 217,073 92,752 5,429 4,732 44,338
#features 1,703 1,703 1,703 931 2,325 2,089 31,241 1,433 3,703 500
#classes 5 5 5 5 5 5 2 7 6 3
H_edge 0.5669 0.4480 0.4106 0.3750 0.2795 0.2416 0.5251 0.8100 0.7362 0.8024
H_node 0.3855 0.1498 0.0968 0.2210 0.2470 0.2156 0.5299 0.8252 0.7175 0.7924
H_class 0.0468 0.0941 0.0013 0.0110 0.0620 0.0254 0.0304 0.7657 0.6270 0.6641
Data Splits 60%/20%/20% 60%/20%/20% 60%/20%/20% 60%/20%/20% 60%/20%/20% 60%/20%/20% 50%/25%/25% 60%/20%/20% 60%/20%/20% 60%/20%/20%
H_agg^M̂(G) 0.8032 0.7768 0.694 0.6822 0.61 0.3566 0.5790 0.9904 0.9826 0.9432
Test Accuracy (%) of State-of-the-art Models, Baseline GNN Models and ACM-GNN models
Rank
MLP-2* 91.30 ± 0.70 93.87 ± 3.33 92.26 ± 0.71 38.58 ± 0.25 46.72 ± 0.46 31.28 ± 0.27 66.55 ± 0.72 76.44 ± 0.30 76.25 ± 0.28 86.43 ± 0.13 18.60
GAT* 76.00 ± 1.01 71.01 ± 4.66 78.87 ± 0.86 35.98 ± 0.23 63.9 ± 0.46 42.72 ± 0.33 61.09 ± 0.77 76.70 ± 0.42 67.20 ± 0.46 83.28 ± 0.12 21.40
APPNP* 91.80 ± 0.63 92.00 ± 3.59 91.18 ± 0.70 38.86 ± 0.24 51.91 ± 0.56 34.77 ± 0.34 67.21 ± 0.56 79.41 ± 0.38 68.59 ± 0.30 85.02 ± 0.09 18.00
GPRGNN* 91.36 ± 0.70 93.75 ± 2.37 92.92 ± 0.61 39.30 ± 0.27 67.48 ± 0.40 49.93 ± 0.53 66.90 ± 0.50 79.51 ± 0.36 67.63 ± 0.38 85.07 ± 0.09 14.40
H2GCN 86.23 ± 4.71 87.5 ± 1.77 85.90 ± 3.53 38.85 ± 1.17 52.30 ± 0.48 30.39 ± 1.22 67.22 ± 0.90 87.52 ± 0.61 79.97 ± 0.69 87.78 ± 0.28 17.00
MixHop 60.33 ± 28.53 77.25 ± 7.80 76.39 ± 7.66 33.13 ± 2.40 36.28 ± 10.22 24.55 ± 2.60 66.80 ± 0.58 65.65 ± 11.31 49.52 ± 13.35 87.04 ± 4.10 23.50
GCN+JK 66.56 ± 13.82 62.50 ± 15.75 80.66 ± 1.91 32.72 ± 2.62 64.68 ± 2.85 53.40 ± 1.90 60.99 ± 0.14 86.90 ± 1.51 73.77 ± 1.85 90.09 ± 0.68 18.80
GAT+JK 74.43 ± 10.24 69.50 ± 3.12 75.41 ± 7.18 35.41 ± 0.97 68.14 ± 1.18 52.28 ± 3.61 59.66 ± 0.92 89.52 ± 0.43 74.49 ± 2.76 89.15 ± 0.87 16.70
FAGCN 88.03 ± 5.6 89.75 ± 6.37 88.85 ± 4.39 31.59 ± 1.37 49.47 ± 2.84 42.24 ± 1.2 66.86 ± 0.53 88.85 ± 1.36 82.37 ± 1.46 89.98 ± 0.54 14.10
GraphSAGE 71.41 ± 1.24 64.85 ± 5.14 79.03 ± 1.20 36.37 ± 0.21 62.15 ± 0.42 41.26 ± 0.26 OOM 86.58 ± 0.26 78.24 ± 0.30 86.85 ± 0.11 20.89
Geom-GCN 60.81 64.12 67.57 31.63 60.9 38.14 NA 85.27 77.99 90.05 22.67
SGC-1 70.98 ± 8.39 70.38 ± 2.85 83.28 ± 5.43 25.26 ± 1.18 64.86 ± 1.81 47.62 ± 1.27 59.73 ± 0.12 85.12 ± 1.64 79.66 ± 0.75 85.5 ± 0.76 20.10
SGC-2 72.62 ± 9.92 74.75 ± 2.89 81.31 ± 3.3 28.81 ± 1.11 62.67 ± 2.41 41.25 ± 1.4 61.56 ± 0.51 85.48 ± 1.48 80.75 ± 1.15 85.36 ± 0.52 20.70
GCNII 89.18 ± 3.96 83.25 ± 2.69 82.46 ± 4.58 40.82 ± 1.79 60.35 ± 2.7 38.81 ± 1.97 66.38 ± 0.45 88.98 ± 1.33 81.58 ± 1.3 89.8 ± 0.3 14.80
GCNII* 90.49 ± 4.45 89.12 ± 3.06 88.52 ± 3.02 41.54 ± 0.99 62.8 ± 2.87 38.31 ± 1.3 66.42 ± 0.56 88.93 ± 1.37 81.83 ± 1.78 89.98 ± 0.52 12.30
GCN 82.46 ± 3.11 75.5 ± 2.92 83.11 ± 3.2 35.51 ± 0.99 64.18 ± 2.62 44.76 ± 1.39 62.23 ± 0.53 87.78 ± 0.96 81.39 ± 1.23 88.9 ± 0.32 16.30
Snowball-2 82.62 ± 2.34 74.88 ± 3.42 83.11 ± 3.2 35.97 ± 0.66 64.99 ± 2.39 47.88 ± 1.23 OOM 88.64 ± 1.15 81.53 ± 1.71 89.04 ± 0.49 15.22
Snowball-3 82.95 ± 2.1 69.5 ± 5.01 83.11 ± 3.2 36.00 ± 1.36 65.49 ± 1.64 48.25 ± 0.94 OOM 89.33 ± 1.3 80.93 ± 1.32 88.8 ± 0.82 14.78
ACM-SGC-1 93.77 ± 1.91 93.25 ± 2.92 93.61 ± 1.55 39.33 ± 1.25 63.68 ± 1.62 46.4 ± 1.13 66.67 ± 0.56 86.63 ± 1.13 80.96 ± 0.93 87.75 ± 0.88 12.60
ACM-SGC-2 93.77 ± 2.17 94.00 ± 2.61 93.44 ± 2.54 40.13 ± 1.21 60.48 ± 1.55 40.91 ± 1.39 66.53 ± 0.57 87.64 ± 0.99 80.93 ± 1.16 88.79 ± 0.5 13.40
ACM-GCNII 92.62 ± 3.13 94.63 ± 2.96 92.46 ± 1.97 41.37 ± 1.37 58.73 ± 2.52 40.9 ± 1.58 66.39 ± 0.56 89.1 ± 1.61 82.28 ± 1.12 90.12 ± 0.4 10.40
ACM-GCNII* 93.44 ± 2.74 94.37 ± 2.81 93.28 ± 2.79 41.27 ± 1.24 61.66 ± 2.29 38.32 ± 1.5 66.6 ± 0.57 89.00 ± 1.35 81.69 ± 1.25 90.18 ± 0.51 10.10
ACM-GCN 94.75 ± 3.8 95.75 ± 2.03 94.92 ± 2.88 41.62 ± 1.15 69.04 ± 1.74 58.02 ± 1.86 67.01 ± 0.38 88.62 ± 1.22 81.68 ± 0.97 90.66 ± 0.47 4.80
ACM-Snowball-2 95.08 ± 3.11 96.38 ± 2.59 95.74 ± 2.22 41.4 ± 1.23 68.51 ± 1.7 55.97 ± 2.03 OOM 88.83 ± 1.49 81.58 ± 1.23 90.81 ± 0.52 4.44
ACM-Snowball-3 94.26 ± 2.57 96.62 ± 1.86 94.75 ± 2.41 41.27 ± 0.8 68.4 ± 2.05 55.73 ± 2.39 OOM 89.59 ± 1.58 81.32 ± 0.97 91.44 ± 0.59 4.44
ACMII-GCN 95.9 ± 1.83 96.62 ± 2.44 95.08 ± 2.07 41.84 ± 1.15 68.38 ± 1.36 54.53 ± 2.09 67.15 ± 0.41 89.00 ± 0.72 81.79 ± 0.95 90.74 ± 0.5 3.40
ACMII-Snowball-2 95.25 ± 1.55 96.63 ± 2.24 95.25 ± 1.55 41.1 ± 0.75 67.83 ± 2.63 53.48 ± 0.6 OOM 88.95 ± 1.04 82.07 ± 1.04 90.56 ± 0.39 4.78
ACMII-Snowball-3 93.61 ± 2.79 97.00 ± 2.63 94.75 ± 3.09 40.31 ± 1.6 67.53 ± 2.83 52.31 ± 1.57 OOM 89.36 ± 1.26 81.56 ± 1.15 91.31 ± 0.6 5.89

Datasets & Experimental Setup

In this section, we implement SGC [36] with 1 hop and 2 hops (SGC-1, SGC-2), GCNII [6], GCNII* [6], GCN [18] and snowball networks with 2 and 3 layers (Snowball-2, Snowball-3) and apply them in the ACM or ACMII framework (see details in Appendix A.5): we use $\hat{A}_{\text{rw}}$ as the LP filter and the corresponding HP filter can be derived from (12). We compare them with several baselines and SOTA models: MLP with 2 layers (MLP-2), GAT [35], APPNP [19], GPRGNN [7], H2GCN [40], MixHop [1], GCN+JK [18, 37, 24], GAT+JK [35, 37, 24], FAGCN [4], GraphSAGE [15] and Geom-GCN [33]. Besides the 9 benchmark datasets used in [33], we further test the above models on a new benchmark dataset, Deezer-Europe, proposed in [24]. We test these models 10 times on Cornell, Wisconsin, Texas, Film, Chameleon, Squirrel, Cora, CiteSeer and PubMed following the same early stopping strategy, the same random data splitting method (see the open source code for the splits at https://github.com/jianhao2016/GPRGNN/blob/f4aaad6ca28c83d3121338a4c4fe5d162edfa9a2/src/utils.py#L16; see Table 9 in Appendix I for the performance comparison with several SOTA models on the fixed 48%/32%/20% splits provided by [33]) and the Adam [17] optimizer as used in GPRGNN [7]. For Deezer-Europe, we test the above models 5 times with the same early stopping strategy, the same fixed splits and AdamW [27] as used in [24]. The details of the hyperparameter searching range and the optimal hyperparameters are reported in Appendices A.3 and A.4.
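For concreteness, the following small sketch shows one way to form the LP filter $\hat{A}_{\text{rw}}$ and a complementary HP filter from an adjacency matrix. It assumes the HP filter is simply $I - \hat{A}_{\text{rw}}$ (a natural complement when $\hat{A}_{\text{rw}}$ is the LP filter), and the helper name build_filters is ours rather than from the released code.

```python
import numpy as np

def build_filters(A):
    """Return the row-normalized LP filter A_rw and a complementary HP filter I - A_rw.

    A: dense (n, n) adjacency matrix (this sketch adds self-loops itself).
    Illustrative only; not the authors' released implementation.
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                  # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    A_rw = A_hat / deg                     # random-walk (row) normalization: D^{-1} A_hat
    H_hp = np.eye(n) - A_rw                # high-pass filter as the complement of the LP filter
    return A_rw, H_hp
```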

The main results of this set of experiments, together with the statistics of the datasets, are summarized in Table 2, where we report the mean accuracy and standard deviation. We can see that after being applied in the ACM(II) framework, the performance of the baseline models is boosted on almost all tasks. In particular, ACMII-GCN performs the best in terms of average rank (3.40) across all datasets, and ACM(II)-GNNs achieve SOTA performance on 8 out of 10 datasets. Overall, this suggests that the ACM(II) framework can help GNNs generalize better on node classification tasks on heterophilous graphs.

7 Future Work

The similarity matrix and the new metrics defined in this paper mainly capture the linear relations of the aggregated node features. This might be insufficient when nonlinear information in the feature vectors is important for classification. In the future, a similarity matrix that is able to capture nonlinear relations between node features can be proposed to define new homophily metrics.
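As one illustrative possibility (our own sketch, not a method proposed in the paper), the inner-product similarity of aggregated features could be replaced by a kernel Gram matrix, e.g. an RBF kernel, to capture nonlinear relations; the function name rbf_similarity and the bandwidth gamma below are assumptions for illustration only.

```python
import numpy as np

def rbf_similarity(H_agg, gamma=1.0):
    """One illustrative nonlinear alternative: an RBF-kernel Gram matrix of aggregated features.

    H_agg: (n, d) matrix of aggregated node features (e.g. rows of A_hat @ X).
    Returns an (n, n) similarity matrix S with S[i, j] = exp(-gamma * ||h_i - h_j||^2).
    """
    sq_norms = (H_agg ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * H_agg @ H_agg.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```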

As with the limitation raised in Section 4.3, the filterbank method cannot properly handle all cases of harmful heterophily. In the future, we need to explore and address more challenging heterophilous graphs with GNNs.

References

  • [1] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pages 21–29. PMLR, 2019.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [3] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [4] D. Bo, X. Wang, C. Shi, and H. Shen. Beyond low-frequency information in graph convolutional networks. arXiv preprint arXiv:2101.00797, 2021.
  • [5] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. arXiv, abs/1611.08097, 2016.
  • [6] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, pages 1725–1735. PMLR, 2020.
  • [7] E. Chien, J. Peng, P. Li, and O. Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations. https://openreview.net/forum, 2021.
  • [8] F. R. Chung and F. C. Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • [9] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv, abs/1606.09375, 2016.
  • [10] V. N. Ekambaram. Graph structured data viewed through a fourier lens. University of California, Berkeley, 2014.
  • [11] M. Fey and J. E. Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
  • [12] F. Gao, G. Wolf, and M. Hirn. Geometric scattering for graph data analysis. In International Conference on Machine Learning, pages 2122–2131. PMLR, 2019.
  • [13] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. Ieee, 2013.
  • [14] W. L. Hamilton. Graph representation learning. Synthesis Lectures on Artifical Intelligence and Machine Learning, 14(3):1–159, 2020.
  • [15] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. arXiv, abs/1706.02216, 2017.
  • [16] Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. Ma, H. Chen, and M.-C. Yang. Measuring and improving the use of graph information in graph neural networks. In International Conference on Learning Representations, 2019.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv, abs/1609.02907, 2016.
  • [19] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [21] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [23] Q. Li, Z. Han, and X. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv, abs/1801.07606, 2018.
  • [24] D. Lim, X. Li, F. Hohne, and S.-N. Lim. New benchmarks for learning on non-homophilous graphs. arXiv preprint arXiv:2104.01404, 2021.
  • [25] V. Lingam, R. Ragesh, A. Iyer, and S. Sellamanickam. Simple truncated svd based model for node classification on heterophilic graphs. arXiv preprint arXiv:2106.12807, 2021.
  • [26] M. Liu, Z. Wang, and S. Ji. Non-local graph neural networks. arXiv preprint arXiv:2005.14612, 2020.
  • [27] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [28] S. Luan, M. Zhao, X.-W. Chang, and D. Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174, 2019.
  • [29] S. Luan, M. Zhao, C. Hua, X.-W. Chang, and D. Precup. Complete the missing half: Augmenting aggregation filtering with diversification for graph convolutional networks. arXiv preprint arXiv:2008.08844, 2020.
  • [30] T. Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.
  • [31] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
  • [32] Y. Min, F. Wenkel, and G. Wolf. Scattering gcn: Overcoming oversmoothness in graph convolutional networks. arXiv preprint arXiv:2003.08414, 2020.
  • [33] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang. Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287, 2020.
  • [34] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
  • [35] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv, abs/1710.10903, 2017.
  • [36] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
  • [37] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5453–5462. PMLR, 10–15 Jul 2018.
  • [38] Y. Yan, M. Hashemi, K. Swersky, Y. Yang, and D. Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. arXiv preprint arXiv:2102.06462, 2021.
  • [39] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, and D. Koutra. Graph neural networks with heterophily. arXiv preprint arXiv:2009.13566, 2020.
  • [40] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems, 33, 2020.

Checklist

  1. For all authors…
     (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
     (b) Did you describe the limitations of your work? [Yes] In Appendix G, we discuss cases that the high-pass filter cannot tackle.
     (c) Did you discuss any potential negative societal impacts of your work? [No] It is discussed in Section 8; we have not identified significant negative societal impacts.
     (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
  2. If you are including theoretical results…
     (a) Did you state the full set of assumptions of all theoretical results? [Yes] In Sections 3 & 4, we mainly define a new homophily metric, and it is followed by two theorems.
     (b) Did you include complete proofs of all theoretical results? [Yes] In Appendices B, D & E, we justify the new metric and the two theorems.
  3. If you ran experiments…
     (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The settings are provided in detail and the source code is submitted in the supplemental material.
     (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In Section 6, we specify model details.
     (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We include the average test accuracy over repeated runs with the standard deviation.
     (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We include hardware details in the Appendix; the experiments are not computationally expensive.
  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
     (a) If your work uses existing assets, did you cite the creators? [Yes] In Section 6, we specify the datasets with their data split sources in footnotes.
     (b) Did you mention the license of the assets? [No]
     (c) Did you include any new assets either in the supplemental material or as a URL? [No]
     (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
     (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] None included.
  5. If you used crowdsourcing or conducted research with human subjects…
     (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [No] None included.
     (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [No] None included.
     (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [No] None included.

Appendix A Hyperparameters & Details of The Experiments

A.1 Hyperparameters Searching Range for GNNs on Synthetic Graphs

Table 3: Hyperparameter Searching Range for Synthetic Experiments
Hyperparameter Searching Range for Synthetic Experiments
Models\Hyperparameters lr weight_decay dropout hidden
MLP-1 0.05 {5e-5, 1e-4, 5e-4} - -
SGC-1 0.05 {5e-5, 1e-4, 5e-4} - -
ACM-SGC-1 0.05 {5e-5, 1e-4, 5e-4} { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} -
MLP-2 0.05 {5e-5, 1e-4, 5e-4} { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} 64
GCN 0.05 {5e-5, 1e-4, 5e-4} { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} 64
ACM-GCN 0.05 {5e-5, 1e-4, 5e-4} { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} 64

A.2 Hyperparameters Searching Range for GNNs on Ablation Study

Hyperparameter Searching Range for Ablation Study
Models\Hyperparameters lr weight_decay dropout hidden
SGC-LP+HP {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} - -
SGC-LP+Identity {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} - -
ACM-SGC-no adaptive mixing {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} -
GCN-LP+HP {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} 64
GCN-LP+Identity {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,0.8,0.9} 64
ACM-GCN-no adaptive mixing {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} 64
Table 4: Hyperparameter Searching Range for Ablation Study

A.3 Hyperparameters Searching Range for GNNs on Real-world Datasets

See Table 5.

Table 5: Hyperparameter Searching Range for Real-world Datasets
Models\Hyperparameters lr weight_decay dropout hidden lambda alpha_l head layers JK type
H2GCN 0.01 0.001 {0, 0.5} {8, 16, 32, 64} - - - {1, 2} -
MixHop 0.01 0.001 0.5 {8, 16, 32} - - - {2, 3} -
GCN+JK {0.1, 0.01, 0.001} 0.001 0.5 {4, 8, 16, 32, 64} - - - 2 {max, cat}
GAT+JK {0.1, 0.01, 0.001} 0.001 0.5 {4, 8, 12, 32} - - {2, 4, 8} 2 {max, cat}
GCNII, GCNII* 0.01 {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3} for Deezer-Europe and {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} for others 0.5 64 {0.5, 1, 1.5} {0.1, 0.2, 0.3, 0.4, 0.5} - {4, 8, 16, 32} for Deezer-Europe and {4, 8, 16, 32, 64} for others -
Baselines: {SGC-1, SGC-2, GCN, Snowball-2, Snowball-3, FAGCN}; ACM-{SGC-1, SGC-2, GCN, Snowball-2, Snowball-3}; ACMII-{SGC-1, SGC-2, GCN, Snowball-2, Snowball-3} {0.002, 0.01, 0.05} for Deezer-Europe and {0.01, 0.05, 0.1} for others {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3} for Deezer-Europe and {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} for others {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} 64 - - - - -
GraphSAGE {0.01, 0.05, 0.1} {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3} for Deezer-Europe and {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} for others {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} 8 for Deezer-Europe and 64 for others - - - - -
ACM-{GCNII, GCNII*} 0.01 {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3} for Deezer-Europe and {0, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} for others {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} 64 - - - {1, 2, 3, 4} -

A.4 Optimal Hyperparameters for Baselines and ACM(II)-GNNs on Real-world Tasks

See Tables 6 and 7.

Hyperparameters for Baseline GNNs
Datasets Models\Hyperparameters lr weight_decay dropout hidden # layers GAT heads JK Type lambda alpha_l results std average epoch time/average total time
Cornell SGC-1 0.05 1.00E-02 0 64 - - - - - 70.98 8.39 2.53ms/0.51s
SGC-2 0.05 1.00E-03 0 64 - - - - - 72.62 9.92 2.46ms/0.53s
GCN 0.1 5.00E-03 0.5 64 2 - - - - 82.46 3.11 3.67ms/0.74s
Snowball-2 0.01 5.00E-03 0.4 64 2 - - - - 82.62 2.34 4.24ms/0.87s
Snowball-3 0.01 5.00E-03 0.4 64 3 - - - - 82.95 2.1 6.66ms/1.36s
GCNII 0.01 1.00E-03 0.5 64 16 - - 0.5 0.5 89.18 3.96 25.41ms/8.11s
GCNII* 0.01 1.00E-03 0.5 64 8 - - 0.5 0.5 90.49 4.45 15.35ms/4.05s
FAGCN 0.01 1.00E-04 0.7 32 2 - - - - 88.03 5.6 8.1ms/3.8858s
Mixhop 0.01 0.001 0.5 16 2 - - - - 60.33 28.53 10.379ms/2.105s
H2GCN 0.01 0.001 0.5 64 1 - - - - 86.23 4.71 4.381ms/1.123s
GCN+JK 0.1 0.001 0.5 64 2 - cat - - 66.56 13.82 5.589ms/1.227s
GAT+JK 0.1 0.001 0.5 32 2 8 max - - 74.43 10.24 10.725ms/2.478s
Wisconsin SGC-1 0.05 5.00E-03 0 64 - - - - - 70.38 2.85 2.83ms/0.57s
SGC-2 0.1 1.00E-03 0 64 - - - - - 74.75 2.89 2.14ms/0.43s
GCN 0.1 1.00E-03 0.7 64 2 - - - - 75.5 2.92 3.74ms/0.76s
Snowball-2 0.1 1.00E-03 0.5 64 2 - - - - 74.88 3.42 3.73ms/0.76s
Snowball-3 0.05 5.00E-04 0.8 64 3 - - - - 69.5 5.01 5.46ms/1.12s
GCNII 0.01 1.00E-03 0.5 64 8 - - 0.5 0.5 83.25 2.69
GCNII* 0.01 1.00E-03 0.5 64 4 - - 1.5 0.3 89.12 3.06 9.26ms/1.96s
FAGCN 0.05 1.00E-04 0 32 2 - - - - 89.75 6.37 12.9ms/4.6359s
Mixhop 0.01 0.001 0.5 16 2 - - - - 77.25 7.80 10.281ms/2.095s
H2GCN 0.01 0.001 0.5 32 1 - - - - 87.5 1.77 4.324ms/1.134s
GCN+JK 0.1 0.001 0.5 32 2 - cat - - 62.5 15.75 5.117ms/1.049s
GAT+JK 0.1 0.001 0.5 4 2 8 max - - 69.5 3.12 10.762ms/2.25s
APPNP 0.05 0.001 0.5 64 2 - - - - 92 3.59 10.303ms/2.104s
GPRGNN 0.05 0.001 0.5 256 2 - - - - 93.75 2.37 11.856ms/2.415s
Texas SGC-1 0.05 1.00E-03 0 64 - - - - - 83.28 5.43 2.55ms/0.54s
SGC-2 0.01 1.00E-03 0 64 - - - - - 81.31 3.3 2.61ms/2.53s
GCN 0.05 1.00E-02 0.9 64 2 - - - - 83.11 3.2 3.59ms/0.73s
Snowball-2 0.05 1.00E-02 0.9 64 2 - - - - 83.11 3.2 3.98ms/0.82s
Snowball-3 0.05 1.00E-02 0.9 64 3 - - - - 83.11 3.2 5.56ms/1.12s
GCNII 0.01 1.00E-04 0.5 64 4 - - 1.5 0.5 82.46 4.58
GCNII* 0.01 1.00E-04 0.5 64 8 - - 0.5 0.5 88.52 3.02 15.64ms/3.47s
FAGCN 0.01 5.00E-04 0 32 2 - - - - 88.85 4.39 8.8ms/6.5252s
Mixhop 0.01 0.001 0.5 32 2 - - - - 76.39 7.66 11.099ms/2.329s
H2GCN 0.01 0.001 0.5 64 1 - - - - 85.90 3.53 4.197ms/0.95s
GCN+JK 0.1 0.001 0.5 32 2 - cat - - 80.66 1.91 5.28ms/1.085s
GAT+JK 0.1 0.001 0.5 8 2 2 cat - - 75.41 7.18 10.937ms/2.402s
Film SGC-1 0.01 5.00E-06 0 64 - - - - - 25.26 1.18 3.18ms/0.70s
SGC-2 0.01 5.00E-06 0 64 - - - - - 28.81 1.11 2.13ms/0.43s
GCN 0.1 5.00E-04 0 64 2 - - - - 35.51 0.99 4.86ms/0.99s
Snowball-2 0.1 5.00E-04 0 64 2 - - - - 35.97 0.66 5.59ms/1.14s
Snowball-3 0.1 5.00E-04 0.2 64 3 - - - - 36 1.36 7.89ms/1.60s
GCNII 0.01 1.00E-04 0.5 64 8 - - 1.5 0.3 40.82 1.79 15.85ms/3.22s
GCNII* 0.01 1.00E-06 0.5 64 4 - - 1 0.1 41.54 0.99
FAGCN 0.01 5.00E-05 0.6 32 2 - - - - 31.59 1.37 45.4ms/11.107s
Mixhop 0.01 0.001 0.5 8 3 8 max - - 33.13 2.40 17.651ms/3.566s
H2GCN 0.01 0.001 0 64 1 8 max - - 38.85 1.17 8.101ms/1.695s
GCN+JK 0.1 0.001 0.5 64 2 8 cat - - 32.72 2.62 8.946ms/1.807s
GAT+JK 0.001 0.001 0.5 32 2 4 cat - - 35.41 0.97 20.726ms/4.187s
Chameleon SGC-1 0.1 5.00E-06 0 64 - - - - - 64.86 1.81 3.48ms/2.96s
SGC-2 0.1 0.00E+00 0 64 - - - - - 62.67 2.41 4.43ms/1.12s
GCN 0.01 1.00E-05 0.9 64 2 - - - - 64.18 2.62 4.96ms/1.18s
Snowball-2 1.00E-01 1.00E-05 0.9 64 2 - - - - 64.99 2.39 4.96ms/1.00s
Snowball-3 0.1 5.00E-06 0.9 64 3 - - - - 65.49 1.64 7.44ms/1.50s
GCNII 0.01 5.00E-06 0.5 64 4 - - 0.5 0.1 60.35 2.7 9.76ms/2.26s
GCNII* 0.01 5.00E-04 0.5 64 4 - - 1.5 0.5 62.8 2.87 10.40ms/2.17s
FAGCN 0.002 1.00E-04 0 32 2 - - - - 49.47 2.84 8.4ms/13.8696s
Mixhop 0.01 0.001 0.5 16 2 8 max - - 36.28 10.2 11.372ms/2.297s
H2GCN 0.01 0.001 0 32 1 8 max - - 52.3 0.48 4.059ms/0.82s
GCN+JK 0.001 0.001 0.5 32 2 8 cat - - 64.68 2.85 5.211ms/1.053s
GAT+JK 0.001 0.001 0.5 4 2 8 max - - 68.14 1.18 13.772ms/2.788s
Squirrel SGC-1 0.05 0.00E+00 0 64 - - - - - 47.62 1.27 4.65ms/1.44s
SGC-2 0.1 0.00E+00 0.9 64 - - - - - 41.25 1.4 35.06ms/7.81s
GCN 0.01 5.00E-05 0.7 64 2 - - - - 44.76 1.39 8.41ms/2.50s
Snowball-2 0.1 0.00E+00 0.9 64 2 - - - - 47.88 1.23 8.96ms/1.92s
Snowball-3 0.1 0.00E+00 0.8 64 3 - - - - 48.25 0.94 14.00ms/2.90s
GCNII 0.01 1.00E-04 0.5 64 4 - - 1.5 0.2 38.81 1.97 13.35ms/2.70s
GCNII* 0.01 5.00E-04 0.5 64 4 - - 1.5 0.3 38.31 1.3 13.81ms/2.78s
FAGCN 0.05 1.00E-04 0 32 2 - - - - 42.24 1.2 16ms/6.7961s
Mixhop 0.01 0.001 0.5 32 2 - - - - 24.55 2.6 17.634ms/3.562s
H2GCN 0.01 0.001 0 16 1 - - - - 30.39 1.22 9.315ms/1.882s
GCN+JK 0.001 0.001 0.5 32 2 - max - - 53.4 1.9 14.321ms/2.905s
GAT+JK 0.001 0.001 0.5 8 2 4 max - - 52.28 3.61 29.097ms/5.878s
Cora SGC-1 0.1 5.00E-06 0 64 - - - - - 85.12 1.64 3.47ms/11.55s
SGC-2 0.1 1.00E-05 0 64 - - - - - 85.48 1.48 2.91ms/6.85s
GCN 0.1 5.00E-04 0.2 64 2 - - - - 87.78 0.96 4.24ms/0.86s
Snowball-2 0.1 5.00E-04 0.1 64 2 - - - - 88.64 1.15 4.65ms/0.94s
Snowball-3 0.05 1.00E-03 0.6 64 3 - - - - 89.33 1.3 6.41ms/1.32s
GCNII 0.01 1.00E-04 0.5 64 16 - - 0.5 0.2 88.98 1.33
GCNII* 0.01 5.00E-04 0.5 64 4 - - 0.5 0.5 88.93 1.37 10.16ms/2.24s
FAGCN 0.05 5.00E-04 0 32 2 - - - - 88.85 1.36 8.4ms/3.3183s
Mixhop 0.01 0.001 0.5 16 2 - - - - 65.65 11.31 11.177ms/2.278s
H2GCN 0.01 0.001 0 32 1 - - - - 87.52 0.61 4.335ms/1.209s
GCN+JK 0.001 0.001 0.5 64 2 - cat - - 86.90 1.51 6.656ms/1.346s
GAT+JK 0.001 0.001 0.5 32 2 2 cat - - 89.52 0.43 12.91ms/2.608s
CiteSeer SGC-1 0.1 5.00E-04 0 64 - - - - - 79.66 0.75 3.43ms/7.30s
SGC-2 0.01 5.00E-04 0.9 64 - - - - - 80.75 1.15 5.33ms/4.40s
GCN 0.1 1.00E-03 0.9 64 2 - - - - 81.39 1.23 4.18ms/0.86s
Snowball-2 0.1 1.00E-03 0.8 64 2 - - - - 81.53 1.71 5.19ms/1.11s
Snowball-3 0.1 1.00E-03 0.9 64 3 - - - - 80.93 1.32 7.64ms/1.69s
GCNII 0.01 1.00E-03 0.5 64 16 - - 0.5 0.2 81.58 1.3
GCNII* 0.01 1.00E-03 0.5 64 16 - - 0.5 0.2 81.83 1.78 32.50ms/10.29s
FAGCN 0.05 5.00E-04 0 32 2 - - - - 82.37 1.46 9.4ms/4.7648s
Mixhop 0.01 0.001 0.5 16 2 - - - - 49.52 13.35 13.793ms/2.786s
H2GCN 0.01 0.001 0 8 1 - - - - 79.97 0.69 5.794ms/3.049s
GCN+JK 0.001 0.001 0.5 32 2 - max - - 73.77 1.85 5.264ms/1.063s
GAT+JK 0.001 0.001 0.5 8 2 4 max - - 74.49 2.76 12.326ms/2.49s
PubMed SGC-1 0.05 5.00E-06 0.3 64 - - - - - 87.75 0.88 6.04ms/2.61s
SGC-2 0.05 5.00E-05 0.1 64 - - - - - 88.79 0.5 8.62ms/3.18s
GCN 0.1 5.00E-05 0.6 64 2 - - - - 88.9 0.32 5.08ms/1.03s
Snowball-2 0.1 5.00E-04 0 64 2 - - - - 89.04 0.49 5.68ms/1.19s
Snowball-3 0.1 5.00E-06 0 64 3 - - - - 88.8 0.82 8.54ms/1.75s
GCNII 0.01 1.00E-06 0.5 64 4 - - 0.5 0.5 89.8 0.3 10.98ms/3.21s
GCNII* 0.01 1.00E-06 0.5 64 4 - - 0.5 0.1 89.98 0.52 11.47ms/3.24s
FAGCN 0.05 5.00E-04 0 32 2 - - - - 89.98 0.54 14.5ms/6.411s
Mixhop 0.01 0.001 0.5 16 2 - - - - 87.04 4.10 17.459ms/3.527s
H2GCN 0.01 0.001 0 64 1 - - - - 87.78 0.28 8.039ms/2.28s
GCN+JK 0.01 0.001 0.5 32 2 - cat - - 90.09 0.68 12.001ms/2.424s
GAT+JK 0.1 0.001 0.5 8 2 4 max - - 89.15 0.87 20.403ms/4.125s
Deezer-Europe FAGCN 0.01 0.0001 0 32 2 - - - - 66.86 0.53 41.7ms/20.8362s
GCNII 0.01 5e-6,1e-5 0.5 64 32 - - 0.5 0.5 66.38 0.45 126.58ms/63.16s
GCNII* 0.01 1e-4,1e-3 0.5 64 32 - - 0.5 0.5 66.42 0.56 134.05ms/66.89s
Table 6: Hyperparameters for baseline models
Hyperparameters for ACM-GNNs and ACMII-GNNs
Datasets Models\Hyperparameters lr weight_decay dropout hidden # layers GAT heads JK Type lambda alpha_l results std average epoch time/average total time
Cornell ACM-SGC-1 0.01 5.00E-03 0.6 64 - - - - - 93.77 1.91 5.53ms/2.31s 5.53 毫秒/2.31 秒
ACM-SGC-2 0.01 5.00E-03 0.6 64 - - - - - 93.77 2.17 4.73ms/1.87s 4.73 毫秒/1.87 秒
ACM-GCN 0.05 1.00E-02 0.2 64 2 - - - - 94.75 3.8 8.25ms/1.69s 8.25 毫秒/1.69 秒
ACMII-GCN 0.1 1.00E-02 0.5 64 2 - - - - 95.25 2.79 8.43ms/1.71s 8.43 毫秒/1.71 秒
ACM-GCNII 0.01 1.00E-03 0.5 64 1 - - 0.5 0.4 92.62 3.13 6.81ms/1.43s 6.81 毫秒/1.43 秒
ACM-GCNII* 0.01 5.00E-04 0.5 64 1 - - 0.5 0.1 93.44 2.74 6.76ms/1.39s 6.76 毫秒/1.39 秒
ACM-Snowball-2 0.05 1.00E-02 0.2 64 2 - - - - 95.08 3.11 9.15ms/1.86s 9.15 毫秒/1.86 秒
ACM-Snowball-3 0.1 1.00E-02 0.4 64 3 - - - - 94.26 2.57 13.20ms/2.68s 13.20 毫秒/2.68 秒
ACMII-Snowball-2 0.05 1.00E-02 0.6 64 2 - - - - 95.25 1.55 8.23ms/1.72s 8.23 毫秒/1.72 秒
ACMII-Snowball-3 0.05 1.00E-02 0.7 64 3 - - - - 93.61 2.79 11.70ms/2.37s 11.70 毫秒/2.37 秒
Wisconsin ACM-SGC-1 0.05 5.00E-03 0.7 64 - - - - - 93.25 2.92 5.96ms/1.34s 5.96 毫秒/1.34 秒
ACM-SGC-2 0.1 5.00E-03 0.2 64 - - - - - 94 2.61 4.60ms/0.95s 4.60 毫秒/0.95 秒
ACM-GCN 0.1 5.00E-03 0 64 2 - - - - 95.75 2.03 8.11ms/1.64s 8.11 毫秒/1.64 秒
ACMII-GCN 0.1 1.00E-02 0.2 64 2 - - - - 96.62 2.44 8.28ms/1.68s 8.28 毫秒/1.68 秒
ACM-GCNII 0.01 5.00E-03 0.5 64 1 - - 1 0.1 94.63 2.96 9.31ms/2.19s 9.31 毫秒/2.19 秒
ACM-GCNII* 0.01 1.00E-03 0.5 64 1 - - 1.5 0.4 94.37 2.81 7.11ms/1.45s 7.11 毫秒/1.45 秒
ACM-Snowball-2 0.1 5.00E-03 0.1 64 2 - - - - 96.38 2.59 8.63ms/1.74s 8.63 毫秒/1.74 秒
ACM-Snowball-3 0.05 1.00E-02 0.3 64 3 - - - - 96.62 1.86 12.79ms/2.58s 12.79 毫秒/2.58 秒
ACMII-Snowball-2 0.1 1.00E-02 0.1 64 2 - - - - 96.63 2.24 8.11ms/1.65s 8.11 毫秒/1.65 秒
ACMII-Snowball-3 0.1 5.00E-03 0.1 64 3 - - - - 97 2.63 12.38ms/2.51s 12.38 毫秒/2.51 秒
Texas ACM-SGC-1 0.01 5.00E-03 0.6 64 - - - - - 93.61 1.55 5.43ms/2.18s 5.43 毫秒/2.18 秒
ACM-SGC-2 0.05 5.00E-03 0.4 64 - - - - - 93.44 2.54 4.59ms/1.01s 4.59 毫秒/1.01 秒
ACM-GCN 0.05 1.00E-02 0.6 64 2 - - - - 94.92 2.88 8.33ms/1.70s 8.33 毫秒/1.70 秒
ACMII-GCN 0.1 5.00E-03 0.4 64 2 - - - - 95.08 2.54 8.49ms/1.72s 8.49 毫秒/1.72 秒
ACM-GCNII 0.01 1.00E-03 0.5 64 1 - - 0.5 0.4 92.46 1.97 6.47ms/1.36s 6.47 毫秒/1.36 秒
ACM-GCNII* 0.01 1.00E-03 0.5 64 1 - - 0.5 0.4 93.28 2.79 7.03ms/1.45s 7.03 毫秒/1.45 秒
ACM-Snowball-2 0.05 1.00E-02 0.1 64 2 - - - - 95.74 2.22 8.35ms/1.71s 8.35 毫秒/1.71 秒
ACM-Snowball-3 0.01 5.00E-03 0.6 64 3 - - - - 94.75 2.41 12.56ms/2.63s 12.56 毫秒/2.63 秒
ACMII-Snowball-2 0.1 1.00E-02 0.4 64 2 - - - - 95.25 1.55 9.74ms/1.97s 9.74 毫秒/1.97 秒
ACMII-Snowball-3 0.05 1.00E-02 0.6 64 3 - - - - 94.75 3.09 11.91ms/2.42s 11.91 毫秒/2.42 秒
Film ACM-SGC-1 0.05 5.00E-05 0.7 64 - - - - - 39.33 1.25 5.21ms/2.33s 5.21 毫秒/2.33 秒
ACM-SGC-2 0.1 5.00E-05 0.7 64 - - - - - 40.13 1.21 12.41ms/4.87s 12.41 毫秒/4.87 秒
ACM-GCN 0.1 5.00E-04 0.5 64 2 - - - - 41.62 1.15 10.72ms/2.66s 10.72 毫秒/2.66 秒
ACMII-GCN 0.1 5.00E-04 0.5 64 2 - - - - 41.24 1.16 10.51ms/2.44s 10.51 毫秒/2.44 秒
ACM-GCNII 0.01 0.00E+00 0.5 64 3 - - 1.5 0.2 41.37 1.37 13.65ms/2.74s 13.65 毫秒/2.74 秒
ACM-GCNII* 0.01 1.00E-05 0.5 64 3 - - 1.5 0.1 41.27 1.24 14.98ms/3.01s 14.98 毫秒/3.01 秒
ACM-Snowball-2 0.1 5.00E-03 0 64 2 - - - - 41.4 1.23 10.30ms/2.08s 10.30 毫秒/2.08 秒
ACM-Snowball-3 0.05 1.00E-02 0 64 3 - - - - 41.27 0.8 16.43ms/3.52s 16.43 毫秒/3.52 秒
ACMII-Snowball-2 0.1 5.00E-03 0 64 2 - - - - 41.1 0.75 10.74ms/2.19s 10.74 毫秒/2.19 秒
ACMII-Snowball-3 0.05 5.00E-03 0.2 64 3 - - - - 40.31 1.6 16.31ms/3.29s 16.31 毫秒/3.29 秒
Chameleon ACM-SGC-1 0.1 5.00E-06 0.9 64 - - - - - 63.68 1.62 5.41ms/1.21s 5.41 毫秒/1.21 秒
ACM-SGC-2 0.1 5.00E-06 0.9 64 - - - - - 60.48 1.55 7.86ms/1.81s 7.86 毫秒/1.81 秒
ACM-GCN 0.01 5.00E-05 0.8 64 2 - - - - 68.18 1.67 10.55ms/3.12s 10.55 毫秒/3.12 秒
ACMII-GCN 0.05 5.00E-05 0.7 64 2 - - - - 68.38 1.36 10.90ms/2.39s 10.90 毫秒/2.39 秒
ACM-GCNII 0.01 5.00E-06 0.5 64 4 - - 0.5 0.1 58.73 2.52 18.31ms/3.68s 18.31 毫秒/3.68 秒
ACM-GCNII* 0.01 1.00E-03 0.5 64 1 - - 1 0.1 61.66 2.29 6.68ms/1.40s 6.68 毫秒/1.40 秒
ACM-Snowball-2 0.05 5.00E-05 0.7 64 2 - - - - 68.51 1.7 9.92ms/2.06s 9.92 毫秒/2.06 秒
ACM-Snowball-3 0.01 1.00E-04 0.7 64 3 - - - - 68.4 2.05 14.49ms/3.15s 14.49 毫秒/3.15 秒
ACMII-Snowball-2 0.1 5.00E-05 0.6 64 2 - - - - 67.83 2.63 9.99ms/2.10s 9.99 毫秒/2.10 秒
ACMII-Snowball-3 0.05 1.00E-04 0.7 64 3 - - - - 67.53 2.83 15.03ms/3.29s 15.03 毫秒/3.29 秒
Squirrel ACM-SGC-1 0.05 0.00E+00 0.9 64 - - - - - 46.4 1.13 6.96ms/2.16s 6.96 毫秒/2.16 秒
ACM-SGC-2 0.05 0.00E+00 0.9 64 - - - - - 40.91 1.39 35.20ms/10.66s 35.20 毫秒/10.66 秒
ACM-GCN 0.05 5.00E-06 0.6 64 2 - - - - 58.02 1.86 14.35ms/2.98s 14.35 毫秒/2.98 秒
ACMII-GCN 0.05 0.00E+00 0.7 64 2 - - - - 53.76 1.63 14.08ms/3.39s 14.08 毫秒/3.39 秒
ACM-GCNII 0.01 1.00E-05 0.5 64 4 - - 0.5 0.1 40.9 1.58 20.72ms/4.17s 20.72 毫秒/4.17 秒
ACM-GCNII* 0.01 1.00E-03 0.5 64 4 - - 0.5 0.3 38.32 1.5 21.78ms/4.38s 21.78 毫秒/4.38 秒
ACM-Snowball-2 0.05 5.00E-06 0.6 64 2 - - - - 55.97 2.03 15.38ms/3.15s 15.38 毫秒/3.15 秒
ACM-Snowball-3 0.01 1.00E-04 0.6 64 3 - - - - 55.73 2.39 26.15ms/5.94s 26.15 毫秒/5.94 秒
ACMII-Snowball-2 0.1 5.00E-06 0.6 64 2 - - - - 53.48 0.6 15.54ms/3.19s 15.54 毫秒/3.19 秒
ACMII-Snowball-3 0.05 5.00E-05 0.5 64 3 - - - - 52.31 1.57 26.24ms/5.30s 26.24 毫秒/5.30 秒
Cora ACM-SGC-1 0.01 5.00E-06 0.9 64 - - - - - 86.63 1.13 6.00ms/7.40s 6.00 毫秒/7.40 秒
ACM-SGC-2 0.1 5.00E-05 0.6 64 - - - - - 87.64 0.99 4.85ms/1.17s 4.85 毫秒/1.17 秒
ACM-GCN 0.1 5.00E-03 0.5 64 2 - - - - 88.62 1.22 8.84ms/1.81s 8.84 毫秒/1.81 秒
ACMII-GCN 0.1 5.00E-03 0.4 64 2 - - - - 89 0.72 8.93ms/1.83s 8.93 毫秒/1.83 秒
ACM-GCNII 0.01 1.00E-03 0.5 64 3 - - 1 0.2 89.1 1.61 14.07ms/3.04s 14.07 毫秒/3.04 秒
ACM-GCNII* 0.01 1.00E-02 0.5 64 4 - - 1 0.2 89 1.35 11.36ms/2.48s 11.36 毫秒/2.48 秒
ACM-Snowball-2 0.05 1.00E-03 0.6 64 2 - - - - 88.83 1.49 9.34ms/1.92s 9.34 毫秒/1.92 秒
ACM-Snowball-3 0.1 1.00E-02 0.3 64 3 - - - - 89.59 1.58 13.33ms/2.75s 13.33 毫秒/2.75 秒
ACMII-Snowball-2 0.1 5.00E-03 0.5 64 2 - - - - 88.95 1.04 9.29ms/1.90s 9.29 毫秒/1.90 秒
ACMII-Snowball-3 0.1 5.00E-03 0.5 64 3 - - - - 89.36 1.26 14.18ms/2.89s 14.18 毫秒/2.89 秒
CiteSeer ACM-SGC-1 0.01 5.00E-04 0.9 64 - - - - - 80.96 0.93 5.90ms/4.31s 5.90 毫秒/4.31 秒
ACM-SGC-2 0.05 5.00E-04 0.9 64 - - - - - 80.93 1.16 5.01ms/1.42s 5.01 毫秒/1.42 秒
ACM-GCN 0.05 5.00E-03 0.7 64 2 - - - - 81.68 0.97 11.35ms/2.57s 11.35 毫秒/2.57 秒
ACMII-GCN 0.05 5.00E-05 0.7 64 2 - - - - 81.58 1.77 9.55ms/1.94s 9.55 毫秒/1.94 秒
ACM-GCNII 0.01 1.00E-02 0.5 64 3 - - 0.5 0.3 82.28 1.12 15.61ms/3.56s 15.61 毫秒/3.56 秒
ACM-GCNII* 0.01 1.00E-02 0.5 64 3 - - 0.5 0.5 81.69 1.25 15.56ms/3.61s 15.56 毫秒/3.61 秒
ACM-Snowball-2 0.05 5.00E-03 0.7 64 2 - - - - 81.58 1.23 11.14ms/2.50s 11.14 毫秒/2.50 秒
ACM-Snowball-3 0.01 5.00E-03 0.9 64 3 - - - - 81.32 0.97 15.91ms/5.36s 15.91 毫秒/5.36 秒
ACMII-Snowball-2 0.05 5.00E-03 0.7 64 2 - - - - 82.07 1.04 10.97ms/2.55s 10.97 毫秒/2.55 秒
ACMII-Snowball-3 0.05 1.00E-04 0.6 64 3 - - - - 81.56 1.15 14.95ms/3.03s 14.95 毫秒/3.03 秒
PubMed ACM-SGC-1 0.05 5.00E-06 0.3 64 - - - - - 87.75 0.88 6.04ms/2.61s 6.04 毫秒/2.61 秒
ACM-SGC-2 0.05 5.00E-05 0.1 64 - - - - - 88.79 0.5 8.62ms/3.18s 8.62 毫秒/3.18 秒
ACM-GCN 0.1 5.00E-04 0.2 64 2 - - - - 90.54 0.63 10.20ms/2.08s 10.20 毫秒/2.08 秒
ACMII-GCN 0.1 5.00E-04 0.2 64 2 - - - - 90.74 0.5 10.20ms/2.07s 10.20 毫秒/2.07 秒
ACM-GCNII 0.01 1.00E-04 0.5 64 3 - - 1.5 0.5 90.12 0.4 15.07ms/3.35s 15.07 毫秒/3.35 秒
ACM-GCNII* 0.01 1.00E-04 0.5 64 3 - - 1.5 0.5 90.18 0.51 16.62ms/3.72s 16.62 毫秒/3.72 秒
ACM-Snowball-2 0.1 1.00E-04 0.3 64 2 - - - - 90.81 0.52 11.52ms/2.36s 11.52 毫秒/2.36 秒
ACM-Snowball-3 0.05 1.00E-03 0.2 64 3 - - - - 91.44 0.59 18.06ms/3.69s 18.06 毫秒/3.69 秒
ACMII-Snowball-2 0.1 1.00E-04 0.3 64 2 - - - - 90.56 0.39 11.74ms/2.39s 11.74 毫秒/2.39 秒
ACMII-Snowball-3 0.1 5.00E-04 0.2 64 3 - - - - 91.31 0.6 18.61ms/3.88s 18.61 毫秒/3.88 秒
Deezer-Europe ACM-SGC-1 0.05 0,5e-6,1e-5,5e-5 0.3 64 - - - - - 66.67 0.56 146.41ms/73.06s 146.41 毫秒/73.06 秒
ACM-SGC-2 0.002 5e-5,1e-4 0.3 64 - - - - - 66.53 0.57 195.21ms/97.41s 195.21 毫秒/97.41 秒
ACM-GCN 0.002 5.00E-04 0.5 64 2 - - - - 67.01 0.38 136.45ms/68.09s 136.45 毫秒/68.09 秒
ACMII-GCN 0.01 5.00E-05 0.8 64 2 - - - - 67.15 0.41 135.24ms/67.48s 135.24 毫秒/67.48 秒
ACM-GCNII 0.01 0,5e-6 5e-7 0.5 64 1 - - 0.5 0.4 66.39 0.56 80.82ms/40.33s 80.82 毫秒/40.33 秒
ACM-GCNII* 0.01 0.0001,1e-3 0.5 64 1 - - 1.5 0.2 66.6 0.57 80.95ms/40.40s 80.95 毫秒/40.40 秒
Table 7: Hyperparameters for ACM-GNNs and ACMII-GNNs

A.5 Details of the Implementation of ACM and ACMII Frameworks

In the ACM(II) framework, we first apply a dropout operation to the input data. The implementation of ACM(II)-GCN and ACM(II)-Snowball is straightforward, but SGC-1, SGC-2, GCNII and GCNII* cannot all be applied under the ACM(II) framework directly; we explain this as follows.

  • SGC-1 and SGC-2: SGC does not contain nonlinearity, so option 1 and option 2 in step 1 are the same for ACM-SGC and ACMII-SGC. Thus, we only implement ACM-SGC.


  • GCNII and GCNII*:

\[
\text{GCNII: }\mathbf{H}^{(\ell+1)}=\sigma\left(\left(\left(1-\alpha_{\ell}\right)\hat{\mathbf{A}}\mathbf{H}^{(\ell)}+\alpha_{\ell}\mathbf{H}^{(0)}\right)\left(\left(1-\beta_{\ell}\right)\mathbf{I}_{n}+\beta_{\ell}\mathbf{W}^{(\ell)}\right)\right)
\]
\[
\text{GCNII*: }\mathbf{H}^{(\ell+1)}=\sigma\left(\left(1-\alpha_{\ell}\right)\hat{\mathbf{A}}\mathbf{H}^{(\ell)}\left(\left(1-\beta_{\ell}\right)\mathbf{I}_{n}+\beta_{\ell}\mathbf{W}_{1}^{(\ell)}\right)+\alpha_{\ell}\mathbf{H}^{(0)}\left(\left(1-\beta_{\ell}\right)\mathbf{I}_{n}+\beta_{\ell}\mathbf{W}_{2}^{(\ell)}\right)\right)
\]


Without major modification, GCNII and GCNII* are hard to put into the ACMII framework. In the ACMII framework, before applying the operator $\hat{A}$, we first implement a nonlinear feature operation over $H^{(\ell)}$. But in GCNII and GCNII*, before multiplying by $W^{(\ell)}$ (or $W_{1}^{(\ell)}, W_{2}^{(\ell)}$) to extract features, we need to add another term involving $H^{(0)}$, which is not filtered by $\hat{A}$. This is incompatible with the ACMII framework; thus, we did not implement GCNII and GCNII* in the ACMII framework.
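To illustrate the ordering difference described above, here is a schematic sketch (our own illustration, not the released code) of the two channel computations: in an ACM-style channel the nonlinearity comes after filtering, whereas an ACMII-style channel applies a nonlinear feature operation before the filter, which is exactly where the extra unfiltered $H^{(0)}$ term of GCNII does not fit.

```python
import torch.nn.functional as F

def acm_channel(A_hat, H, W):
    # ACM-style channel (schematic): nonlinearity applied after filtering, sigma(A_hat H W).
    return F.relu(A_hat @ H @ W)

def acmii_channel(A_hat, H, W):
    # ACMII-style channel (schematic): a nonlinear feature operation is applied to H
    # before the filter A_hat, i.e. A_hat sigma(H W). GCNII/GCNII* would need to add
    # an extra term with H^(0) that is not filtered by A_hat, which does not fit this pattern.
    return A_hat @ F.relu(H @ W)
```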

The open source code will be released soon.

A.6 Computing Resources

For all experiments on synthetic datasets and real-world datasets, we use NVidia V100 GPUs with 16/32GB GPU memory, 8-core CPU, 16G Memory. The software implementation is based on PyTorch and PyTorch Geometric [11].

Appendix B Details of Gradient Calculation in (5)

B.1 Derivation in Matrix Form

In output layer, we have
在输出层,我们有

Y𝑌\displaystyle Y =softmax(A^XW)softmax(Y)=(exp(Y)𝟏C𝟏CT)1exp(Y)>0absentsoftmax^𝐴𝑋𝑊softmaxsuperscript𝑌direct-productsuperscriptsuperscript𝑌subscript1𝐶superscriptsubscript1𝐶𝑇1superscript𝑌0\displaystyle=\text{softmax}(\hat{A}XW)\equiv\text{softmax}(Y^{\prime})=\left(\exp(Y^{\prime})\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\odot\exp(Y^{\prime})>0
\displaystyle\mathcal{L} =trace(ZTlogY)absenttracesuperscript𝑍𝑇𝑌\displaystyle=-\mathrm{trace}(Z^{T}\log Y)

where 𝟏CC×1subscript1𝐶superscript𝐶1\bm{1}_{C}\in\mathcal{R}^{C\times 1}, ()1superscript1(\cdot)^{-1} is point-wise inverse function and each element of Y𝑌Y is positive. Then
其中 𝟏CC×1subscript1𝐶superscript𝐶1\bm{1}_{C}\in\mathcal{R}^{C\times 1}()1superscript1(\cdot)^{-1} 是逐点逆函数, Y𝑌Y 的每个元素都是正数。然后

d=trace(ZT((Y)1dY))=trace(ZT((softmax(Y))1dsoftmax(Y)))𝑑tracesuperscript𝑍𝑇direct-productsuperscript𝑌1𝑑𝑌tracesuperscript𝑍𝑇direct-productsuperscriptsoftmaxsuperscript𝑌1𝑑softmaxsuperscript𝑌\displaystyle d\mathcal{L}=-\mathrm{trace}\left(Z^{T}((Y)^{-1}\odot dY)\right)=-\mathrm{trace}\left(Z^{T}\left(\left(\text{softmax}(Y^{\prime})\right)^{-1}\odot d\ \text{softmax}(Y^{\prime})\right)\right)

Note that

\[
\begin{aligned}
d\,\text{softmax}(Y')=\ & -\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-2}\odot\left[(\exp(Y')\odot dY')\bm{1}_{C}\bm{1}_{C}^{T}\right]\odot\exp(Y') \\
& +\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\odot(\exp(Y')\odot dY') \\
=\ & -\text{softmax}(Y')\odot\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\odot\left[(\exp(Y')\odot dY')\bm{1}_{C}\bm{1}_{C}^{T}\right] \\
& +\text{softmax}(Y')\odot dY' \\
=\ & \text{softmax}(Y')\odot\left(-\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\odot\left[(\exp(Y')\odot dY')\bm{1}_{C}\bm{1}_{C}^{T}\right]+dY'\right)
\end{aligned}
\]

Then,

\[
\begin{aligned}
d\mathcal{L}=\ & -\mathrm{trace}\Bigg(Z^{T}\bigg((\text{softmax}(Y'))^{-1}\odot\Big[\text{softmax}(Y')\odot\Big(-\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1} \\
& \odot\left[(\exp(Y')\odot dY')\bm{1}_{C}\bm{1}_{C}^{T}\right]+dY'\Big)\Big]\bigg)\Bigg) \\
=\ & -\mathrm{trace}\left(Z^{T}\left(-\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\odot\left[(\exp(Y')\odot dY')\bm{1}_{C}\bm{1}_{C}^{T}\right]+dY'\right)\right) \\
=\ & \mathrm{trace}\left(\left(\left(Z\odot\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\right)\bm{1}_{C}\bm{1}_{C}^{T}\right)^{T}\left[\exp(Y')\odot dY'\right]-Z^{T}dY'\right) \\
=\ & \mathrm{trace}\left(\left(\exp(Y')\odot\left(\left(Z\odot\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\right)\bm{1}_{C}\bm{1}_{C}^{T}\right)\right)^{T}dY'-Z^{T}dY'\right) \\
=\ & \mathrm{trace}\left(\left(\exp(Y')\odot\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\right)^{T}dY'-Z^{T}dY'\right) \\
=\ & \mathrm{trace}\left(\left(\text{softmax}(Y')-Z\right)^{T}dY'\right)
\end{aligned}
\]

where the fourth equality holds because $\left(Z\odot\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}\right)\bm{1}_{C}\bm{1}_{C}^{T}=\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)^{-1}$. Thus, we have

\[
\frac{d\mathcal{L}}{dY'}=\text{softmax}(Y')-Z=Y-Z
\]

For $Y'$ and $W$, we have

\[
dY'=\hat{A}X\,dW \quad\text{and}\quad d\mathcal{L}=\mathrm{trace}\left(\frac{d\mathcal{L}}{dY'}^{T}dY'\right)=\mathrm{trace}\left(\frac{d\mathcal{L}}{dY'}^{T}\hat{A}X\,dW\right)=\mathrm{trace}\left(\frac{d\mathcal{L}}{dW}^{T}dW\right)
\]

To get $\frac{d\mathcal{L}}{dW}$, we have

\[
\frac{d\mathcal{L}}{dW}=X^{T}\hat{A}^{T}\frac{d\mathcal{L}}{dY'}=X^{T}\hat{A}^{T}(Y-Z) \qquad (14)
\]
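As a sanity check, the closed-form gradient in (14) can be verified numerically against finite differences. The sketch below is illustrative only: the matrix sizes, the random row-normalized $\hat{A}$ and the variable names are our own assumptions, not part of the paper's setup.

```python
# Numerical sanity check of Eq. (14): dL/dW = X^T A_hat^T (Y - Z).
import numpy as np

rng = np.random.default_rng(0)
N, F, C = 6, 4, 3                                   # nodes, input features, classes (illustrative)
A_hat = rng.random((N, N))
A_hat /= A_hat.sum(axis=1, keepdims=True)           # a generic row-normalized "A_hat"
X = rng.standard_normal((N, F))
W = rng.standard_normal((F, C))
Z = np.eye(C)[rng.integers(0, C, N)]                # one-hot labels

def softmax(M):
    E = np.exp(M - M.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(W):
    Y = softmax(A_hat @ X @ W)                      # Y = softmax(A_hat X W)
    return -np.trace(Z.T @ np.log(Y))               # L = -trace(Z^T log Y)

# Analytic gradient from Eq. (14)
Y = softmax(A_hat @ X @ W)
grad_analytic = X.T @ A_hat.T @ (Y - Z)

# Central finite differences
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(F):
    for j in range(C):
        E = np.zeros_like(W); E[i, j] = eps
        grad_fd[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.abs(grad_analytic - grad_fd).max())        # tiny, e.g. ~1e-8
```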

B.2 Component-wise Derivation

Denote $\tilde{X}=XW$. We rewrite $\mathcal{L}$ as follows:

\[
\begin{aligned}
\mathcal{L} &= -\mathrm{trace}\left(Z^{T}\log\left((\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T})^{-1}\odot\exp(Y')\right)\right) \\
&= -\mathrm{trace}\left(Z^{T}\left(-\log(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T})+Y'\right)\right) \\
&= -\mathrm{trace}\left(Z^{T}Y'\right)+\mathrm{trace}\left(Z^{T}\log\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)\right) \\
&= -\mathrm{trace}\left(Z^{T}\hat{A}XW\right)+\mathrm{trace}\left(Z^{T}\log\left(\exp(Y')\bm{1}_{C}\bm{1}_{C}^{T}\right)\right) \\
&= -\mathrm{trace}\left(Z^{T}\hat{A}XW\right)+\mathrm{trace}\left(\bm{1}_{C}^{T}\log\left(\exp(Y')\bm{1}_{C}\right)\right) \\
&= -\sum_{i=1}^{N}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,:}\tilde{X}_{j,:}^{T}+\sum_{i=1}^{N}\log\left(\sum_{c=1}^{C}\exp\Big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c}\Big)\right) \\
&= -\sum_{i=1}^{N}\log\left(\exp\Big(\sum_{c=1}^{C}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,c}\tilde{X}_{j,c}\Big)\right)+\sum_{i=1}^{N}\log\left(\sum_{c=1}^{C}\exp\Big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c}\Big)\right) \\
&= -\sum_{i=1}^{N}\log\frac{\exp\Big(\sum_{c=1}^{C}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,c}\tilde{X}_{j,c}\Big)}{\sum_{c=1}^{C}\exp\Big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c}\Big)}
\end{aligned}
\]

Note that $\sum_{c=1}^{C}Z_{j,c}=1$ for any $j$. Consider the derivative of $\mathcal{L}$ with respect to $\tilde{X}_{j',c'}$:

\[
\begin{aligned}
\frac{d\mathcal{L}}{d\tilde{X}_{j',c'}}
=\ & -\sum_{i=1}^{N}\frac{\sum_{c=1}^{C}\exp\big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c}\big)}{\exp\big(\sum_{c=1}^{C}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,c}\tilde{X}_{j,c}\big)} \\
& \times\Bigg(\frac{\hat{A}_{i,j'}Z_{i,c'}\exp\big(\sum_{c=1}^{C}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,c}\tilde{X}_{j,c}\big)\big(\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})\big)}{\big(\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})\big)^{2}} \\
& \qquad -\frac{\hat{A}_{i,j'}\exp\big(\sum_{c=1}^{C}\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}Z_{i,c}\tilde{X}_{j,c}\big)\exp\big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c'}\big)}{\big(\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})\big)^{2}}\Bigg) \\
=\ & -\sum_{i=1}^{N}\frac{\hat{A}_{i,j'}Z_{i,c'}\big(\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})\big)-\hat{A}_{i,j'}\exp\big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c'}\big)}{\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})} \\
=\ & -\sum_{i=1}^{N}\hat{A}_{i,j'}\frac{\big(\sum_{c=1,c\neq c'}^{C}Z_{i,c'}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})\big)+(Z_{i,c'}-1)\exp\big(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c'}\big)}{\sum_{c=1}^{C}\exp(\sum_{j\in\mathcal{N}_{i}}\hat{A}_{i,j}\tilde{X}_{j,c})} \\
=\ & -\sum_{i=1}^{N}\hat{A}_{i,j'}\left(Z_{i,c'}\hat{P}(Y_{i}\neq c')+(Z_{i,c'}-1)\hat{P}(Y_{i}=c')\right) \\
=\ & -\sum_{i=1}^{N}\hat{A}_{i,j'}\left(Z_{i,c'}-\hat{P}(Y_{i}=c')\right)
\end{aligned}
\]

Writing the above in matrix form, we have

\[
\frac{d\mathcal{L}}{d\tilde{X}}=\hat{A}^{T}(Y-Z),\qquad \frac{d\mathcal{L}}{dW}=X^{T}\hat{A}^{T}(Y-Z),\qquad \Delta Y'\propto\hat{A}XX^{T}\hat{A}^{T}(Z-Y) \qquad (15)
\]

Appendix C Details of Synthetic Experiments

In our synthetic experiments, we generate graphs with edge homophily levels $h\in\{0.95, 0.90, \dots, 0.05\}$ (step size 0.05) and $h\in\{0.05, 0.045, \dots, 0.005\}$ (step size 0.005). We explore the interval $[0.005, 0.05]$ with the finer step size 0.005 because we empirically find that the performance of GNNs is more sensitive in this area. For a given $h$, we generate intra-class edges from numpy.random.multinomial(2, numpy.ones(399)/399, size=1)[0] (self-loops are not included) and inter-class edges from numpy.random.multinomial(int(2/h - 2), numpy.ones(1600)/1600, size=1)[0].
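The sampling above can be read as a per-node procedure. The following is a minimal sketch of that reading, assuming 5 classes of 400 nodes each (which matches the 399 intra-class and 1600 inter-class candidates); the function and variable names are ours, not from the released code.

```python
# Sketch of the per-node edge sampling described above, assuming 5 classes x 400 nodes
# (399 intra-class candidates and 1600 inter-class candidates per node).
import numpy as np

def sample_edges_for_node(v, label, h, nodes_by_class, rng):
    """Sample the neighbours of node v so that intra-class edges make up a fraction h of its edges."""
    intra = [u for u in nodes_by_class[label] if u != v]                              # 399 candidates
    inter = [u for c, nodes in nodes_by_class.items() if c != label for u in nodes]   # 1600 candidates

    n_intra = rng.multinomial(2, np.ones(len(intra)) / len(intra))                    # 2 intra-class edges
    n_inter = rng.multinomial(int(2 / h - 2), np.ones(len(inter)) / len(inter))       # 2/h - 2 inter-class edges

    edges = [(v, u) for u, k in zip(intra, n_intra) if k > 0 for _ in range(k)]
    edges += [(v, u) for u, k in zip(inter, n_inter) if k > 0 for _ in range(k)]
    return edges

rng = np.random.default_rng(0)
nodes_by_class = {c: list(range(400 * c, 400 * (c + 1))) for c in range(5)}
edges = sample_edges_for_node(v=0, label=0, h=0.2, nodes_by_class=nodes_by_class, rng=rng)
print(len(edges))   # 2 intra-class + 8 inter-class edges for h = 0.2
```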

Appendix D Proof of Theorem 1

Proof.

According to the given assumptions, for node $v$ we have $\hat{A}_{v,k}=\frac{1}{d+1}$; the expected number of intra-class edges is $hd$ (the self-loop introduced by $\hat{A}$ is not counted here, based on the definition of edge homophily and the data generation process) and the expected number of inter-class edges is $(1-h)d$. Suppose there are $C\geq 2$ classes. Consider the matrix $\hat{A}Z$.

Then, we have $\mathbb{E}\left[(\hat{A}Z)_{v,c}\right]=\mathbb{E}\left[\sum_{k\in\mathcal{V}}\hat{A}_{v,k}\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]=\sum_{k\in\mathcal{V}}\frac{\mathbb{E}\left[\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]}{d+1}$, where $\bm{1}$ is the indicator function.

When $v$ is in class $c$, we have $\sum_{k\in\mathcal{V}}\frac{\mathbb{E}\left[\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]}{d+1}=\frac{hd+1}{d+1}$ ($hd$ intra-class edges plus the self-loop introduced by $\hat{A}$).

When $v$ is not in class $c$, we have $\sum_{k\in\mathcal{V}}\frac{\mathbb{E}\left[\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]}{d+1}=\frac{(1-h)d}{(C-1)(d+1)}$ (the $(1-h)d$ inter-class edges are uniformly distributed over the other $C-1$ classes).

For nodes $v,u$, we have $(\hat{A}Z)_{v,:},(\hat{A}Z)_{u,:}\in\mathbb{R}^{C}$, and since the elements $\hat{A}_{v,k}$ and $\hat{A}_{u,k'}$ are independently generated for all $k,k'\in\mathcal{V}$, we have

\[
\begin{aligned}
\mathbb{E}\left[(\hat{A}Z)_{v,c}(\hat{A}Z)_{u,c}\right] &= \mathbb{E}\left[\Big(\sum_{k\in\mathcal{V}}\hat{A}_{v,k}\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\Big)\Big(\sum_{k'\in\mathcal{V}}\hat{A}_{u,k'}\bm{1}_{\{Z_{k',:}=e_{c}^{T}\}}\Big)\right] \\
&= \mathbb{E}\left[\sum_{k\in\mathcal{V}}\hat{A}_{v,k}\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]\mathbb{E}\left[\sum_{k'\in\mathcal{V}}\hat{A}_{u,k'}\bm{1}_{\{Z_{k',:}=e_{c}^{T}\}}\right]
\end{aligned}
\]

Thus,

\[
\begin{aligned}
\mathbb{E}\left[S(\hat{A},Z)_{v,u}\right] &= \mathbb{E}\left[\langle(\hat{A}Z)_{v,:},(\hat{A}Z)_{u,:}\rangle\right]=\sum_{c}\mathbb{E}\left[\sum_{k\in\mathcal{V}}\hat{A}_{v,k}\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\right]\mathbb{E}\left[\sum_{k'\in\mathcal{V}}\hat{A}_{u,k'}\bm{1}_{\{Z_{k',:}=e_{c}^{T}\}}\right] \\
&= \begin{cases}
\left(\frac{hd+1}{d+1}\right)^{2}+\frac{\left((1-h)d\right)^{2}}{(C-1)(d+1)^{2}}, & u,v \text{ are in the same class} \\[4pt]
\frac{2(hd+1)(1-h)d}{(C-1)(d+1)^{2}}+\frac{(C-2)(1-h)^{2}d^{2}}{(C-1)^{2}(d+1)^{2}}, & u,v \text{ are in different classes}
\end{cases}
\end{aligned}
\]

For nodes $u_{1}$, $u_{2}$ and $v$, where $Z_{u_{1},:}=Z_{v,:}$ and $Z_{u_{2},:}\neq Z_{v,:}$,

\[
\begin{aligned}
g(h) &\equiv \mathbb{E}\left[S(\hat{A},Z)_{v,u_{1}}\right]-\mathbb{E}\left[S(\hat{A},Z)_{v,u_{2}}\right] \qquad (16) \\
&= \frac{(C-1)^{2}(hd+1)^{2}+(C-1)\left[(1-h)d\right]^{2}-(C-1)\cdot 2(hd+1)(1-h)d-(C-2)\left[(1-h)d\right]^{2}}{(C-1)^{2}(d+1)^{2}} \\
&= \left(\frac{(C-1)(hd+1)-(1-h)d}{(C-1)(d+1)}\right)^{2}
\end{aligned}
\]

Setting $g(h)=0$, we obtain the optimal $h$:

\[
h=\frac{d+1-C}{Cd} \qquad (17)
\]

For the data generation process in the synthetic experiments, we fix $d_{\text{intra}}$; then $d=d_{\text{intra}}/h$, which is a function of $h$. Replacing $d$ in (17) with $d_{\text{intra}}/h$ leads to

\[
h=\frac{d_{\text{intra}}/h+1-C}{Cd_{\text{intra}}/h} \qquad (18)
\]

It is easy to see that an $h$ satisfying (18) still makes $g(h)=0$ when $d$ in $g(h)$ is replaced by $d_{\text{intra}}/h$. From (18) we obtain the optimal $h$ in terms of $d_{\text{intra}}$:

\[
h=\frac{d_{\text{intra}}}{Cd_{\text{intra}}+C-1}
\]
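As a quick numerical sanity check of (16)-(18), the closed-form $h$ above should zero $g(h)$ once $d$ is replaced by $d_{\text{intra}}/h$. The sketch below assumes, purely for illustration, $C=5$ classes and $d_{\text{intra}}=2$.

```python
# Sanity check of Eqs. (16)-(18), assuming (for illustration only) C = 5 classes and d_intra = 2.
C, d_intra = 5, 2

def g(h, d, C):
    # g(h) from Eq. (16): expected similarity gap between same-class and different-class pairs
    return (((C - 1) * (h * d + 1) - (1 - h) * d) / ((C - 1) * (d + 1))) ** 2

h_opt = d_intra / (C * d_intra + C - 1)   # closed form derived from Eq. (18)
d = d_intra / h_opt                       # total degree implied by fixing d_intra
print(h_opt)                              # ~0.1429
print(g(h_opt, d, C))                     # ~0.0 (the expected similarity gap vanishes)
```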

D.1 An extension of Theorem 1

\[
\begin{aligned}
S_{\text{agg}}\left(S(\hat{A},Z)\right) &= \frac{\left|\left\{v\,\big|\,\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}\,|\,Z_{u,:}=Z_{v,:}\}\big)\geq\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}\,|\,Z_{u,:}\neq Z_{v,:}\}\big)\right\}\right|}{\left|\mathcal{V}\right|} \\
&= \frac{\sum_{v\in\mathcal{V}}\bm{1}_{\left\{\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}=Z_{v,:}\}\big)\geq\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}\neq Z_{v,:}\}\big)\right\}}}{\left|\mathcal{V}\right|}
\end{aligned}
\]

Then,

\[
\begin{aligned}
\mathbb{E}\left(S_{\text{agg}}\left(S(\hat{A},Z)\right)\right) &= \mathbb{E}\left(\frac{\sum_{v\in\mathcal{V}}\bm{1}_{\left\{\mathrm{Mean}_{u}(\{S(\hat{A},Z)_{v,u}|Z_{u,:}=Z_{v,:}\})\geq\mathrm{Mean}_{u}(\{S(\hat{A},Z)_{v,u}|Z_{u,:}\neq Z_{v,:}\})\right\}}}{\left|\mathcal{V}\right|}\right) \\
&= \frac{\sum_{v\in\mathcal{V}}\mathbb{P}\left(\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}=Z_{v,:}\}\big)\geq\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}\neq Z_{v,:}\}\big)\right)}{\left|\mathcal{V}\right|} \\
&= \mathbb{P}\left(\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}=Z_{v,:}\}\big)-\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}|Z_{u,:}\neq Z_{v,:}\}\big)\geq 0\right)
\end{aligned}
\]

Consider the random variable

\[
RV=\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}\,|\,Z_{u,:}=Z_{v,:}\}\big)-\mathrm{Mean}_{u}\big(\{S(\hat{A},Z)_{v,u}\,|\,Z_{u,:}\neq Z_{v,:}\}\big)
\]

Under the conditions in Theorem 1, the expectation of $RV$ is $\mathbb{E}[RV]=g(h)$, as shown in (16). Since the minimum of $g(h)$ is $0$ and $RV$ is symmetrically distributed, we have $\mathbb{P}(RV\geq 0)\geq 0.5$, which explains why $H_{\text{agg}}(\mathcal{G})$ is greater than $0.5$ in many real-world tasks.

Appendix E Proof of Theorem 2

Proof.

Define $W_{v}^{c}=(\hat{A}Z)_{v,c}$. Then,

\[
W_{v}^{c}=\sum_{k\in\mathcal{V}}\hat{A}_{v,k}\bm{1}_{\{Z_{k,:}=e_{c}^{T}\}}\in[0,1],\qquad \sum_{c=1}^{C}W_{v}^{c}=1
\]

Note that

\[
S(I-\hat{A},Z)=(I-\hat{A})ZZ^{T}(I-\hat{A})^{T}=ZZ^{T}+\hat{A}ZZ^{T}\hat{A}^{T}-\hat{A}ZZ^{T}-ZZ^{T}\hat{A}^{T} \qquad (19)
\]

For any node $v$, denote by $c_{v}$ the class that $v$ belongs to. For two nodes $v,u$, if $Z_{v,:}\neq Z_{u,:}$, we have

\[
\begin{aligned}
(ZZ^{T})_{v,u} &= 0 \\
(\hat{A}ZZ^{T}\hat{A}^{T})_{v,u} &= \sum_{c=1}^{C}W_{v}^{c}W_{u}^{c} \\
(\hat{A}ZZ^{T})_{v,u} &= W_{v}^{c_{u}} \\
(ZZ^{T}\hat{A}^{T})_{v,u} &= (\hat{A}ZZ^{T})_{u,v}=W_{u}^{c_{v}}
\end{aligned}
\]

Then, from (19) it follows that

\[
\left(S(I-\hat{A},Z)\right)_{v,u}=\sum_{c=1}^{C}W_{v}^{c}W_{u}^{c}-W_{v}^{c_{u}}-W_{u}^{c_{v}}
\]

When $C=2$,

\[
S(I-\hat{A},Z)_{v,u}=W_{v}^{c_{u}}(W_{u}^{c_{u}}-1)+W_{u}^{c_{v}}(W_{v}^{c_{v}}-1)\leq 0
\]

If $Z_{v,:}=Z_{u,:}$, i.e., $c_{v}=c_{u}$, we have

\[
\begin{aligned}
(ZZ^{T})_{v,u} &= 1 \\
(\hat{A}ZZ^{T}\hat{A}^{T})_{v,u} &= \sum_{c=1}^{C}W_{v}^{c}W_{u}^{c} \\
(\hat{A}ZZ^{T})_{v,u} &= W_{v}^{c_{v}} \\
(ZZ^{T}\hat{A}^{T})_{v,u} &= (\hat{A}ZZ^{T})_{u,v}=W_{u}^{c_{u}}=W_{u}^{c_{v}}
\end{aligned}
\]

Then, from (19) it follows that

\[
\begin{aligned}
S(I-\hat{A},Z)_{v,u} &= 1+\sum_{c=1}^{C}W_{v}^{c}W_{u}^{c}-W_{v}^{c_{v}}-W_{u}^{c_{v}} \\
&= \sum_{c=1,c\neq c_{v}}^{C}W_{v}^{c}W_{u}^{c}+1+W_{v}^{c_{v}}W_{u}^{c_{v}}-W_{v}^{c_{v}}-W_{u}^{c_{v}} \\
&= \sum_{c=1,c\neq c_{v}}^{C}W_{v}^{c}W_{u}^{c}+(1-W_{v}^{c_{v}})(1-W_{u}^{c_{v}})\geq 0
\end{aligned}
\]

Thus, if $C=2$, for any $v\in\mathcal{V}$: if $Z_{u,:}\neq Z_{v,:}$, we have $S(I-\hat{A},Z)_{v,u}\leq 0$; if $Z_{u,:}=Z_{v,:}$, we have $S(I-\hat{A},Z)_{v,u}\geq 0$. Hence the two conditions in (10) are satisfied, so every $v$ is diversification distinguishable and $\mathrm{DD}_{\hat{A},X}(\mathcal{G})=1$. The theorem is proved. ∎
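The sign pattern established above is easy to check numerically. The following sketch is our own illustration, assuming a random row-stochastic $\hat{A}$ with non-negative entries (so that each $W_v^c$ lies in $[0,1]$ and sums to $1$ over $c$, as in the proof).

```python
# Numerical illustration of Theorem 2 for C = 2, using a random row-stochastic A_hat.
import numpy as np

rng = np.random.default_rng(0)
N, C = 8, 2
Z = np.eye(C)[rng.integers(0, C, N)]              # one-hot labels
A = rng.random((N, N))
np.fill_diagonal(A, 1.0)                          # keep a self-loop
A_hat = A / A.sum(axis=1, keepdims=True)          # rows sum to 1, entries in [0, 1]

S = (np.eye(N) - A_hat) @ Z @ Z.T @ (np.eye(N) - A_hat).T   # S(I - A_hat, Z)
same_class = (Z @ Z.T) > 0

print(np.all(S[same_class] >= -1e-12))            # True: same-class entries are non-negative
print(np.all(S[~same_class] <= 1e-12))            # True: different-class entries are non-positive
```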

Appendix F Model Comparison on Synthetic Graphs

[Figure 4: Comparison of test accuracy (mean ± std) of MLP-1, SGC-1 and ACM-SGC-1 on synthetic datasets. Panels: (a) syn-Cora, (b) syn-CiteSeer, (c) syn-PubMed, (d) syn-Chameleon, (e) syn-Squirrel, (f) syn-Film.]
[Figure 5: Comparison of test accuracy (mean ± std) of MLP-2, GCN and ACM-GCN on synthetic datasets. Panels: (a) syn-Cora, (b) syn-CiteSeer, (c) syn-PubMed, (d) syn-Chameleon, (e) syn-Squirrel, (f) syn-Film.]

In order to separate the effects of nonlinearity and graph structure, we compare SGC with 1 hop (SGC-1) against MLP-1 (a linear model). For GCN, which includes nonlinearity, we use MLP-2 as the graph-agnostic baseline model. We train the above GNN models, graph-agnostic baseline models and ACM-GNN models on all synthetic datasets and plot the mean test accuracy with standard deviation on each dataset. From Figure 4 and Figure 5, we can see that at every $H_{\text{agg}}^{M}(\mathcal{G})$ level, ACM-GNNs do not underperform GNNs or graph-agnostic models, whereas when $H_{\text{agg}}^{M}(\mathcal{G})$ is small, GNNs are outperformed by graph-agnostic models by a large margin. This demonstrates the advantage of the ACM framework.

Appendix G Discussion of the Limitations of Diversification Operation

[Figure 6: Example of a case (the area in the black box) where the HP filter does not work well for harmful heterophily.]

From the black-box area of $S(I-\hat{A},X)$ in the example in Figure 6, we can see that nodes in classes 1 and 4 assign non-negative weights to each other, and nodes in classes 2 and 3 do so as well. This is because the surrounding differences of class 1 are similar to those of class 4, and likewise for classes 2 and 3. In real-world applications, when nodes in several small clusters connect to a large cluster, the surrounding differences of the nodes in the small clusters become similar. In such cases, the HP filter is not able to distinguish the nodes from the different small clusters.

Appendix H The Similarity, Homophily and $\mathrm{DD}_{\hat{A},X}(\mathcal{G})$ Metrics and Their Estimations

Additional Metrics

There are three key factors that influence the performance of GNNs in real-world tasks: labels, features and graph structure. The (modified) aggregation homophily investigates how the graph structure influences the performance with labels and features fixed, and this correlation is verified through the synthetic experiments.

Besides graph-label consistency, we also need to consider feature-label consistency and aggregated-feature-label consistency. With the aggregation similarity scores of the features $S_{\text{agg}}\left(S(I,X)\right)$ and of the aggregated features $S_{\text{agg}}\left(S(\hat{A},X)\right)$ listed in this appendix, our methods open up a new perspective on analyzing and comparing the performance of graph-agnostic models and graph-aware models in real-world tasks. Here are two examples.

Example 1: GCN (a graph-aware model) underperforms MLP-2 (a graph-agnostic model) on Cornell, Wisconsin, Texas and Film. Based on the aggregation homophily, the graph structure is not the main cause of the performance degradation. From Table 6 we can see that $S_{\text{agg}}\left(S(\hat{A},X)\right)$ for these four datasets is lower than the corresponding $S_{\text{agg}}\left(S(I,X)\right)$, which implies that it is the aggregated-feature-label inconsistency that causes the performance degradation.

Example 2: According to the widely used analysis based on node or edge homophily, the graph structures of Chameleon and Squirrel are heterophilous and bad for GNNs. In practice, however, GCN outperforms MLP-2 on these two datasets, which means the additional graph information is helpful for node classification instead of being harmful. Traditional homophily metrics fail to explain this phenomenon, but our method can give an explanation from a different angle. For Chameleon, its modified aggregation homophily is not low and its $S_{\text{agg}}\left(S(\hat{A},X)\right)$ is higher than its $S_{\text{agg}}\left(S(I,X)\right)$, which means its graph-label consistency and aggregated-feature-label consistency help the graph-aware model obtain the performance gain. For Squirrel, its modified aggregation homophily is low but its $S_{\text{agg}}\left(S(\hat{A},X)\right)$ is higher than its $S_{\text{agg}}\left(S(I,X)\right)$, which means that although its graph-label consistency is bad, the aggregated-feature-label consistency is the key factor that helps the graph-aware model perform better.

We also need to point out that the (modified) aggregation similarity score, $S_{\text{agg}}\left(S(\hat{A},X)\right)$ and $S_{\text{agg}}\left(S(I,X)\right)$ are not decisive or threshold values, because they do not consider the nonlinear structure in the features. In practice, a low score does not mean that GNN models will definitely perform badly.

Table 8: Additional metrics and their estimations with only training labels (mean ± std)
Cornell Wisconsin Texas Film Chameleon Squirrel Cora CiteSeer PubMed
$H_{\text{agg}}(\mathcal{G})$ 0.9016 0.8884 0.847 0.8411 0.805 0.6783 0.9952 0.9913 0.9716
$S_{\text{agg}}\left(S(\hat{A},X)\right)$ 0.8251 0.7769 0.6557 0.5118 0.8292 0.7216 0.9439 0.9393 0.8623
$S_{\text{agg}}\left(S(I,X)\right)$ 0.9672 0.8287 0.9672 0.5405 0.7931 0.701 0.9103 0.9315 0.8823
$\mathrm{DD}_{\hat{A},X}(\mathcal{G})$ 0.3497 0.6096 0.459 0.3279 0.3109 0.2711 0.2681 0.4124 0.1889
$\hat{H}_{\text{agg}}(\mathcal{G})$ 0.9046 ± 0.0282 0.9147 ± 0.0260 0.8596 ± 0.0299 0.8451 ± 0.0041 0.8041 ± 0.0078 0.6788 ± 0.0077 0.9959 ± 0.0011 0.9907 ± 0.0015 0.9724 ± 0.0015
$\hat{S}_{\text{agg}}\left(S(\hat{A},X)\right)$ 0.8266 ± 0.0526 0.8280 ± 0.0351 0.6835 ± 0.0498 0.5345 ± 0.0421 0.8433 ± 0.0070 0.7352 ± 0.0132 0.9487 ± 0.0023 0.9451 ± 0.0038 0.8626 ± 0.0021
$\hat{S}_{\text{agg}}\left(S(I,X)\right)$ 0.9752 ± 0.0174 0.8680 ± 0.0270 0.9661 ± 0.0336 0.5438 ± 0.0184 0.8257 ± 0.0050 0.7472 ± 0.0089 0.9204 ± 0.0044 0.9441 ± 0.0036 0.8835 ± 0.0019
$\hat{\mathrm{DD}}_{\hat{A},X}(\mathcal{G})$ 0.3936 ± 0.0663 0.6073 ± 0.0436 0.4817 ± 0.0762 0.3300 ± 0.0136 0.3329 ± 0.0151 0.3021 ± 0.0101 0.3198 ± 0.0225 0.4424 ± 0.0136 0.1919 ± 0.0046

In most real-world applications, not all labels are available to calculate the dataset statistics. In this section, we randomly split the data into 60%/20%/20% for training/validation/test and only use the training labels to estimate the statistics. We repeat each estimation 10 times and report the mean with standard deviation. The results are shown in Table 8.

Estimation

The statistics we estimate are $H_{\text{agg}}(\mathcal{G})$, $S_{\text{agg}}\left(S(\hat{A},X)\right)$, $S_{\text{agg}}\left(S(I,X)\right)$ and $\mathrm{DD}_{\hat{A},X}(\mathcal{G})$, and the estimates are denoted $\hat{H}_{\text{agg}}(\mathcal{G})$, $\hat{S}_{\text{agg}}\left(S(\hat{A},X)\right)$, $\hat{S}_{\text{agg}}\left(S(I,X)\right)$ and $\hat{\mathrm{DD}}_{\hat{A},X}(\mathcal{G})$. The two similarity scores $S_{\text{agg}}\left(S(\hat{A},X)\right)$ and $S_{\text{agg}}\left(S(I,X)\right)$ measure the proportion of nodes that, according to the aggregated features and the raw node features respectively, put larger average weights on nodes in the same class than on nodes in other classes. Higher values of $S_{\text{agg}}\left(S(\hat{A},X)\right)$ and $S_{\text{agg}}\left(S(I,X)\right)$ indicate better quality of the aggregated features and the node features.
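For concreteness, the following is a minimal sketch of how such a score can be computed from a feature matrix, labels and a filter matrix ($I$ or $\hat{A}$). It is our own illustration on random toy data with made-up variable names, not the authors' released implementation; whether the same-class mean includes the node itself is a detail we gloss over here.

```python
# Sketch of the aggregation similarity score S_agg(S(M, X)) for M = I (raw features)
# or M = A_hat (aggregated features).
import numpy as np

def aggregation_similarity_score(M, X, labels):
    """Fraction of nodes whose mean similarity to same-class nodes is at least that to other-class nodes."""
    MX = M @ X
    S = MX @ MX.T                                  # similarity matrix S(M, X) = M X X^T M^T
    same = labels[:, None] == labels[None, :]
    count = 0
    for v in range(len(labels)):
        intra = S[v, same[v]].mean()
        inter = S[v, ~same[v]].mean()
        count += int(intra >= inter)
    return count / len(labels)

rng = np.random.default_rng(0)
N, F = 20, 8
X = rng.standard_normal((N, F))
labels = rng.integers(0, 3, N)
A = ((rng.random((N, N)) < 0.2) | np.eye(N, dtype=bool)).astype(float)
A = np.maximum(A, A.T)                             # symmetric adjacency with self-loops
A_hat = A / A.sum(axis=1, keepdims=True)           # simple row-normalized filter

print(aggregation_similarity_score(np.eye(N), X, labels))   # estimate of S_agg(S(I, X))
print(aggregation_similarity_score(A_hat, X, labels))       # estimate of S_agg(S(A_hat, X))
```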

Analysis

From the reported results, we can see that the estimations are accurate and the errors are within an acceptable range, which means the proposed metrics and similarity scores can be accurately estimated with a subset of labels; this is important for real-world applications. Furthermore, we notice some interesting results. For example, the performance of GNNs and MLPs is poor on Squirrel and Film: according to the aggregation homophily values, the graph structure of Film is not particularly harmful compared to other datasets, but its features and aggregated features are much worse than those of the others; the features and aggregated features of Squirrel are not too bad, but its graph topology is more harmful than that of the others. Combining the metrics defined in this paper can help us separate the different factors in graph structure and features and identify what might cause the performance degradation of GNNs.

Appendix I Experiments on Fixed Splits Provided by [33]

See Table 9 for the results and Table 10 for the searched optimal hyperparameters.

Table 9: Experimental results on the fixed splits provided by [33]: average test accuracy ± standard deviation on 9 real-world benchmark datasets. The best results are highlighted. Results of Geom-GCN, H2GCN and GPRGNN are from [33, 40, 25]; the remaining models are run by ourselves, and the hyperparameter search range is the same as in Table 5.
Cornell Wisconsin Texas Film Chameleon Squirrel Cora CiteSeer PubMed Rank
GPRGNN 78.11 ± 6.55 82.55 ± 6.23 81.35 ± 5.32 35.16 ± 0.9 62.59 ± 2.04 46.31 ± 2.46 87.95 ± 1.18 77.13 ± 1.67 87.54 ± 0.38 8.22
H2GCN 82.70 ± 5.28 87.65 ± 4.98 84.86 ± 7.23 35.70 ± 1.00 60.11 ± 2.15 36.48 ± 1.86 87.87 ± 1.20 77.11 ± 1.57 89.49 ± 0.38 6.78
FAGCN 76.76 ± 5.87 79.61 ± 1.58 76.49 ± 2.87 34.82 ± 1.35 46.07 ± 2.11 30.83 ± 0.69 88.05 ± 1.57 77.07 ± 2.05 88.09 ± 1.38 9.56
Geom-GCN* 60.54 ± 3.67 64.51 ± 3.66 66.76 ± 2.72 31.59 ± 1.15 60.00 ± 2.81 38.15 ± 0.92 85.35 ± 1.57 78.02 ± 1.15 89.95 ± 0.47 9.22
ACM-SGC-1 82.43 ± 5.44 86.47 ± 3.77 81.89 ± 4.53 35.49 ± 1.06 63.99 ± 1.66 45.00 ± 1.4 86.9 ± 1.38 76.73 ± 1.59 88.49 ± 0.51 8.44
ACM-SGC-2 82.43 ± 5.44 86.47 ± 3.77 81.89 ± 4.53 36.04 ± 0.83 59.21 ± 2.22 40.02 ± 0.96 87.69 ± 1.07 76.59 ± 1.69 89.01 ± 0.6 8.22
ACM-GCN 85.14 ± 6.07 88.43 ± 3.22 87.84 ± 4.4 36.28 ± 1.09 66.93 ± 1.85 54.4 ± 1.88 87.91 ± 0.95 77.32 ± 1.7 90.00 ± 0.52 2.33
ACM-Snowball-2 85.41 ± 5.43 87.06 ± 2 87.57 ± 4.86 36.89 ± 1.18 67.08 ± 2.04 52.5 ± 1.49 87.42 ± 1.09 76.41 ± 1.38 89.89 ± 0.57 4.11
ACM-Snowball-3 83.24 ± 5.38 86.67 ± 4.37 87.84 ± 3.87 36.82 ± 0.94 66.91 ± 1.73 53.31 ± 1.88 87.1 ± 0.93 75.91 ± 1.57 89.81 ± 0.43 5.22
ACMII-GCN 85.95 ± 5.64 87.45 ± 3.74 86.76 ± 4.75 36.16 ± 1.11 66.91 ± 2.55 51.85 ± 1.38 88.01 ± 1.08 77.15 ± 1.45 89.89 ± 0.43 3.22
ACMII-Snowball-2 85.68 ± 5.93 87.45 ± 2.8 86.76 ± 4.43 36.55 ± 1.24 66.49 ± 1.75 50.15 ± 1.4 87.57 ± 0.86 76.92 ± 1.45 89.84 ± 0.48 4.67
ACMII-Snowball-3 82.7 ± 4.86 85.29 ± 4.23 85.41 ± 6.42 36.49 ± 1.41 66.86 ± 1.74 48.87 ± 1.23 87.16 ± 1.01 76.18 ± 1.55 89.73 ± 0.52 7.00
Table 10: Hyperparameters for FAGCN and ACM-GNNs on fixed splits
Datasets Models\Hyperparameters lr weight_decay dropout hidden results std average epoch time/average total time
Cornell ACM-SGC-1 0.01 5.00E-06 0 64 82.43 5.44 5.37ms/23.05s
ACM-SGC-2 0.01 5.00E-06 0 64 82.43 5.44 5.93ms/25.66s
ACM-GCN 0.05 5.00E-04 0.5 64 85.14 6.07 8.04ms/1.67s
ACMII-GCN 0.1 1.00E-04 0 64 85.95 5.64 7.83ms/2.66s
FAGCN 0.01 1.00E-04 0.6 64 76.76 5.87 8.80ms/7.67s
ACM-Snowball-2 0.05 5.00E-03 0.3 64 85.41 5.43 11.50ms/2.35s
ACM-Snowball-3 0.05 5.00E-03 0.2 64 83.24 5.38 15.06ms/3.12s
ACMII-Snowball-2 0.1 5.00E-03 0.2 64 85.68 5.93 12.63ms/2.58s
ACMII-Snowball-3 0.05 5.00E-03 0.2 64 82.7 4.86 14.59ms/3.06s
Wisconsin ACM-SGC-1 0.1 5.00E-06 0 64 86.47 3.77 5.07ms/14.07s
ACM-SGC-2 0.1 5.00E-06 0 64 86.47 3.77 5.30ms/16.05s
ACM-GCN 0.05 1.00E-03 0.4 64 88.43 3.22 8.04ms/1.66s
ACMII-GCN 0.01 5.00E-05 0.1 64 87.45 3.74 8.40ms/2.19s
FAGCN 0.01 5.00E-05 0.5 64 79.61 1.59 8.61ms/5.84s
ACM-Snowball-2 0.01 1.00E-03 0.4 64 87.06 2 12.51ms/2.60s
ACM-Snowball-3 0.01 1.00E-02 0.1 64 86.67 4.37 14.92ms/3.15s
ACMII-Snowball-2 0.01 5.00E-04 0.1 64 87.45 2.8 11.96ms/2.63s
ACMII-Snowball-3 0.01 5.00E-03 0.5 64 85.29 4.23 14.87ms/3.10s
Texas ACM-SGC-1 0.01 1.00E-05 0 64 81.89 4.53 5.34ms/19.00s
ACM-SGC-2 0.05 1.00E-05 0 64 81.89 4.53 5.50ms/9.26s
ACM-GCN 0.05 5.00E-04 0.5 64 87.84 4.4 9.62ms/1.99s
ACMII-GCN 0.01 1.00E-03 0.1 64 86.76 4.75 9.98ms/2.22s
FAGCN 0.01 1.00E-05 0 64 76.49 2.87 10.45ms/5.70s
ACM-Snowball-2 0.01 5.00E-03 0.2 64 87.57 4.86 11.56ms/2.45s
ACM-Snowball-3 0.01 5.00E-03 0.2 64 87.84 3.87 15.17ms/3.15s
ACMII-Snowball-2 0.01 1.00E-03 0.2 64 86.76 4.43 11.36ms/2.30
ACMII-Snowball-3 0.01 5.00E-03 0.6 64 85.41 6.42 15.84ms/3.48s
Film ACM-SGC-1 0.05 5.00E-04 0 64 35.49 1.06 5.39ms/1.17s
ACM-SGC-2 0.05 5.00E-04 0.1 64 36.04 0.83 13.22ms/3.31s
ACM-GCN 0.01 5.00E-03 0 64 36.28 1.09 8.96ms/1.82s
ACMII-GCN 0.01 5.00E-03 0 64 36.16 1.11 9.06ms/1.83s
FAGCN 0.01 5.00E-05 0.4 64 34.82 1.35 15.60ms/2.51s
ACM-Snowball-2 0.01 1.00E-02 0 64 36.89 1.18 14.77ms/3.01s
ACM-Snowball-3 0.01 1.00E-02 0.2 64 36.82 0.94 16.57ms/3.36s
ACMII-Snowball-2 0.01 5.00E-03 0.1 64 36.55 1.24 12.76ms/2.57s
ACMII-Snowball-3 0.05 5.00E-03 0.3 64 36.49 1.41 16.51ms/3.49s
Chameleon ACM-SGC-1 0.1 5.00E-06 0.9 64 63.99 1.66 5.92ms/1.74s
ACM-SGC-2 0.1 0.00E+00 0.9 64 59.21 2.22 8.84ms/1.78s
ACM-GCN 0.05 5.00E-05 0.7 64 66.93 1.85 8.40ms/1.71s
ACMII-GCN 0.05 5.00E-06 0.8 64 66.91 2.55 8.90ms/2.10s
FAGCN 0.01 5.00E-05 0 64 46.07 2.11 16.90ms/7.94s
ACM-Snowball-2 0.01 1.00E-04 0.7 64 67.08 2.04 12.50ms/2.69s
ACM-Snowball-3 0.01 1.00E-05 0.8 64 66.91 1.73 16.12ms/4.91s
ACMII-Snowball-2 0.01 5.00E-05 0.8 64 66.49 1.75 12.65ms/3.42s
ACMII-Snowball-3 0.05 5.00E-05 0.7 64 66.86 1.74 17.60ms/4.06s
Squirrel ACM-SGC-1 0.05 5.00E-06 0.9 64 45 1.4 6.10ms/2.18s
ACM-SGC-2 0.05 0.00E+00 0.9 64 40.02 0.96 35.75ms/9.62s
ACM-GCN 0.05 5.00E-06 0.7 64 54.4 1.88 10.48ms/2.68s
ACMII-GCN 0.05 5.00E-06 0.7 64 51.85 1.38 11.69ms/2.91s
FAGCN 0 5.00E-03 0 64 30.86 0.69 10.90ms/13.91s
ACM-Snowball-2 0.01 1.00E-04 0.7 64 52.5 1.49 17.89ms/5.78s
ACM-Snowball-3 0.01 5.00E-05 0.7 64 53.31 1.88 22.60ms/7.53s
ACMII-Snowball-2 0.05 5.00E-05 0.6 64 50.15 1.4 16.95ms/3.45s
ACMII-Snowball-3 0.01 5.00E-04 0.6 64 48.87 1.23 23.52ms/4.94s
Cora ACM-SGC-1 0.05 5.00E-05 0.7 64 86.9 1.38 4.99ms/2.40s
ACM-SGC-2 0.1 0 0.8 64 87.69 1.07 5.16ms/1.16s
ACM-GCN 0.01 5.00E-05 0.6 64 87.91 0.95 8.41ms/1.84s
ACMII-GCN 0.01 1.00E-04 0.6 64 88.01 1.08 8.59ms/1.96s
FAGCN 0.02 1.00E-04 0.5 64 88.05 1.57 9.30ms/10.64s
ACM-Snowball-2 0.01 1.00E-03 0.5 64 87.42 1.09 12.54ms/2.72s
ACM-Snowball-3 0.01 5.00E-06 0.9 64 87.1 0.93 15.83ms/11.33s
ACMII-Snowball-2 0.01 1.00E-03 0.6 64 87.57 0.86 12.06ms/2.64s
ACMII-Snowball-3 0.01 5.00E-03 0.5 64 87.16 1.01 16.29ms/3.62s
CiteSeer ACM-SGC-1 0.05 0.00E+00 0.7 64 76.73 1.59 5.24ms/1.14s
ACM-SGC-2 0.1 0.00E+00 0.8 64 76.59 1.69 5.14ms/1.03s
ACM-GCN 0.01 5.00E-06 0.3 64 77.32 1.7 8.89ms/1.79s
ACMII-GCN 0.01 5.00E-05 0.5 64 77.15 1.45 8.95ms/1.80s
FAGCN 0.02 5.00E-05 0.4 64 77.07 2.05 10.05ms/5.69s
ACM-Snowball-2 0.01 5.00E-05 0 64 76.41 1.38 12.87ms/2.59s
ACM-Snowball-3 0.01 5.00E-06 0.9 64 75.91 1.57 17.40ms/11.92s
ACMII-Snowball-2 0.01 5.00E-03 0.5 64 76.92 1.45 13.10ms/2.94s
ACMII-Snowball-3 0.1 5.00E-05 0.9 64 76.18 1.55 17.47ms/5.88s
PubMed ACM-SGC-1 0.05 5.00E-06 0.4 64 88.49 0.51 5.77ms/3.65s
ACM-SGC-2 0.05 5.00E-06 0.3 64 89.01 0.6 8.50ms/5.18s
ACM-GCN 0.01 5.00E-05 0.4 64 90 0.52 8.99ms/2.51s
ACMII-GCN 0.01 1.00E-04 0.3 64 89.89 0.43 9.70ms/2.57s
FAGCN 0.01 1.00E-04 0 64 88.09 1.38 10.30ms/8.75s
ACM-Snowball-2 0.01 1.00E-03 0.3 64 89.89 0.57 15.05ms/3.11s
ACM-Snowball-3 0.01 5.00E-03 0.1 64 89.81 0.43 20.51ms/4.63s
ACMII-Snowball-2 0.01 5.00E-04 0.4 64 89.84 0.48 15.10ms/3.2s
ACMII-Snowball-3 0.01 1.00E-03 0.4 64 89.73 0.52 20.46ms/4.32s

Appendix J A Detailed Explanation of the Differences Between ACM(II)-GNNs and Several SOTA Models

  • Difference with GPRGNN [7]: To explain the difference between the channel mixing mechanism and the learning mechanism in GPRGNN, we first rewrite GPRGNN as $\mathbf{Z}=\sum_{k=0}^{K}\gamma_{k}\mathbf{H}^{(k)}=\sum_{k=0}^{K}\gamma_{k}I\mathbf{H}^{(k)}=\sum_{k=0}^{K}\mathrm{diag}(\gamma_{k},\gamma_{k},\dots,\gamma_{k})\mathbf{H}^{(k)}$. The node-wise channel mixing mechanism in GPRGNN form is $\mathbf{Z}=\sum_{k=0}^{K}\mathrm{diag}(\gamma_{k}^{1},\gamma_{k}^{2},\dots,\gamma_{k}^{N})\mathbf{H}^{(k)}$, where $N$ is the number of nodes and $\gamma_{k}^{i},\ i=1,\dots,N$ are learnable parametric mixing weights.

  • Difference with FAGCN [4]: Instead of using a fixed $\hat{A}$, FAGCN learns a new filter $\hat{A}'$ based on $\hat{A}$, and $\hat{A}'$ can be decomposed as $\hat{A}'=\hat{A}_{1}'-\hat{A}_{2}'$, where $\hat{A}_{1}'$ and $-\hat{A}_{2}'$ represent positive and negative edge (propagation) information, respectively. In our paper, we are not discussing the advantages of the learned filter $\hat{A}'$ over the fixed filter $\hat{A}$; we are comparing models with and without the channel mixing mechanism. We believe the empirical results on real-world tasks in Table 2 and Table 9 are the best way to compare models with a fixed filter and node-wise channel mixing against models with a learned filter but without channel mixing.