Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?

Sitao Luan1,2, Chenqing Hua1, Qincheng Lu1, Jiaqi Zhu1, Mingde Zhao1,2, Shuyuan Zhang1,2,

Xiao-Wen Chang1, Doina Precup1,2,3

{sitao.luan@mail, chenqing.hua@mail, qincheng.lu@mail, jiaqi.zhu@mail, mingde.zhao@mail,
shuyuan.zhang@mail, chang@cs, dprecup@cs}.mcgill.ca
1McGill University; 2Mila; 3DeepMind

Abstract

Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by using graph structures based on the relational inductive bias (homophily assumption). Though GNNs are believed to outperform NNs in real-world tasks, the performance advantages of GNNs over graph-agnostic NNs are often unsatisfactory. Heterophily has been considered a main cause, and numerous works have been put forward to address it. In this paper, we first show that not all cases of heterophily are harmful¹ for GNNs with aggregation operation. Then, we propose new metrics based on a similarity matrix which considers the influence of both graph structure and input features on GNNs. The metrics demonstrate advantages over the commonly used homophily metrics in tests on synthetic graphs. From the metrics and the observations, we find that some cases of harmful heterophily can be addressed by diversification operation. With this fact and knowledge of filterbanks, we propose the Adaptive Channel Mixing (ACM) framework to adaptively exploit aggregation, diversification and identity channels in each GNN layer to address harmful heterophily. We validate the ACM-augmented baselines on 10 real-world node classification tasks. They consistently achieve significant performance gains and exceed the state-of-the-art GNNs on most of the tasks without incurring significant computational burden.

¹ In general, harmful heterophily means a heterophilous structure that makes a graph-aware model underperform its corresponding graph-agnostic model.

1 Introduction

Deep Neural Networks (NNs) [21] have revolutionized many machine learning areas, including image recognition [20], speech recognition [13] and natural language processing [2]. One major strength is their capacity and effectiveness in learning latent representations from Euclidean data. Recently, the focus has shifted to their applications on non-Euclidean data [5], e.g., relational data or graphs. Combining graph signal processing and convolutional neural networks [22], numerous Graph Neural Networks (GNNs) [34, 9, 15, 35, 18, 28] have been proposed which empirically outperform traditional neural networks on graph-based machine learning tasks, e.g., node classification, graph classification, link prediction and graph generation. GNNs are built on the homophily assumption [31], i.e., connected nodes tend to share similar attributes with each other [14], which offers additional information besides the node features. This relational inductive bias [3] is believed to be a key factor in GNNs' superior performance over NNs in many tasks.

Nevertheless, growing evidence shows that GNNs do not always gain advantages over traditional NNs when dealing with relational data. In some cases, even simple Multi-Layer Perceptrons (MLPs) can outperform GNNs by a large margin [40, 26, 7]. An important reason for the performance degradation is believed to be the heterophily problem, i.e., connected nodes tend to have different labels, which makes the homophily assumption fail. The heterophily challenge has received attention recently, and an increasing number of models have been put forward to address this problem [40, 26, 7, 39, 38].

Contributions

In this paper, we first demonstrate that not all heterophilous graphs are harmful for aggregation-based GNNs, and that the existing metrics of homophily are insufficient to decide whether the aggregation operation will make nodes less distinguishable. By constructing a similarity matrix from a backpropagation analysis, we derive new metrics that depict how much GNNs are influenced by the graph structure and node features. We show the advantage of our metrics over the existing metrics by comparing their ability to characterize the performance of two baseline GNNs on synthetic graphs with different levels of homophily. From the similarity matrix, we find that the diversification operation is able to address some harmful heterophily cases, based on which we propose the Adaptive Channel Mixing (ACM) GNN framework. Experiments on synthetic datasets, ablation studies and real-world datasets consistently show that baseline GNNs augmented by the ACM framework obtain significant performance boosts on node classification tasks on heterophilous graphs.

The rest of this paper is organized as follows: In section 2, we introduce the notation and background knowledge. In section 3, we conduct a node-wise analysis of heterophily, derive new homophily metrics based on a similarity matrix, and conduct experiments to show their advantages over the existing homophily metrics. In section 4, we demonstrate the capability of the diversification operation to address some cases of harmful heterophily and propose the ACM-GNN framework to adaptively utilize the information from different filterbank channels to address the heterophily problem. In section 5, we discuss related works and clarify the differences from our method. In section 6, we provide empirical evaluations of the ACM framework, including an ablation study and tests on 10 real-world node classification tasks.

2 Preliminaries

We will introduce the related notation and background knowledge in this section. We use bold fonts for vectors (e.g., $\bm{v}$). Suppose we have an undirected connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E},A)$, where $\mathcal{V}$ is the node set with $\left|\mathcal{V}\right|=N$; $\mathcal{E}$ is the edge set without self-loops; $A\in\mathbb{R}^{N\times N}$ is the symmetric adjacency matrix with $A_{i,j}=1$ iff $e_{ij}\in\mathcal{E}$, otherwise $A_{i,j}=0$. We use $D$ to denote the diagonal degree matrix of $\mathcal{G}$, i.e., $D_{i,i}=d_{i}=\sum_{j}A_{i,j}$, and use $\mathcal{N}_{i}$ to denote the neighborhood set of node $i$, i.e., $\mathcal{N}_{i}=\{j:e_{ij}\in\mathcal{E}\}$. A graph signal is a vector $\bm{x}\in\mathbb{R}^{N}$ defined on $\mathcal{V}$, where $\bm{x}_{i}$ is defined on node $i$. We also have a feature matrix $X\in\mathbb{R}^{N\times F}$, whose columns are graph signals and whose $i$-th row $X_{i,:}$ is the feature vector of node $i$. We use $Z\in\mathbb{R}^{N\times C}$ to denote the label encoding matrix, whose $i$-th row $Z_{i,:}$ is the one-hot encoding of the label of node $i$.

2.1 Graph Laplacian, Affinity Matrix and Their Variants

The (combinatorial) graph Laplacian is defined as $L=D-A$, which is Symmetric Positive Semi-Definite (SPSD) [8]. Its eigendecomposition gives $L=U\Lambda U^{T}$, where the columns $\bm{u}_{i}$ of $U\in\mathbb{R}^{N\times N}$ are orthonormal eigenvectors, namely the graph Fourier basis, and $\Lambda=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{N})$ with $\lambda_{1}\leq\cdots\leq\lambda_{N}$; these eigenvalues are also called frequencies. The graph Fourier transform of the graph signal $\bm{x}$ is defined as $\bm{x}_{\mathcal{F}}=U^{-1}\bm{x}=U^{T}\bm{x}=[\bm{u}_{1}^{T}\bm{x},\ldots,\bm{u}_{N}^{T}\bm{x}]^{T}$, where $\bm{u}_{i}^{T}\bm{x}$ is the component of $\bm{x}$ in the direction of $\bm{u}_{i}$.

In addition to $L$, some variants are also commonly used, e.g., the symmetric normalized Laplacian $L_{\text{sym}}=D^{-1/2}LD^{-1/2}=I-D^{-1/2}AD^{-1/2}$ and the random walk normalized Laplacian $L_{\text{rw}}=D^{-1}L=I-D^{-1}A$. The affinity (transition) matrices can be derived from the Laplacians, e.g., $A_{\text{rw}}=I-L_{\text{rw}}=D^{-1}A$ and $A_{\text{sym}}=I-L_{\text{sym}}=D^{-1/2}AD^{-1/2}$, and are considered to be low-pass filters [30]. Their eigenvalues satisfy $\lambda_{i}(A_{\text{rw}})=\lambda_{i}(A_{\text{sym}})=1-\lambda_{i}(L_{\text{sym}})=1-\lambda_{i}(L_{\text{rw}})\in(-1,1]$. Applying the renormalization trick [18] to the affinity and Laplacian matrices leads to $\hat{A}_{\text{sym}}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ and $\hat{L}_{\text{sym}}=I-\hat{A}_{\text{sym}}$, where $\tilde{A}\equiv A+I$ and $\tilde{D}\equiv D+I$. The renormalized affinity matrix essentially adds a self-loop to each node in the graph and is widely used in the Graph Convolutional Network (GCN) [18] as follows,

$Y=\text{softmax}(\hat{A}_{\text{sym}}\,\text{ReLU}(\hat{A}_{\text{sym}}XW_{0})\,W_{1})$  (1)

where $W_{0}\in\mathbb{R}^{F\times F_{1}}$ and $W_{1}\in\mathbb{R}^{F_{1}\times O}$ are learnable parameter matrices. GCN can be trained by minimizing the following cross-entropy loss

$\mathcal{L}=-\mathrm{trace}(Z^{T}\log Y)$  (2)

where $\log(\cdot)$ is a component-wise logarithm operation. The random walk renormalized matrix $\hat{A}_{\text{rw}}=\tilde{D}^{-1}\tilde{A}$, which shares the same eigenvalues as $\hat{A}_{\text{sym}}$, can also be applied in GCN. The corresponding Laplacian is defined as $\hat{L}_{\text{rw}}=I-\hat{A}_{\text{rw}}$. $\hat{A}_{\text{rw}}$ is essentially a random walk matrix and behaves as the mean aggregator applied in spatial-based GNNs [15, 14]. To bridge the spectral and spatial methods, we use $\hat{A}_{\text{rw}}$ in this paper.
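To make the preceding definitions concrete, here is a minimal NumPy sketch (our own illustration; the weight matrices `W0` and `W1` stand for the learnable parameters and are passed in rather than trained) that builds $\hat{A}_{\text{rw}}$ and evaluates the forward pass of equation (1) with a generic aggregation operator:

```python
import numpy as np

def renormalized_affinity_rw(A):
    """A_hat_rw = (D + I)^{-1}(A + I): row-normalized adjacency with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer GCN forward pass of equation (1) with aggregation operator A_hat."""
    H = np.maximum(A_hat @ X @ W0, 0.0)                    # ReLU(A_hat X W0)
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                # row-wise softmax
```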

2.2 Metrics of Homophily

The metrics of homophily are defined by considering different relations between node labels and the graph structure defined by the adjacency matrix. There are three commonly used homophily metrics: edge homophily [1, 40], node homophily [33], and class homophily [24]², defined as follows:

² The authors in [24] did not name this homophily metric. We name it class homophily based on its definition.

$H_{\text{edge}}(\mathcal{G})=\frac{\big{|}\{e_{uv}\mid e_{uv}\in\mathcal{E},\,Z_{u,:}=Z_{v,:}\}\big{|}}{|\mathcal{E}|},\quad H_{\text{node}}(\mathcal{G})=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\frac{\big{|}\{u\mid u\in\mathcal{N}_{v},\,Z_{u,:}=Z_{v,:}\}\big{|}}{d_{v}},$  (3)

$H_{\text{class}}(\mathcal{G})=\frac{1}{C-1}\sum_{k=1}^{C}\left[h_{k}-\frac{\big{|}\{v\mid Z_{v,k}=1\}\big{|}}{N}\right]_{+},\quad h_{k}=\frac{\sum_{v\in\mathcal{V}}\big{|}\{u\mid Z_{v,k}=1,\,u\in\mathcal{N}_{v},\,Z_{u,:}=Z_{v,:}\}\big{|}}{\sum_{v\in\{v\mid Z_{v,k}=1\}}d_{v}}$

where $[a]_{+}=\max(a,0)$ and $h_{k}$ is the class-wise homophily metric [24]. They are all in the range $[0,1]$; a value close to $1$ corresponds to strong homophily while a value close to $0$ indicates strong heterophily. $H_{\text{edge}}(\mathcal{G})$ measures the proportion of edges that connect two nodes in the same class; $H_{\text{node}}(\mathcal{G})$ evaluates the average proportion of edge-label consistency over all nodes; $H_{\text{class}}(\mathcal{G})$ tries to avoid sensitivity to imbalanced classes, which can make $H_{\text{edge}}$ misleadingly large. The above definitions are all based on graph-label consistency and imply that inconsistency is harmful to the performance of GNNs. With this in mind, we will show a counterexample to illustrate the insufficiency of the above metrics and propose new metrics in the following section.
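For reference, here is a small NumPy sketch (function names are ours, for illustration) that computes the edge and node homophily of (3) from an undirected edge list and integer node labels:

```python
import numpy as np

def edge_homophily(edges, y):
    """H_edge of (3): fraction of edges whose endpoints share a label."""
    edges = np.asarray(edges)                  # shape (|E|, 2)
    return float(np.mean(y[edges[:, 0]] == y[edges[:, 1]]))

def node_homophily(edges, y, n):
    """H_node of (3): average per-node fraction of same-label neighbors."""
    same = np.zeros(n)
    deg = np.zeros(n)
    for u, v in np.asarray(edges):             # count each undirected edge twice
        for a, b in ((u, v), (v, u)):
            deg[a] += 1
            same[a] += float(y[a] == y[b])
    mask = deg > 0                             # skip isolated nodes
    return float(np.mean(same[mask] / deg[mask]))
```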

3 Analysis of Heterophily

3.1 Motivation and Aggregation Homophily

Figure 1: Example of harmless heterophily.

Heterophily is believed to be harmful for message-passing based GNNs [40, 33, 7] because, intuitively, features of nodes in different classes will be falsely mixed, making nodes indistinguishable [40]. Nevertheless, this is not always the case: e.g., the bipartite graph shown in Figure 1 is highly heterophilous according to the homophily metrics in (3), but after mean aggregation the nodes in classes 1 and 2 only exchange colors and are still distinguishable. The authors of [7] also point out the insufficiency of $H_{\text{node}}$ with examples showing that different graph topologies with the same $H_{\text{node}}$ can carry different label information.
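To see this numerically, here is a toy sketch (our own construction, assuming a complete bipartite graph between two classes with one-hot class features, mimicking the structure of Figure 1): although the edge homophily is 0, mean aggregation merely exchanges the two feature patterns, so the classes remain perfectly separable.

```python
import numpy as np

# 6 nodes: 0-2 in class 1, 3-5 in class 2; edges only across classes (H_edge = 0).
A = np.zeros((6, 6))
A[:3, 3:] = 1.0
A[3:, :3] = 1.0
X = np.vstack([np.tile([1.0, 0.0], (3, 1)),   # class-1 feature pattern
               np.tile([0.0, 1.0], (3, 1))])  # class-2 feature pattern

A_rw = A / A.sum(axis=1, keepdims=True)       # mean aggregator D^{-1} A
print(A_rw @ X)  # class-1 rows become [0, 1] and class-2 rows become [1, 0]:
                 # the patterns are swapped but still perfectly separable
```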

To analyze to what extent the graph structure can affect the output of a GNN, we first simplify the GCN by removing its nonlinearity, as in [36]. Let $\hat{A}\in\mathbb{R}^{N\times N}$ denote a general aggregation operator. Then, equation (1) can be simplified as

$Y=\text{softmax}(\hat{A}XW)=\text{softmax}(Y^{\prime})$  (4)

After each gradient descent step $\Delta W=\gamma\frac{d\mathcal{L}}{dW}$, where $\gamma$ is the learning rate, the update of $Y^{\prime}$ will be (see Appendix B for the derivation)

$\Delta Y^{\prime}=\hat{A}X\Delta W=\gamma\hat{A}X\frac{d\mathcal{L}}{dW}\propto\hat{A}X\frac{d\mathcal{L}}{dW}=\hat{A}XX^{T}\hat{A}^{T}(Z-Y)=S(\hat{A},X)(Z-Y)$  (5)

where $S(\hat{A},X)\equiv\hat{A}X(\hat{A}X)^{T}$ is a post-aggregation node similarity matrix and $Z-Y$ is the prediction error matrix. The update direction of node $i$ is essentially a weighted sum of the prediction errors, i.e., $\Delta(Y^{\prime})_{i,:}=\sum_{j\in\mathcal{V}}\left[S(\hat{A},X)\right]_{i,j}(Z-Y)_{j,:}$.
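In code, the post-aggregation similarity matrix of equation (5) is a one-liner; a minimal sketch for later reuse:

```python
import numpy as np

def post_aggregation_similarity(A_hat, X):
    """S(A_hat, X) = (A_hat X)(A_hat X)^T from equation (5); entry (v, u) is the
    inner product of the aggregated features of nodes v and u."""
    AX = A_hat @ X
    return AX @ AX.T
```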

To study the effect of heterophily, we first define the aggregation similarity score as follows.

Definition 1. Aggregation similarity score

$S_{\text{agg}}\left(S(\hat{A},X)\right)=\frac{\left|\left\{v\,\big{|}\,\mathrm{Mean}_{u}\big{(}\{S(\hat{A},X)_{v,u}\mid Z_{u,:}=Z_{v,:}\}\big{)}\geq\mathrm{Mean}_{u}\big{(}\{S(\hat{A},X)_{v,u}\mid Z_{u,:}\neq Z_{v,:}\}\big{)}\right\}\right|}{\left|\mathcal{V}\right|}$  (6)

where $\mathrm{Mean}_{u}\left(\{\cdot\}\right)$ takes the average over $u$ of a given multiset of values or variables.



$S_{\text{agg}}(S(\hat{A},X))$ measures the proportion of nodes $v\in\mathcal{V}$ that put relatively larger similarity weights on nodes in the same class than on nodes in other classes after aggregation. It is easy to see that $S_{\text{agg}}(S(\hat{A},X))\in[0,1]$. In practice, however, we observe that $S_{\text{agg}}(S(\hat{A},X))\geq 0.5$ in most datasets. Based on this observation, we rescale (6) to the following modified aggregation similarity for practical usage,

$S^{M}_{\text{agg}}\left(S(\hat{A},X)\right)=\left[2S_{\text{agg}}\left(S(\hat{A},X)\right)-1\right]_{+}$  (7)

In order to measure the consistency between labels and graph structure without considering node features, and to make a fair comparison with the existing homophily metrics in (3), we define the graph ($\mathcal{G}$) aggregation ($\hat{A}$) homophily and its modified version as

$H_{\text{agg}}(\mathcal{G})=S_{\text{agg}}\left(S(\hat{A},Z)\right),\quad H_{\text{agg}}^{M}(\mathcal{G})=S_{\text{agg}}^{M}\left(S(\hat{A},Z)\right)$  (8)

In practice, we only check $H_{\text{agg}}(\mathcal{G})$ when $H_{\text{agg}}^{M}(\mathcal{G})=0$. As Figure 1 shows, when $\hat{A}=\hat{A}_{\text{rw}}$ we have $H_{\text{agg}}(\mathcal{G})=H_{\text{agg}}^{M}(\mathcal{G})=1$. Thus, the new metric reflects the fact that nodes in classes 1 and 2 are still highly distinguishable after aggregation, while the other metrics mentioned before fail to capture this information and misleadingly give the value 0. This shows the advantage of $H_{\text{agg}}(\mathcal{G})$ and $H_{\text{agg}}^{M}(\mathcal{G})$, which additionally consider information from the aggregation operator $\hat{A}$ and the similarity matrix.
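Equations (6)-(8) translate directly into a short NumPy sketch (our own illustrative implementation, assuming at least two classes; `S` is the similarity matrix from equation (5) and `y` holds integer class labels):

```python
import numpy as np

def aggregation_similarity_score(S, y):
    """S_agg of eq. (6): fraction of nodes whose mean similarity to same-class
    nodes is at least their mean similarity to other-class nodes."""
    hits = 0
    for v in range(len(y)):
        same = y == y[v]
        if S[v, same].mean() >= S[v, ~same].mean():
            hits += 1
    return hits / len(y)

def modified_aggregation_similarity(S, y):
    """S_agg^M of eq. (7): the rescaled score [2 * S_agg - 1]_+."""
    return max(2.0 * aggregation_similarity_score(S, y) - 1.0, 0.0)
```

For $H_{\text{agg}}(\mathcal{G})$ and $H_{\text{agg}}^{M}(\mathcal{G})$, one passes $S(\hat{A},Z)$, i.e., the similarity matrix computed with the one-hot label matrix $Z$ in place of $X$.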

To comprehensively compare $H_{\text{agg}}^{M}(\mathcal{G})$ with the metrics in (3) in terms of how they reveal the influence of the graph structure on GNN performance, we generate synthetic graphs and evaluate SGC [36] and GCN [18] on them in the next subsection.

3.2 Evaluation and Comparison on Synthetic Graphs

Data Generation & Experimental Setup

We first generate 280 graphs in total, with 28 edge homophily levels varying from 0.005 to 0.95, each corresponding to 10 graphs. Every generated graph has 5 classes with 400 nodes in each class. For each node, we randomly generate 2 intra-class edges and $[\frac{2}{h}-2]$ inter-class edges (see the details of the data generation process in Appendix C). The features of nodes in each class are sampled from node features in the corresponding class of the base dataset. Nodes are randomly split into 60%/20%/20% for train/validation/test. We train 1-hop SGC (SGC-1) [36] and GCN [18] on the synthetic data (see Appendix A.1 for the hyperparameter search range). For each value of $H_{\text{edge}}(\mathcal{G})$, we report the average test accuracy and standard deviation over the 10 generated graphs. For each generated graph, we also calculate its $H_{\text{node}}(\mathcal{G})$, $H_{\text{class}}(\mathcal{G})$ and $H_{\text{agg}}^{M}(\mathcal{G})$. Model performance with respect to the different homophily values is shown in Figure 2.
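A hedged sketch of this generation scheme (our reading of the description above; the exact procedure is given in Appendix C of the paper and is not reproduced in this excerpt):

```python
import numpy as np

def synthetic_graph(h, num_classes=5, per_class=400, seed=0):
    """Random graph with approximate edge homophily h (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = num_classes * per_class
    y = np.repeat(np.arange(num_classes), per_class)
    n_inter = max(int(round(2 / h - 2)), 0)              # inter-class edges per node
    edges = set()
    for v in range(n):
        same = np.flatnonzero(y == y[v])
        diff = np.flatnonzero(y != y[v])
        for u in rng.choice(same, size=2, replace=False):  # 2 intra-class edges
            if u != v:
                edges.add((min(u, v), max(u, v)))
        for u in rng.choice(diff, size=n_inter, replace=False):
            edges.add((min(u, v), max(u, v)))            # ~(2/h - 2) inter-class edges
    return np.array(sorted(edges)), y
```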

Figure 2: Comparison of baseline performance under different homophily metrics. Panels: (a) $H_{\text{edge}}(\mathcal{G})$; (b) $H_{\text{node}}(\mathcal{G})$; (c) $H_{\text{class}}(\mathcal{G})$; (d) $H_{\text{agg}}^{M}(\mathcal{G})$.

Comparison of Homophily Metrics

The performance of SGC-1 and GCN is expected to increase monotonically with a proper and informative homophily metric. However, Figures 2(a), (b) and (c) show that the performance curves under $H_{\text{edge}}(\mathcal{G})$, $H_{\text{node}}(\mathcal{G})$ and $H_{\text{class}}(\mathcal{G})$ are $U$-shaped³, while Figure 2(d) reveals a nearly monotonic curve with a little perturbation around 1. This indicates that $H_{\text{agg}}^{M}(\mathcal{G})$ describes how the graph structure affects the performance of SGC-1 and GCN more appropriately and adequately than the existing metrics.

³ A similar J-shaped curve is found in [40], though with a different data generation process; that work does not mention the insufficiency of edge homophily.

In addition, we notice in Figure 2(a) that both SGC-1 and GCN perform worst on all datasets when $H_{\text{edge}}(\mathcal{G})$ is somewhere between 0.1 and 0.2. This interesting phenomenon can be explained by the following theorem based on the similarity matrix, which verifies the usefulness of $H_{\text{agg}}^{M}(\mathcal{G})$.

Theorem 1 (see Appendix D for the proof). Suppose there are $C$ classes in the graph $\mathcal{G}$ and $\mathcal{G}$ is a $d$-regular graph (each node has $d$ neighbors). Given $d$, the edges of each node are generated i.i.d., such that each edge of any node has probability $h$ of connecting to a node in the same class and probability $1-h$ of connecting to a node in a different class. Let the aggregation operator be $\hat{A}=\hat{A}_{\text{rw}}$. Then, for nodes $v$, $u_{1}$ and $u_{2}$, where $Z_{u_{1},:}=Z_{v,:}$ and $Z_{u_{2},:}\neq Z_{v,:}$, we have

$g(h)\equiv\mathbb{E}\left(S(\hat{A},Z)_{v,u_{1}}\right)-\mathbb{E}\left(S(\hat{A},Z)_{v,u_{2}}\right)=\left(\frac{(C-1)(hd+1)-(1-h)d}{(C-1)(d+1)}\right)^{2}$  (9)

and the minimum of $g(h)$ is reached at

$h=\frac{d+1-C}{Cd}=\frac{d_{\text{intra}}/h+1-C}{C(d_{\text{intra}}/h)}\ \Rightarrow\ h=\frac{d_{\text{intra}}}{Cd_{\text{intra}}+C-1}$

where $d_{\text{intra}}=dh$ is the expected number of neighbors of a node that have the same label as the node.

The value of $g(h)$ in (9) is the expected difference between the similarity values of nodes in the same class as $v$ and nodes in other classes. $g(h)$ is strongly related to the definition of aggregation homophily, and its minimum potentially implies the worst value of $H_{\text{agg}}(\mathcal{G})$. In the synthetic experiments, we have $d_{\text{intra}}=2$, $C=5$, and the minimum of $g(h)$ is reached at $h=1/7\approx 0.14$, which corresponds to the lowest point of the performance curves in Figure 2(a). In other words, the $h$ where SGC-1 and GCN perform worst is where $g(h)$ attains its smallest value, not the point with the smallest edge homophily value $h=0$. This again reveals the advantage of $H_{\text{agg}}(\mathcal{G})$ over $H_{\text{edge}}(\mathcal{G})$, obtained by making use of the similarity matrix.
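The location of this minimum is easy to verify numerically; a small sketch evaluating $g(h)$ of equation (9) under the synthetic setting, where the intra-class degree is held fixed ($d_{\text{intra}}=hd=2$, $C=5$, so $d=2/h$):

```python
import numpy as np

C, d_intra = 5, 2.0
h = np.linspace(0.01, 0.95, 2000)
d = d_intra / h                                # intra-class degree fixed at 2
g = (((C - 1) * (h * d + 1) - (1 - h) * d) / ((C - 1) * (d + 1))) ** 2  # eq. (9)
print(h[np.argmin(g)])  # ~0.143, i.e., h = d_intra / (C * d_intra + C - 1) = 1/7
```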

Figure 3: Example of how the HP filter addresses harmful heterophily.

4 Adaptive Channel Mixing (ACM) Framework

Besides the new homophily metrics, in this section we also show how the diversification operation (high-pass filter) can potentially address some cases of harmful heterophily, based on the similarity matrix proposed in equation (5). From this analysis, we argue that low-pass and high-pass filters should be combined for feature extraction, which leads us to the filterbank method in subsection 4.2. We generalize the filterbank method and propose the ACM framework in subsection 4.3.

4.1 How Diversification Operation Helps with Harmful Heterophily

We first consider the example shown in Figure 3. From $S(\hat{A},X)$, nodes 1 and 3 assign relatively large positive weights to nodes in class 2 after aggregation, which makes them hard to distinguish from nodes in class 2. Despite this, we can still distinguish nodes 1, 3 from nodes 4, 5, 6, 7 by considering their neighborhood differences: nodes 1 and 3 differ from most of their neighbors, while nodes 4, 5, 6, 7 are similar to most of their neighbors. This indicates that, in some cases, although some nodes become similar after aggregation, they are still distinguishable via their surrounding dissimilarities. This leads us to use the diversification operation, i.e., the high-pass (HP) filter $I-\hat{A}$ [10] (introduced in the next subsection), to extract the information of neighborhood differences and address harmful heterophily. As $S(I-\hat{A},X)$ in Figure 3 shows, nodes 1 and 3 assign negative weights to nodes 4, 5, 6, 7 after the diversification operation, i.e., nodes 1 and 3 treat nodes 4, 5, 6, 7 as negative samples and move away from them during backpropagation. Based on this example, we first propose diversification distinguishability as follows, to measure the proportion of nodes for which the diversification operation is potentially helpful.

Definition 2. Diversification Distinguishability (DD) based on $S(I-\hat{A},X)$.

Given $S(I-\hat{A},X)$, a node $v$ is diversification distinguishable if the following two conditions are satisfied at the same time:

$\textbf{1. }\mathrm{Mean}_{u}\left(\{S(I-\hat{A},X)_{v,u}\mid u\in\mathcal{V}\land Z_{u,:}=Z_{v,:}\}\right)\geq 0;\quad\textbf{2. }\mathrm{Mean}_{u}\left(\{S(I-\hat{A},X)_{v,u}\mid u\in\mathcal{V}\land Z_{u,:}\neq Z_{v,:}\}\right)\leq 0$  (10)

Then, the graph diversification distinguishability value is defined as

$\mathrm{DD}_{\hat{A},X}(\mathcal{G})=\frac{1}{\left|\mathcal{V}\right|}\Big{|}\{v\mid v\text{ is diversification distinguishable}\}\Big{|}$  (11)

We can see that $\mathrm{DD}_{\hat{A},X}(\mathcal{G})\in[0,1]$. The effectiveness of the diversification operation can be proved for binary classification problems under certain conditions based on Definition 2, leading us to:

Theorem 2 (see Appendix E for the proof). Suppose $X=Z$ and $\hat{A}=\hat{A}_{\text{rw}}$. Then, for a binary classification problem, i.e., $C=2$, all nodes are diversification distinguishable, i.e., $\mathrm{DD}_{\hat{A},Z}(\mathcal{G})=1$.

Theorem 2 theoretically demonstrates the importance of the diversification operation for extracting the high-frequency information of a graph signal [10]. Combined with the aggregation operation, which is a low-pass filter [10, 30], we obtain a filterbank that uses both aggregation and diversification operations to distinctively extract the low- and high-frequency information from graph signals. We introduce filterbanks in the next subsection.
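A short NumPy sketch of Definition 2 (again our own illustrative implementation; `S_hp` is $S(I-\hat{A},X)$, obtainable by passing `np.eye(N) - A_hat` to the `post_aggregation_similarity` helper sketched in Section 3.1):

```python
import numpy as np

def diversification_distinguishability(S_hp, y):
    """DD of eq. (11): fraction of nodes satisfying both conditions of eq. (10)
    on the high-pass similarity matrix S(I - A_hat, X)."""
    count = 0
    for v in range(len(y)):
        same = y == y[v]
        if S_hp[v, same].mean() >= 0 and S_hp[v, ~same].mean() <= 0:
            count += 1
    return count / len(y)
```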

4.2 Filterbank in Spectral and Spatial Forms

Filterbank

For the graph signal $\bm{x}$ defined on $\mathcal{G}$, a 2-channel linear (analysis) filterbank [10]⁴ includes a pair of low-pass (LP) and high-pass (HP) filters $H_{\text{LP}},H_{\text{HP}}$, where $H_{\text{LP}}$ and $H_{\text{HP}}$ retain the low-frequency and high-frequency content of $\bm{x}$, respectively.

⁴ In graph signal processing, an additional synthesis filter [10] is required to form the 2-channel filterbank. The synthesis filter is not needed in our framework, so we do not introduce it in this paper.

Most existing GNNs use a uni-channel filtering architecture [18, 35, 15] with either an $H_{\text{LP}}$ or an $H_{\text{HP}}$ channel, which only partially preserves the input information. Unlike the uni-channel architecture, a filterbank with $H_{\text{LP}}+H_{\text{HP}}=I$ does not lose any information of the input signal, i.e., it has the perfect reconstruction property [10].

Generally, the Laplacian matrices ($L_{\text{sym}}$, $L_{\text{rw}}$, $\hat{L}_{\text{sym}}$, $\hat{L}_{\text{rw}}$) can be regarded as HP filters [10] and the affinity matrices ($A_{\text{sym}}$, $A_{\text{rw}}$, $\hat{A}_{\text{sym}}$, $\hat{A}_{\text{rw}}$) can be treated as LP filters [30, 14]. Moreover, MLPs can be considered as having a special identity filterbank with the matrix $I$, which satisfies $H_{\text{LP}}+H_{\text{HP}}=I+0=I$.

Filterbank in Spatial Form

Filterbank methods can also be extended to spatial GNNs. Formally, at the node level, left-multiplying $\bm{x}$ by $H_{\text{LP}}$ and $H_{\text{HP}}$ performs aggregation and diversification operations, respectively. For example, suppose $H_{\text{LP}}=\hat{A}$ and $H_{\text{HP}}=I-\hat{A}$; then for node $i$ we have

$(H_{\text{LP}}\bm{x})_{i}=\sum_{j\in\{\mathcal{N}_{i}\cup i\}}\hat{A}_{i,j}\bm{x}_{j},\quad (H_{\text{HP}}\bm{x})_{i}=\bm{x}_{i}-\sum_{j\in\{\mathcal{N}_{i}\cup i\}}\hat{A}_{i,j}\bm{x}_{j}$  (12)

where $\hat{A}_{i,j}$ is the connection weight between the two nodes. To leverage the HP and identity channels in GNNs, we propose the Adaptive Channel Mixing (ACM) architecture in the following subsection.
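In matrix form, the two channels of equation (12) are one line each; a minimal sketch, which also makes the perfect reconstruction property visible (the two channels sum back to the input):

```python
import numpy as np

def filterbank_channels(A_hat, X):
    """LP and HP channels of eq. (12); lp + hp == X (perfect reconstruction)."""
    lp = A_hat @ X        # aggregation:     (H_LP x)_i = sum_j A_hat[i, j] x_j
    hp = X - lp           # diversification: (H_HP x)_i = x_i - sum_j A_hat[i, j] x_j
    return lp, hp
```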

4.3 Adaptive Channel Mixing (ACM) GNN Framework

The ACM framework can be applied to many baseline GNNs. In this subsection, we use GCN as an example and introduce the ACM framework in matrix form. We use $H_{\text{LP}}$ and $H_{\text{HP}}$ to represent general LP and HP filters. The ACM framework includes 3 steps, as follows.

Step 1. Feature Extraction for Each Channel:  (13)

Option 1: $H^{l}_{L}=\text{ReLU}\left(H_{\text{LP}}H^{l-1}W^{l-1}_{L}\right),\ H^{l}_{H}=\text{ReLU}\left(H_{\text{HP}}H^{l-1}W^{l-1}_{H}\right),\ H^{l}_{I}=\text{ReLU}\left(IH^{l-1}W^{l-1}_{I}\right)$;

Option 2: $H^{l}_{L}=H_{\text{LP}}\,\text{ReLU}\left(H^{l-1}W^{l-1}_{L}\right),\ H^{l}_{H}=H_{\text{HP}}\,\text{ReLU}\left(H^{l-1}W^{l-1}_{H}\right),\ H^{l}_{I}=I\,\text{ReLU}\left(H^{l-1}W^{l-1}_{I}\right)$;
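Step 1 maps directly to code. Below is a minimal NumPy sketch of one ACM layer using Option 1 of equation (13). Since this excerpt ends before Steps 2 and 3 are given, the final mixing uses a fixed weight vector `alpha` purely as an illustrative placeholder; it is not the adaptive, learned mixing mechanism that gives ACM its name:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def acm_layer_option1(A_hat, H_prev, W_L, W_H, W_I, alpha):
    """Step 1 / Option 1 of eq. (13) with H_LP = A_hat and H_HP = I - A_hat.
    `alpha` is a fixed 3-vector standing in for the adaptive mixing weights."""
    I = np.eye(A_hat.shape[0])
    H_L = relu(A_hat @ H_prev @ W_L)         # low-pass (aggregation) channel
    H_H = relu((I - A_hat) @ H_prev @ W_H)   # high-pass (diversification) channel
    H_I = relu(H_prev @ W_I)                 # identity channel
    return alpha[0] * H_L + alpha[1] * H_H + alpha[2] * H_I
```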