In order to solve the problem that the traditional spectral clustering algorithm is time-consuming and resource consuming when applied to large-scale data, resulting in poor clustering effect or even unable to cluster, this paper proposes a spectral clustering algorithm based on granular-ball(GBSC). The algorithm changes the construction method of the similarity matrix. Based on granular-ball, the size of the similarity matrix is greatly reduced, and the construction of the similarity matrix is more reasonable. Experimental results show that the proposed algorithm achieves better speedup ratio, less memory consumption and stronger anti noise performance while achieving similar clustering results to the traditional spectral clustering algorithm. Suppose the number of granular-balls is m,nm, n is the number of points in the dataset, and m≪nm \ll n, the time complexity of GBSC is O(m^(3))O\left(m^{3}\right). It is proved that GBSC has good adaptability to large-scale datasets. All codes have been released at https://github.com/xjnine/GBSC. 为了解决传统谱聚类算法在应用于大规模数据时耗时且资源消耗大的问题,导致聚类效果不佳甚至无法聚类,本文提出了一种基于粒球(GBSC)的谱聚类算法。该算法改变了相似度矩阵的构建方法。基于粒球,相似度矩阵的规模大大减小,相似度矩阵的构建更加合理。实验结果表明,所提出的算法在实现与传统谱聚类算法相似的聚类结果的同时,具有更好的加速比、更低的内存消耗和更强的抗噪声性能。假设粒球的数量为数据集中点的数量, m,nm, n 是数据集的大小, m≪nm \ll n ,GBSC 的时间复杂度为 O(m^(3))O\left(m^{3}\right) 。证明了 GBSC 对大规模数据集具有良好的适应性。所有代码已发布在 https://github.com/xjnine/GBSC。
Index Terms-Clustering, granular computing, granular-ball, spectral clustering. 索引词-聚类,粒计算,粒球,谱聚类。
I. IntroductionI. 引言
CLUSTERING is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters) [1]. Clustering has been widely used in research fields such as data mining [2], pattern recognition [3], image analysis [4], and information retrieval [5]. After decades of research, a large number of clustering algorithms have been proposed, such as partition clustering [6], hierarchical clustering [7], density clustering [8], grid clustering [9], model clustering [10], and graph clustering [11]. As a representative algorithm in graph clustering algorithms, spectral clustering has also been widely studied. In the beginning, people’s research on spectral clustering mainly focused on graph partitioning criteria, and more attention was paid to how to divide clusters more accurately, 聚类是无监督分类模式(观察值、数据项或特征向量)到组(簇)中的分类[1]。聚类已广泛应用于数据挖掘[2]、模式识别[3]、图像分析[4]和信息检索[5]等研究领域。经过几十年的研究,已提出了大量聚类算法,如划分聚类[6]、层次聚类[7]、密度聚类[8]、网格聚类[9]、模型聚类[10]和图聚类[11]。作为图聚类算法中的一个代表性算法,谱聚类也得到了广泛研究。最初,人们对于谱聚类的研发主要集中于图划分标准,更多地关注如何更准确地划分簇。
and developed the minimum cut set criterion [12], the normative cut set criterion [13], the proportional cut set criterion [14], the average cut set criterion [15], the maximum and minimum cut set criterion [16], the multi-path normative cut set criterion [17] and the division method of spectral clustering algorithm has been very perfect. However, as the scale of data increases, the shortcomings of spectral clustering begin to become prominent. The eigen decomposition calculation complexity of spectral clustering is high, and its time complexity is as high as O(n^(3))O\left(n^{3}\right). 并发展了最小割集准则[12]、规范割集准则[13]、比例割集准则[14]、平均割集准则[15]、最大和最小割集准则[16]、多路径规范割集准则[17],以及谱聚类算法的划分方法已经非常完善。然而,随着数据规模的增加,谱聚类的缺点开始变得突出。谱聚类的特征分解计算复杂度高,其时间复杂度高达 O(n^(3))O\left(n^{3}\right) 。
Therefore, how to improve the performance of spectral clustering has been widely studied. Zhang [18] et al. gives consideration to the quality of intra view division and the collaboration between views, carries out consistency constraints on diversified large-scale data sets through multi representative point strategy and multi view representativeness, and realizes spectral clustering of data through maximization of weighting coefficient and Lagrange multiplier. Compared with traditional spectral clustering, the improved algorithm has certain advantages in clustering time and storage optimization. Jiang [19] et al. selects nonnegative matrix decomposition to process multi view data to obtain the potential feature matrix of each view, then introduces orthogonal constraints to optimize local features, adjusts the clustering contribution of missing data through data weighting, and finally alleviates the memory requirements of large-scale data through block processing. Fowlkes [20] et al. uses Nyström approximate sampling to reduce the complexity of similarity calculation of large-scale data, which is conducive to the improvement of clustering efficiency. Through the adaptive optimization of the sampling process, the clustering effect of spectral clustering on large-scale data sets is further improved. Huang [21] et al. proposes a spectral clustering method based on K-means approximation. First, K-means clustering is performed on the data, and then spectral clustering is performed on the obtained clustering centers (key nodes). Other data points are assigned to the corresponding clusters following their corresponding clustering centers (key nodes). Cai [22] et al. re-encodes the randomly selected key nodes and the original data, and performs singular value decomposition (SVD) on the encoded new data. 因此,如何提高谱聚类的性能已被广泛研究。张[18]等人考虑了视图内划分的质量和视图之间的协作,通过多代表点策略和多视图代表性对多样化的大规模数据集进行一致性约束,并通过最大化权重系数和拉格朗日乘数实现数据的谱聚类。与传统的谱聚类相比,改进算法在聚类时间和存储优化方面具有一定的优势。江[19]等人选择非负矩阵分解来处理多视图数据以获得每个视图的潜在特征矩阵,然后引入正交约束来优化局部特征,通过数据加权调整缺失数据的聚类贡献,最后通过块处理减少大规模数据的内存需求。Fowlkes[20]等人使用 Nyström 近似采样来降低大规模数据相似度计算的复杂性,有利于提高聚类效率。 通过对采样过程的自适应优化,进一步提高了谱聚类在大规模数据集上的聚类效果。Huang 等人[21]提出了一种基于 K-means 近似的谱聚类方法。首先对数据进行 K-means 聚类,然后在得到的聚类中心(关键节点)上执行谱聚类。其他数据点根据其对应的聚类中心(关键节点)被分配到相应的簇中。Cai 等人[22]对随机选择的关键节点和原始数据重新编码,并对编码后的新数据执行奇异值分解(SVD)。
Most of the existing clustering methods are based on a single granularity of information, such as the distance and density of each data. Therefore, the clustering algorithm is greatly affected by outliers, and the process of calculate the Laplace matrix LL is slow. Although some research achievements have been made in spectral clustering on large-scale data sets. Most of the selected key nodes are obtained by random selection or K-means 大多数现有的聚类方法基于单一粒度的信息,例如每个数据的距离和密度。因此,聚类算法很容易受到异常值的影响,并且计算拉普拉斯矩阵的过程很慢。尽管在大规模数据集上的谱聚类方面取得了一些研究成果,但大多数选择的关键节点是通过随机选择或 K-means 获得的。
clustering, which can not guarantee that the selected key nodes can well represent the original data, thus affecting the subsequent clustering results. 聚类,这不能保证所选的关键节点能很好地代表原始数据,从而影响后续的聚类结果。
In order to solve the above problems, this paper proposes an improved spectral clustering algorithm based on granular-ball, which is a multi-granularity structure and can adaptively characterize unlabeled data [23]. We use granular-ball to represent unsupervised data, and then process the data efficiently with appropriate granularity. The core idea of granular-ball is that data with similar distribution form a granular-ball, and adjacent granular-balls form a cluster. We use granular-ball to represent the data in a coarse-grained way. We propose a clustering algorithm that combines multi-granularity Granular-Ball and Spectral Clustering. 为了解决上述问题,本文提出了一种基于粒球的改进型谱聚类算法,它是一种多粒度结构,能够自适应地表征未标记数据[23]。我们使用粒球来表示无监督数据,然后以适当的粒度高效地处理数据。粒球的核心思想是具有相似分布的数据形成粒球,相邻的粒球形成簇。我们以粗粒度的方式使用粒球来表示数据。我们提出了一种结合多粒度粒球和谱聚类的聚类算法。
The main contributions of this paper are as follows: 本文的主要贡献如下:
Self-adaption: The generation of granular-ball is completely parameter-free. Therefore, the algorithm does not require manual parameters. 自适应:粒球的生成完全是参数无关的。因此,该算法不需要手动参数。
Efficiency: Based on granular-ball, the size of the similarity matrix is greatly reduced and the construction of the similarity matrix is more reasonable. Suppose the number of granular-balls is m,nm, n is the number of points in the dataset, and m≪nm \ll n, the time complexity of GBSC is O(m^(3))O\left(m^{3}\right). 效率:基于粒球,相似矩阵的规模大大减小,相似矩阵的构建更加合理。假设粒球的个数是数据集中点的个数,即 m,nm, n ,GBSC 的时间复杂度是 O(m^(3))O\left(m^{3}\right) 。
Robustness: Since each granular-ball covers many points and only contains two data, namely center and radius, a small number of noise points can be smoothed, so that the granular-ball will not be affected. 鲁棒性:由于每个粒球覆盖了许多点,并且只包含两个数据,即中心和半径,少量的噪声点可以被平滑,因此粒球不会受到影响。
II. Related WorkII. 相关工作
A. Large Scale Spectral Clustering A. 大规模谱聚类
The spectral clustering method transforms the data clustering problem into a graph partition problem by using graph theory. By decomposing the eigenvalues of the graph Laplace matrix, the vector representation of the original data in the transformed low dimensional space is obtained. Finally, the k-means algorithm is run on these low dimensional feature vectors to get the final clustering results. 谱聚类方法通过图论将数据聚类问题转化为图划分问题。通过对图拉普拉斯矩阵的特征值分解,在变换后的低维空间中获得原始数据的向量表示。最后,在这些低维特征向量上运行 k-means 算法以获得最终的聚类结果。
Among them, the time complexity defects of spectral clustering are mainly reflected in three aspects: the construction of similarity matrix O(n^(2)d)O\left(n^{2} d\right), the eigenvalue decomposition of Laplace matrix O(n^(3))O\left(n^{3}\right) and the final K-means clustering step O(nkt)O(n k t) (t is the number of iterations). Its spatial complexity defect is mainly reflected in the O(n^(2))O\left(n^{2}\right) required to store the similarity matrix and Laplace matrix. With the increase of data size n , the computational complexity of spectral clustering is too high, which makes the spectral clustering method unable to play a better role in dealing with large-scale data sets. 其中,谱聚类的时空复杂度缺陷主要体现在三个方面:相似度矩阵的构建 O(n^(2)d)O\left(n^{2} d\right) 、拉普拉斯矩阵的特征值分解 O(n^(3))O\left(n^{3}\right) 以及最终的 k-means 聚类步骤 O(nkt)O(n k t) (t 是迭代次数)。其空间复杂度缺陷主要体现在需要存储相似度矩阵和拉普拉斯矩阵的内存空间。随着数据规模 n 的增加,谱聚类的计算复杂度过高,使得谱聚类方法难以在大规模数据集上发挥更好的作用。
In view of the above limitations on the application of spectral clustering to large-scale data sets, the existing spectral clustering methods can be divided into two types: feature map-based methods and sample reduction-based methods. 鉴于上述谱聚类在大规模数据集应用中的局限性,现有的谱聚类方法可以分为两类:基于特征映射的方法和基于样本减少的方法。
Hansen [24] et al. propose a large-scale spectral clustering algorithm based on the feature mapping method. This algorithm is proposed based on the fact that the feature vector is not suitable Hansen [24] 等人提出了一种基于特征映射方法的大规模谱聚类算法。该算法是基于特征向量不适用于这一事实提出的。
for the case of being interested in local features of data. It provides a method to construct Laplacian semi supervised feature vector to solve this problem. The algorithm uses locally biased feature vectors to perform locally biased machine learning, which overcomes the problem of applying spectral clustering to large-scale data. Wu [25] et al. proposed SCRB algorithm, it is the first time to use the random bin feature (RB) to generate the inner product of a large-scale sparse feature matrix, use the inner product to implicitly approximate the graph similarity matrix, and introduce the singular value decomposition solver to calculate the feature vector of the large-scale matrix, so as to accelerate the construction and decomposition of the similarity graph. Rahman [26] et al. Combined histogram based features with distribution based density features to develop a flexible feature mapping technology, which contains prior knowledge of data and provides flexible representation for it, thus improving the recognition ability of the classifier. Vazquez Martin [27] et al. proposed a key frame detector. The algorithm also establishes auxiliary mapping based on local characteristics, and then uses spectral clustering method to obtain graph partition. This method does not need to estimate the motion of the camera, and the similarity measure defined in the detector can be used for any type of feature. Yang [28] et al. proposed a local spectral clustering mapping algorithm based on sparse self coding. By preprocessing the data, the algorithm uses sparse self coding to extract deep-seated features that can reflect the nature of the original data, and replaces the original data. Each data is linearly reconstructed by its neighborhood, and the reconstruction weight is used to replace the Gaussian kernel function to establish the adjacency matrix. 对于关注数据局部特征的情况。它提供了一种构建拉普拉斯半监督特征向量的方法来解决这个问题。该算法使用局部偏差特征向量执行局部偏差机器学习,克服了将谱聚类应用于大规模数据的难题。Wu [25] 等人提出了 SCRB 算法,这是首次使用随机箱特征 (RB) 来生成大规模稀疏特征矩阵的内积,使用内积来隐式近似图相似度矩阵,并引入奇异值分解求解器来计算大规模矩阵的特征向量,从而加速相似度图的构建和分解。Rahman [26] 等人结合基于直方图的特征与基于分布的密度特征,开发了一种灵活的特征映射技术,其中包含数据的先验知识,为其提供灵活的表示,从而提高了分类器的识别能力。Vazquez Martin [27] 等人提出了一种关键帧检测器。 该算法还基于局部特征建立辅助映射,然后使用谱聚类方法获得图划分。该方法无需估计摄像机的运动,探测器中定义的相似性度量可用于任何类型的特征。Yang 等人[28]提出了一种基于稀疏自编码的局部谱聚类映射算法。通过对数据进行预处理,该算法使用稀疏自编码提取能够反映原始数据本质的深层特征,并用这些特征替换原始数据。每个数据点由其邻域线性重建,重建权重用于替代高斯核函数以建立邻接矩阵。
Another kind of large-scale spectral clustering algorithm is to reduce the size of sample data set by selecting representative objects. This method reduces the size of available data sets by selecting representative objects in the data set and using the representative objects to form a small-scale data set. Then, the existing spectral clustering method is used to partition the small-scale data set composed of these representative objects. Finally, the class to which the data in the original dataset belongs is allocated according to the class to which the representative object belongs. The typical method is to use Nystrom method to reduce the computational cost of solving eigenvectors. The main idea is that given a data matrix, the Nystrom method is used to randomly sample the data, calculate the eigenvectors of the sub matrix, and use the calculated eigenvectors to estimate the eigenvectors of the original large-scale matrix [29]. For example, Charles [20] et al. Applied the Nystrom method to the grouping problem of permutation and combination. Based on the fact that the pixels occupied by the target in the real scene are far less than all pixels, it is just consistent with the random sampling data of Nystrom method, and successfully solves the grouping problem of large-scale data. In addition, He [30] et al. proposed an algorithm. The main idea is to use random Fourier features to explicitly represent data in kernel space. It solves the problem of insufficient memory by constructing a kernel matrix which is far smaller than the data scale. At the same time, the running speed of the algorithm is significantly faster. 另一种大规模谱聚类算法是通过选择代表性对象来减小样本数据集的规模。该方法通过在数据集中选择代表性对象,并使用这些代表性对象来形成一个规模较小的数据集,从而减小可用数据集的规模。然后,使用现有的谱聚类方法对由这些代表性对象组成的小规模数据集进行划分。最后,根据代表性对象所属的类别,将原始数据集中的数据分配到相应的类别中。典型的方法是使用 Nystrom 方法来降低求解特征向量的计算成本。其主要思想是:给定一个数据矩阵,使用 Nystrom 方法对数据进行随机采样,计算子矩阵的特征向量,并使用计算出的特征向量来估计原始大规模矩阵的特征向量[29]。例如,Charles 等人[20]将 Nystrom 方法应用于排列组合的分组问题。 事实上,目标在真实场景中占用的像素远少于所有像素,这与 Nystrom 方法的随机采样数据一致,并成功解决了大规模数据的分组问题。此外,He 等人[30]提出了一种算法。其主要思想是利用随机傅里叶特征在核空间中显式表示数据。通过构建远小于数据规模的核矩阵,解决了内存不足的问题。同时,该算法的运行速度显著提高。
In addition, there are some large-scale spectral clustering algorithms that are not based on the above two representative 此外,还有一些基于上述两种代表性方法的大规模谱聚类算法。
methods, such as the algorithm proposed by Huang [21] et al. It adopts a mixed representation selection strategy, and designs a k-nearest representation fast approximation method to construct a sparse affinity submatrix, then interprets the sparse submatrix as a bipartite graph, and finally uses transfer cutting to effectively partition the graph to obtain clustering results. Yang [31] et al. applied the concept of hypergraph to the improvement of spectral clustering algorithm, thus proposed an algorithm, which extended hypergraph to a general format to capture complex high-order relationships, and proposed the concept of “feature technique” to reduce the computational complexity and accelerate the solving process of feature problems. This algorithm overcomes the problem of insufficient memory on large-scale data by storing only hypergraph instance matrix, but not Laplace matrix. 方法,例如黄等人提出的算法[21],它采用混合表示选择策略,设计了一种 k 近邻表示快速近似方法来构建稀疏亲和子矩阵,然后将稀疏子矩阵解释为二部图,最后使用迁移切割有效地划分图以获得聚类结果。杨等人[31]将超图的概念应用于谱聚类算法的改进,从而提出了一种算法,该算法将超图扩展为一般格式以捕获复杂的高阶关系,并提出了“特征技术”的概念以降低计算复杂度并加速特征问题的求解过程。该算法通过仅存储超图实例矩阵而不是拉普拉斯矩阵克服了大规模数据内存不足的问题。
It can be seen that sample extraction is a main strategy for optimizing the similarity matrix of spectral clustering methods. Although the operation of reference sampling using the distance of data points is intuitive and simple, the artificial selection of threshold and the mapping of feature space will lead to the difference between representative samples and actual samples. In addition, using the method of binding representative points to sparse the graph will damage the density of the data point set and reduce the clustering accuracy. Increasing the number of representative points according to the size of the data set will result in that the larger the amount of data, the more representative points, and the lower the accuracy of the algorithm. 可以看出,样本提取是优化谱聚类方法相似性矩阵的主要策略。尽管使用数据点距离进行参考样本抽样的操作直观简单,但人工选择阈值和特征空间映射会导致代表性样本与实际样本之间的差异。此外,使用将代表性点绑定到稀疏图的方法会损害数据点集的密度并降低聚类精度。根据数据集的大小增加代表性点的数量会导致数据量越大,代表性点越多,算法精度越低。
The generation process of granular-ball is a process of adaptive characterization of unlabeled data, which can eliminate the interference of noise and reduce the scale of data. This paper first optimizes the generation process of granular-ball, and then uses granular-ball to improve the construction of similarity matrix and improve the efficiency of spectral clustering algorithm. 粒子球的生成过程是对未标记数据自适应表征的过程,可以消除噪声的干扰并减少数据规模。本文首先优化了粒子球的生成过程,然后使用粒子球来改进相似性矩阵的构建,提高谱聚类算法的效率。
B. The Basic Process of Spectral Clustering B. 谱聚类的基本过程
The spectral clustering algorithm adopts the idea of undirected graph. The data in the data set is regarded as the nodes in the undirected graph, and the similarity between the data is expressed as the weight between the nodes. Then the undirected graph is divided into different subgraphs according to the weight rules. The clustering analysis of the data is realized by optimizing the high weight of the edges in the subgraph and the low weight of the edges between the subgraphs. The solution of optimal division of spectral clustering usually depends on Laplace matrix, and its process is as follows: 谱聚类算法采用无向图的思想。数据集中的数据被视为无向图中的节点,数据之间的相似性表示为节点之间的权重。然后根据权重规则将无向图划分为不同的子图。通过优化子图内的高权重边和子图之间的低权重边来实现数据的聚类分析。谱聚类的最优划分解通常依赖于拉普拉斯矩阵,其过程如下:
Input the initial class number k and the similarity matrix s inR^((n**m))s \in R^{(n * m)}, then establish the similarity graph of the data set, and calculate the edge weight matrix WW; 输入初始类别数 k 和相似矩阵 s inR^((n**m))s \in R^{(n * m)} ,然后建立数据集的相似图,并计算边权重矩阵 WW ;
Calculate the Laplace matrix LL of the graph, and give the matrix composed of the first kk eigenvectors: V=V=[v_(1),v_(2),dots,v_(k)]inR^((n**k))\left[v_{1}, v_{2}, \ldots, v_{k}\right] \in R^{(n * k)}; Normalize each row of VV so that its norm value is 1 to obtain the matrix U=[u_(i)j]U=\left[u_{i} j\right], where, 计算图的拉普拉斯矩阵 LL ,并给出由前 kk 个特征向量组成的矩阵: V=V=[v_(1),v_(2),dots,v_(k)]inR^((n**k))\left[v_{1}, v_{2}, \ldots, v_{k}\right] \in R^{(n * k)} ;将 VV 的每一行进行归一化,使其范数为 1,得到矩阵 U=[u_(i)j]U=\left[u_{i} j\right] ,其中,
Set y_(i)inR^(k)y_{i} \in R^{k} corresponds to column ii of matrix U,i=U, i=1,2,dots,n1,2, \ldots, n, then y_(i)inR^(k)y_{i} \in R^{k} for clustering, clustering result C_(1),C_(2),dots,C_(k)C_{1}, C_{2}, \ldots, C_{k}. 设 y_(i)inR^(k)y_{i} \in R^{k} 对应于矩阵 U,i=U, i=1,2,dots,n1,2, \ldots, n 的第 ii 列,然后 y_(i)inR^(k)y_{i} \in R^{k} 用于聚类,聚类结果 C_(1),C_(2),dots,C_(k)C_{1}, C_{2}, \ldots, C_{k} 。
C. Granular-Ball Computing C. 粒子球计算
Given a data set D=p_(i)(i=1,2,dots,n)D=p_{i}(i=1,2, \ldots, n), where nn is the number of samples on DD. Granular balls GB_(1),GB_(2),dots,GB_(m)G B_{1}, G B_{2}, \ldots, G B_{m} are used to cover and represent the data set DD. Suppose the number of samples in the j^("th ")j^{\text {th }} granular-ball GB_(j)G B_{j} is expressed as |GB_(j)|\left|G B_{j}\right|, then its coverage degree can be expressed as sum_(j=1)^(m)(|GB_(j)|)//n\sum_{j=1}^{m}\left(\left|G B_{j}\right|\right) / n. The basic model of granular-ball coverage can be expressed as 给定数据集 D=p_(i)(i=1,2,dots,n)D=p_{i}(i=1,2, \ldots, n) ,其中 nn 是 DD 上的样本数。使用粒子球 GB_(1),GB_(2),dots,GB_(m)G B_{1}, G B_{2}, \ldots, G B_{m} 来覆盖和表示数据集 DD 。假设第 j^("th ")j^{\text {th }} 个粒子球 GB_(j)G B_{j} 中的样本数表示为 |GB_(j)|\left|G B_{j}\right| ,则其覆盖度可以表示为 sum_(j=1)^(m)(|GB_(j)|)//n\sum_{j=1}^{m}\left(\left|G B_{j}\right|\right) / n 。粒子球覆盖的基本模型可以表示为
{:[minlambda_(1)**n//sum_(j=1)^(m)(|GB_(j)|)//n+lambda_(2)**m],[" s.t. quality "(GB_(j)) >= T]:}\begin{gathered}
\min \lambda_{1} * n / \sum_{j=1}^{m}\left(\left|G B_{j}\right|\right) / n+\lambda_{2} * m \\
\text { s.t. quality }\left(G B_{j}\right) \geq T
\end{gathered}
where lambda_(1)\lambda_{1} and lambda_(2)\lambda_{2} are the corresponding weight coefficients, and mm the number of granular balls. When other factors remain unchanged, the higher the coverage, the less the sample information is lost, and the more the number of granular-balls, the the characterization is more accurate. Therefore, the minimum number of granular-balls should be considered to obtain the maximum coverage degree when generating granular-balls. By adjusting the parameters lambda_(1)\lambda_{1} and lambda_(2)\lambda_{2}, the optimal granular-ball generation results can be obtained to minimize the value of the whole equation. In most cases, the two items in the objective function do not affect each other and do not need trade off, so lambda_(1)\lambda_{1} and lambda_(2)\lambda_{2} are set to 1 by default. Granular-ball computing can fit arbitrarily distributed data [34], [37], [39]. 其中 lambda_(1)\lambda_{1} 和 lambda_(2)\lambda_{2} 是相应的权重系数, mm 是粒子球的数量。在其他因素不变的情况下,覆盖度越高,样本信息丢失越少,粒子球数量越多,表征越准确。因此,在生成粒子球时,应考虑最小化粒子球数量以获得最大覆盖度。通过调整参数 lambda_(1)\lambda_{1} 和 lambda_(2)\lambda_{2} ,可以获得最优的粒子球生成结果,以最小化整个方程的值。在大多数情况下,目标函数中的两项不会相互影响,无需权衡,因此默认将 lambda_(1)\lambda_{1} 和 lambda_(2)\lambda_{2} 设置为 1。粒子球计算可以适应任意分布的数据 [34], [37], [39]。
III. Granular-Ball Spectral Clustering III. 粒子球谱聚类
A. The Process of Generating Granular-Balls A. 生成粒球的过程
The granular-ball method proposed by Xia [23] et al. treat the entire data as the most coarse-grained, and use granular-ball to split and refine from coarse-grained to fine-grained to achieve a scalable and robust computing process. This approach is based on a basic assumption that the data with similar distribution form a granular-ball and the adjacent granular-balls form a cluster. In this paper, the splitting standard is changed from that the quality of the child ball is better than that of the parent ball to that of the weighted child ball. This improvement can obtain a more reasonable set of granular-balls in noisy datasets. Xia 等人 [23] 提出的粒球方法将整个数据视为最粗粒度,并使用粒球从粗粒度到细粒度进行分割和细化,以实现可扩展和鲁棒的计算过程。这种方法基于一个基本假设,即具有相似分布的数据形成一个粒球,相邻的粒球形成一个簇。在本文中,分割标准从子球的质量优于父球变为加权子球的质量。这种改进可以在噪声数据集中获得更合理的粒球集。
Definition 1. Given a dataset D inR^(d)D \in R^{d}, there are two quantities: Granular-Ball (GB_(j))\left(G B_{j}\right) and Distribution Measure (DM_(j))\left(D M_{j}\right). For each GB_(j)G B_{j}, the c_(j)c_{j} is the center of GB_(j)G B_{j} and r_(j)r_{j} is the radius of GB. The definitions of c_(j)c_{j} and r_(j)r_{j} are as follows: 定义 1。给定数据集 D inR^(d)D \in R^{d} ,有两个量:粒球 (GB_(j))\left(G B_{j}\right) 和分布度量 (DM_(j))\left(D M_{j}\right) 。对于每个 GB_(j)G B_{j} , c_(j)c_{j} 是 GB_(j)G B_{j} 的中心, r_(j)r_{j} 是 GB 的半径。 c_(j)c_{j} 和 r_(j)r_{j} 的定义如下:
where ||\|.||denotesthe2\| denotes the 2-norm and n_(j)n_{j} represents the number of data located in GB_(j)G B_{j}. 其中 ||\| 。 ||denotesthe2\| denotes the 2 -范数和 n_(j)n_{j} 表示位于 GB_(j)G B_{j} 中的数据数量。
Definition 2. DM_(j)D M_{j} is measured by computing the ration of the number data point n_(j)n_{j} and the sum radius s_(j)s_{j} in GB_(j)G B_{j}, Where 定义 2. DM_(j)D M_{j} 是通过计算数据点 n_(j)n_{j} 与 GB_(j)G B_{j} 中半径 s_(j)s_{j} 的比值来衡量的,其中
Manuscript received 7 September 2022; revised 20 January 2023; accepted 21 February 2023. Date of publication 27 February 2023; date of current version 8 August 2023. This work was supported in part by NICE: NRT for Integrated Computational Entomology, US NSF under Grant 1631776, in part by the Natural Science Foundation of Chongqing under Grants cstc2019jcyj-msxmX0485 and cstc2019jcyj-cxttX0002 and in part by the National Natural Science Foundation of China under Grants 62176033 and 61936001. Recommended for acceptance by L. Nie. (Corresponding author: Shuyin Xia.) 手稿于 2022 年 9 月 7 日收到;于 2023 年 1 月 20 日修改;于 2023 年 2 月 21 日接受。发表日期为 2023 年 2 月 27 日;当前版本日期为 2023 年 8 月 8 日。这项工作部分由 NICE:综合计算昆虫学 NRT 支持,美国国家科学基金会资助项目 1631776,部分由重庆市自然科学基金资助项目 cstc2019jcyj-msxmX0485 和 cstc2019jcyj-cxttX0002 支持,以及部分由中国国家自然科学基金资助项目 62176033 和 61936001 支持。建议由 L. Nie 接受。(通讯作者:夏舒印。)
The authors are with the Chongqing Key Laboratory of Computatio nal Intelligence, Chongqing University of Telecommunications and Posts, Chongqing 400065, China (e-mail: xiejiang@cqupt.edu.cn; s200231137@ stu.cqupt.edu.cn; xiasy @cqupt.edu.cn; wanggy@cqupt.edu.cn; gaoxb@cqupt. edu.cn). 作者单位:重庆邮电大学重庆大学计算智能重点实验室,重庆 400065,中国(电子邮件:xiejiang@cqupt.edu.cn;s200231137@stu.cqupt.edu.cn;xiasy@cqupt.edu.cn;wanggy@cqupt.edu.cn;gaoxb@cqupt.edu.cn)。
This article has supplementary downloadable material available at https://doi.org/10.1109/TKDE.2023.3249475, provided by the authors. 本文在 https://doi.org/10.1109/TKDE.2023.3249475 提供了作者提供的可下载的补充材料。
Digital Object Identifier 10.1109/TKDE.2023.3249475