Introduction
Visual cluster analysis is an inherently human-in-the-loop task that lacks a universal ground truth. It therefore commonly employs dimensionality reduction (DR) techniques to support the visualization and exploration of cluster patterns [9], [84], [87]. Many DR techniques, such as t-SNE and UMAP, adopt the "proximity ≈ similarity" metaphor, in which the similarity between a pair of points is preserved by the distance between their 2D embeddings [82]. Points that gather together in the embedding space thus intuitively form implicit clusters. In this paper, we propose an interactive DR technique for visual cluster analysis.
Although existing DR techniques are widely used for visual cluster analysis, their results are often unsatisfactory. Fig. 1 shows the embedding results of three widely used DR methods on the Indian Food dataset [4]. With ISOMAP [73], the data points are heavily cluttered, making it extremely difficult to identify clusters. t-SNE [76] and UMAP [55], which are regarded as state-of-the-art DR methods [28], [62], [78], achieve better separation between clusters. However, identifying all clusters without the color encoding of class labels, which are usually unavailable in clustering tasks, remains confusing. Moreover, it is unclear how users can steer these embeddings. From these limitations, we identify three major requirements for DR techniques to support interactive visual cluster analysis.
The first is to correctly construct neighborhood structures and faithfully preserve them in the embedding space. This ensures a trustworthy representation of cluster patterns, which is a prerequisite for visual cluster analysis. The state-of-the-art t-SNE and UMAP explicitly or implicitly use k-nearest neighbors (kNN) to represent neighborhood structures. However, kNN inevitably contains errors due to the limitation of the Euclidean similarity measure, as data dissimilarities can be inherently non-Euclidean [57]. Under Euclidean similarity, data points from different manifolds may be regarded as each other's k-nearest neighbors. Moreover, preserving the correct neighbors in the embedding space is also non-trivial. Missing neighbors and false neighbors frequently occur in existing DR techniques [52], [57]. Several parametric techniques have been proposed, such as parametric UMAP [67], parametric t-SNE [75], and deep recursive embedding [92]. However, they perform worse than state-of-the-art non-parametric techniques in neighborhood preservation. Proposing high-quality parametric DR techniques remains an open problem.
The second is to achieve clear visual cluster separation in the low-dimensional space, which can strengthen users' confidence and speed up the process of identifying clusters. Neither t-SNE nor UMAP performs well in this regard. As shown in Fig. 1, both methods obtain relatively low Silhouette Coefficient (SC) scores (t-SNE: 0.466; UMAP: 0.610) and exhibit visual confusion between clusters. Existing work is devoted to linear DR techniques that visually discriminate labeled classes [80]. However, it is unclear how to improve visual cluster separation for datasets without class labels. Furthermore, although DR techniques aim to preserve neighborhood structures, separating two clusters in the embedding space breaks those structures if the clusters are not well separated in the high-dimensional space. Separating clusters while preserving neighborhoods is non-trivial; there is usually a trade-off between the two.
The third is effective interaction to support human-in-the-loop analysis. Users want to leverage their domain knowledge to steer the low-dimensional embedding, for example, specifying that two points should be in the same cluster or belong to different clusters. Although several interactive embedding techniques have been developed [31], [45], [47], [49], little research effort has been devoted to steering DR for visual cluster analysis. Perez et al. [60] allow users to manipulate a single parameter to generate separated clusters. However, interaction on parameters is not intuitive and requires model knowledge. Intuitive interactive nonlinear DR techniques remain an open problem.
To meet these requirements, we propose a contrastive dimensionality reduction technique (CDR) for interactive visual cluster analysis. Contrastive learning is a self-supervised representation learning approach that has achieved great success in discriminating similar/dissimilar data points [23], [37]. We therefore first introduce contrastive learning into dimensionality reduction to better preserve neighborhood structures and present trustworthy cluster patterns. To this end, we carefully design positive/negative pairs and adapt the contrastive loss to the context of dimensionality reduction. Second, to facilitate visual cluster analysis, we further improve the proposed CDR to achieve better visual cluster separation while preserving neighborhood structures. To achieve this, we redefine the gradient of the loss function for negative pairs so that the model can discriminate not only the similarity/dissimilarity between data points but also that between clusters. Third, we introduce link-based constraints into CDR. This allows users to interact through must-link and cannot-link operations, supporting them in steering the embedding with their domain knowledge. An iterative interaction-and-optimization analysis loop is supported to further reduce the interaction burden. Based on the proposed interactive contrastive dimensionality reduction method, we also develop a prototype interface that integrates the proposed method with a set of visualizations for interactive visual cluster analysis. Through quantitative comparisons on real-world datasets, we show that the proposed CDR outperforms t-SNE and UMAP in presenting trustworthy and well-separated cluster patterns. Ablation experiments demonstrate the effectiveness of the proposed method. We also demonstrate the effectiveness of CDR in interactive visual cluster analysis through a user study and two use cases. The code of the proposed CDR is available at https://github.com/DRLib/CDR.
In summary, our main contributions are:
We propose a contrastive dimensionality reduction (CDR) method that introduces contrastive learning into dimensionality reduction and improves the contrastive loss for better neighborhood preservation and visual cluster separation.
We introduce link-based interactions into CDR to capture user intent for interactive visual cluster analysis.
Extensive quantitative comparisons, a user study, and two case studies demonstrate the effectiveness of the proposed CDR in comparison with popular DR methods such as t-SNE and UMAP.
Related Work
2.1 Visual Cluster Analysis
Visual cluster analysis leverages visualization techniques to understand and explore data features, clustering models, and cluster patterns. Data features, such as the statistical distributions of dimensions and aggregated statistics, describe how feature distributions over dimensions affect clustering results, enabling an understanding of the relationship between clusters and the relevant dimensions. They can be visualized with statistical charts such as glyphs [64], histograms [18], and parallel coordinates [2], [42]. In addition to directly visualizing statistical features, reordering them in parallel coordinates [13] or similarity matrices [53], [63], [91] is also an effective way of clustering, as instances with similar features can be placed closely through various reordering strategies. Combining visualization with cluster analysis models enables users to understand, steer, and refine the models. For example, heatmaps [69] or trees [86] can be used to explore hierarchical clustering structures and guide the refinement of clustering results. Moreover, visually encoding the model, such as with classification labels, helps examine the relevance of instance attributes within clusters [29]. Recently, efforts have been made to adopt DR techniques to explore cluster patterns in the embedding space [82]. In particular, DR techniques can be combined with clustering algorithms, either before [56], [88] or after clustering [21], [30], [48], to explicitly form clusters in the 2D embedding space. DR techniques that adopt the "proximity ≈ similarity" metaphor can be visualized directly, preserving implicit clusters [19], [66] and establishing the most intuitive relationship with the original cluster structures in the high-dimensional space. Since the transformation may introduce distortions that make the relative positions of clustered instances unreliable [39], the challenge lies in preserving as much of the neighborhood structure as possible while obtaining high-quality, well-separated clusters. Our method leverages contrastive learning and interactive clustering to address this challenge.
2.2 Interactive Clustering
Interactive clustering leverages the domain knowledge of end users to guide the automatic clustering process. Many different interactive operations have been offered to support the steering process, such as adjusting the features or weights of clustered instances [11], [58], parameterizing the underlying model [3], [6], [14], [22], or directly manipulating the clustering results (e.g., by splitting or merging clusters) [10], [20]. However, these techniques either require a deep understanding of the underlying model or rely on injecting domain-specific knowledge from external sources. To reduce the cognitive cost, initial efforts have been made on micro-level interactions that focus on relation-based variables between samples in supervised settings. They can be operated visually and are independent of specific domains and underlying models. For example, users can adjust the embedding distance between specific clustered instances [15], [25], specify their relationships [85], or directly establish must-link and cannot-link constraints on pairwise instances to refine various dimensionality reduction models [50]. Link-based interactions for clustering algorithms inspire us to apply steerable embeddings in interactive visual cluster analysis.
2.3 Interactive Dimensionality Reduction
Some dimensionality reduction techniques support an interactive DR process, in which users interactively control the DR model in a semi-automatic setting. Most existing techniques allow users to adjust the underlying parameterization, either for feature weights [19], [40], [81] or by directly switching between different DR models [24], [48]. To reduce the cognitive cost required to parameterize the underlying model, observation-level interactions have been proposed to interact directly with data points in the embedding. For example, the techniques described in Brown et al. [15] and Endert et al. [31] allow users to directly adjust the positions of data points, from which hidden parameters such as weighted distance functions are updated accordingly. Furthermore, individual points can be assigned as control points, whose position adjustments dynamically modify the global layout of the embedding [41], [54]. Techniques such as InterAxis [45] and AxiSketcher [49] allow users to directly manipulate data points, e.g., by drawing them into lines, to specify dimensional relationships with which the embedding is dynamically updated. However, rather than using the above interactions, it is desirable to leverage link constraints to exert explicit influence on the similarity matrix of the embedding [17]. Similar to our approach, link-based interactions have been introduced into embedding techniques such as multidimensional scaling [47] and kernel PCA [16], where the distance matrix of the embedding can be refined according to link-based constraints. Unfortunately, they are not optimized for visual cluster separation, resulting in relatively poor performance [5]. Little effort has been made to leverage interactive DR to facilitate cluster analysis. For example, the technique proposed by Perez et al. [60] allows users to manipulate a single parameter to generate well-separated clusters while preserving the original structures in the high-dimensional space. In contrast to these methods, the proposed method can impose constraints directly on individual data points.
Method
In this section, we introduce our approach for interactive visual cluster analysis. First, we introduce contrastive learning into dimensionality reduction to better preserve neighborhood structures. Second, we redefine the gradient of the loss function for negative pairs to improve visual cluster separation. Third, we introduce link-based interactions, including must link and cannot link, into CDR, enabling users to steer the embedding. We also implement a prototype interface that integrates the proposed techniques with a set of visualizations for interactive visual cluster analysis.
3.1 Contrastive Learning-Based Embedding
Contrastive learning is a self-supervised representation learning approach that has achieved great success in instance discrimination for computer vision tasks [23], [37]. In general, it pulls similar instances together and pushes dissimilar instances apart to learn effective embeddings [74]. Similar and dissimilar relationships are represented by positive and negative pairs, respectively. The embedding network is trained with a contrastive loss function that measures the loss over pairs of points. To exploit the power of contrastive learning, we introduce it into dimensionality reduction for visual cluster analysis. In our method, the embedding function is
Positive Pairs
Positive pairs specify similar relationships between points. In computer vision tasks, positive pairs are usually specified by data augmentation techniques [23], [37], [74], such as rotating images, to provide instance discrimination. Unlike computer vision tasks, we seek cluster-level discrimination rather than instance discrimination. Therefore, we need to specify pairs belonging to the same cluster as positive pairs. We note that k-nearest neighbors (kNN) [61] are widely considered to belong to the same class/cluster, e.g., in kNN-based classification algorithms [89], [90]. Therefore, for each
Negative Pairs
With only positive pairs, the model would collapse and embed all points at the same location [35]. Therefore, we need to specify negative pairs to separate dissimilar points. Following the common strategy used in contrastive learning [23], for each point
Contrastive Loss Function
We adopt NT-Xent (the normalized temperature-scaled cross-entropy loss) [72] as the contrastive loss function and adapt it for dimensionality reduction. In a training batch, point
Compared with the cross-entropy loss, NT-Xent adds a temperature parameter
Fig. 2(a) shows the gradients corresponding to different values of
3.2 Improving Visual Cluster Separation via Gradient Redefinition
Preserving neighborhood structures does not ensure sufficient separation between clusters for visual cluster analysis. Two distinguished clusters may still be positioned close to each other in the embedding space, requiring effort to identify them visually. Therefore, we need to improve the visual cluster separation of the embedding results.
As shown in Fig. 2(a), if the similarity of a negative pair is low, its gradient is also low. Consequently, such negative pairs contribute little to the embedding optimization. Given that the points of a negative pair usually belong to different clusters, we can increase their contribution to the embedding by increasing their gradients. The increased contribution enlarges the distance between the two clusters. Therefore, we redefine the gradient for negative pairs to enhance visual cluster separation.
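To make the idea concrete, the sketch below illustrates one way such a reweighting could look: the repulsive weight of a negative pair is augmented with a positively skewed density (the kind of distribution adopted later in this section), so that low-similarity negatives are pushed apart more strongly. The function name, the NT-Xent-style base term, and the shape parameters are illustrative assumptions, not the exact CDR formulation.

```python
import numpy as np
from scipy.stats import skewnorm

def redefined_negative_weight(similarity, a=4.0, loc=0.0, scale=0.3):
    """Illustrative repulsive weight for a negative pair.

    base:  an NT-Xent-style term that grows with the pair's similarity.
    boost: a positively skewed density that is large for low similarities
           (pairs very likely to lie in two different clusters) and decays
           smoothly to zero for higher ones, keeping the loss differentiable
           and leaving possible same-cluster negatives almost untouched.
    The shape parameters a, loc, and scale are placeholders, not CDR's values.
    """
    base = np.exp(similarity)
    boost = skewnorm.pdf(similarity, a, loc=loc, scale=scale)
    return base + boost

print(redefined_negative_weight(np.array([0.05, 0.5, 0.95])))
```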
However, the challenge is to find the appropriate portion of negative pairs to redefine. Simply increasing the gradient for all negative pairs would break the balance between positive and negative pairs. Negative pairs with high similarity would also introduce errors into training, because they may belong to the same cluster. Therefore, we decided to increase the gradient only for negative pairs that are highly likely to be placed in two different clusters. We select 25 real-world datasets from the UCI Repository [7], train 25 corresponding contrastive learning models, and count the negative pairs to obtain their statistics. Here, we use the class labels as the ground truth. Because the embedding has already corrected some of the false neighbors, we compute the similarities in the embedding space
According to the statistics, we only want to increase the gradient for negative pairs with low similarity while keeping the loss differentiable. NT-Xent shows that a gradient that is exponential with respect to the negative similarity works well in separating negative pairs. Therefore, we want to add a similar but smaller distribution to the gradient over the low band of similarities of separately clustered pairs. However, directly adding a clipped curve would make the gradient discontinuous. We therefore choose a positively skewed distribution, which resembles an exponential distribution on the left side and continues to zero on the right side. Specifically, the skewed distribution
Based on the positively skewed distribution, we redefine the gradient of a negative pair
3.3 Link-Based Steering of Embedding
We introduce link-based interactions into the proposed CDR to steer the embedding results. Must link and cannot link constraints are widely used in interactive clustering algorithms because they conveniently capture the user's instance-level knowledge [79], [85]. Must link specifies that two points should be clustered into the same cluster. In contrast, cannot link specifies that two points cannot be grouped into the same cluster.
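As a minimal illustration of how such constraints can be represented and fed into the pair construction, the sketch below simply treats must-link pairs as additional positive pairs and cannot-link pairs as additional negative pairs; all names are illustrative, and the actual CDR steering adjusts the pairing probabilities as described next.

```python
from dataclasses import dataclass, field

@dataclass
class LinkConstraints:
    must_links: list = field(default_factory=list)     # [(i, j), ...]: same cluster
    cannot_links: list = field(default_factory=list)   # [(i, j), ...]: different clusters

    def augment_pairs(self, positive_pairs, negative_pairs):
        """Return the pair lists extended with the user-specified links."""
        return positive_pairs + self.must_links, negative_pairs + self.cannot_links

links = LinkConstraints()
links.must_links.append((12, 873))     # user: these two points belong together
links.cannot_links.append((45, 1022))  # user: these two points must be separated
pos_pairs, neg_pairs = links.augment_pairs([(0, 1), (2, 3)], [(0, 7)])
```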
Given a pre-trained embedding, the steering process starts by specifying link-based constraints from users. Then, the probability
3.4 Network Structure and Training Process
Network Structure
The inputs of our model are high-dimensional data points. For image data, we employ SimCLR [23] to transform each image into a 512-dimensional vector. Similar to SimCLR [23], our training model has an encoder that learns the representation of the data and a projection head that embeds the representation into a 2-dimensional embedding space (Fig. 4). The encoder is a dense neural network that consists of four densely-connected layers. The four layers contain 128, 256, 256, and 512 units, respectively. We add a batch normalization layer after each layer to avoid vanishing gradients. A ReLU activation function is applied to each layer. The projection head contains a densely-connected layer with 512 units, a ReLU activation function, and a densely-connected layer with 2 units.
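A minimal PyTorch sketch of the described architecture is given below, following the layer sizes stated above; anything not stated in the text (e.g., the batch size in the usage line) is illustrative.

```python
import torch
import torch.nn as nn

class CDREncoder(nn.Module):
    """Encoder (four dense layers) plus a projection head to 2D, as described."""
    def __init__(self, input_dim=512):
        super().__init__()
        layers, in_dim = [], input_dim
        for out_dim in (128, 256, 256, 512):            # four densely-connected layers
            layers += [nn.Linear(in_dim, out_dim),
                       nn.BatchNorm1d(out_dim),         # batch norm after each layer
                       nn.ReLU()]                       # ReLU on each layer
            in_dim = out_dim
        self.encoder = nn.Sequential(*layers)
        self.projection_head = nn.Sequential(           # 512 units -> 2D embedding
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))

    def forward(self, x):
        return self.projection_head(self.encoder(x))

model = CDREncoder(input_dim=512)                       # e.g., SimCLR image features
embedding = model(torch.randn(32, 512))                 # -> (32, 2) low-dimensional points
```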
Training Process
We employ the Adam optimizer for network training, with an initial learning rate of 0.001. The learning rate is successively decayed by 10% when the number of epochs reaches 0.8E and 0.9E.
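The corresponding optimizer and schedule could be set up as follows, assuming E denotes the total number of epochs, reading "decayed by 10%" as multiplying the learning rate by 0.9 at each milestone (an assumption), and reusing the illustrative CDREncoder model from the sketch above.

```python
import torch

E = 400  # illustrative total number of epochs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.8 * E), int(0.9 * E)], gamma=0.9)

for epoch in range(E):
    # ... iterate over training batches, compute the contrastive loss,
    #     and call loss.backward() / optimizer.step() here ...
    scheduler.step()  # decay the learning rate when reaching 0.8E and 0.9E
```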
3.5 The Prototype Interface
To demonstrate the effectiveness of the proposed interactive visual cluster analysis, we develop a visual interface that integrates CDR with a set of interactive visualizations (Fig. 5). The interface provides a control panel (Fig. 5(a)) for steering parameters. Embedding results are presented in a scatterplot view (Fig. 5(b)). If the dataset contains images, thumbnails of randomly sampled images can be displayed simultaneously in the picture view. In addition, a parallel coordinates plot is utilized to explore the high-dimensional features of data points (Fig. 5(c)). If the dimensionality of a dataset is too high, users can apply Principal Component Analysis [1] to reduce the dimensionality so that only the major information is represented. The link board records the link-based constraints specified by users (Fig. 5(d)). The context of each constrained link is displayed in a thumbnail, where the link is highlighted in a heatmap of the scatterplot view. The four views are coordinated to support CDR, allowing users to interactively examine the dimensional features and steer the embedding results through link-based interactions.
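For instance, the PCA preprocessing for the parallel coordinates view can be as simple as the following sketch; the data matrix and the number of retained components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(2000, 100)        # placeholder for a very high-dimensional dataset
pca = PCA(n_components=10)           # keep only the major components
X_reduced = pca.fit_transform(X)     # features shown in the parallel coordinates plot
```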
Evaluation and Results
4.1 Quantitative Evaluation
In this section, we use quantitative experiments to evaluate CDR without user steering. The experimental environment uses a desktop PC with Intel Core i9-7900X (3.30 GHz), NVIDIA GeForce GTX 1080 Ti, 128 GB memory, and Windows 10 installed.
Datasets
We use 10 real-world datasets including Animals [26], Cifar10 [46], Indian Food [4], Isolet [33], MNIST [51], Stanford Dogs [44], Texture [7], USPS [38], Weathers [77] and WiFi [12]. The types of data contain image data, tabular data, and text data. All of them have at least two clusters and have class labels as the ground truth.
DR Techniques
For comparison, we select five non-parametric DR techniques, i.e., ISOMAP [73], Landmark ISOMAP (LISOMAP) [70], t-SNE [76], AtSNE [34], and UMAP [55]; and four parametric DR techniques, i.e., Parametric t-SNE (PtSNE) [75], Parametric UMAP (PUMAP) [67], Deep Recursive Embedding (DRE) [92], and Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) [36]. Although there are many contrastive learning applications in computer vision, to the best of our knowledge, DrLIM is the only contrastive learning technique for general DR tasks. We employ the implementations of ISOMAP, t-SNE, and UMAP in the sklearn library. For the other techniques, we use the code published by their authors. Because we cannot find the code of LISOMAP, we implement it according to the paper. In the ablation experiments, we test the contrastive learning framework with Cross Entropy (CE), NT-Xent (NX-CDR), and gradient redefinition (CDR). All results are without user steering.
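For reference, some of the non-parametric baselines are available in off-the-shelf libraries and can be produced as in the sketch below; the data and the neighborhood size are placeholders, since the exact settings are given in the parameter description that follows.

```python
import numpy as np
from sklearn.manifold import TSNE, Isomap
import umap  # umap-learn package

X = np.random.rand(1000, 50)                                   # placeholder dataset
emb_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)
emb_isomap = Isomap(n_components=2, n_neighbors=15).fit_transform(X)
emb_umap = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(X)
```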
Parameters
We set the perplexity of t-SNE, AtSNE and PtSNE as 30, and the size of kNN neighbor in UMAP, PUMAP, ISOMAP and LISOMAP as
4.1.1 Measures
Based on a literature review [32], [68], we employ six measures to evaluate the DR techniques. Specifically, we select Trustworthiness & Continuity [43] for preserving neighborhood structures, kNN classifier accuracy [27] & Neighbor Hit [59] for correcting false neighbors, and Distance Consistency [71] & Silhouette Coefficient [65] for visual cluster separation. For the first four measures that need to specify the kNN, we set
Trustworthiness measures how much the kNN neighborhood of a point in the embedding space reflects the true neighborhood in the high-dimensional space.
Continuity measures how much the kNN neighborhood of a point in the high-dimensional space is preserved in the embedding space.
k-NN classifier accuracy (kNN-CA) measures the accuracy of the kNN-based classification in the embedding space. For each point, we predict its class label by majority voting of its kNN in the embedding space. A high kNN-CA close to 1 indicates a high consistency between points and their kNN in terms of classification.
Neighbor Hit (NH) measures the proportion of points in the kNN that belong to the same cluster as their center point in the embedding space. A high NH that is close to 1 indicates good quality. On the contrary, a low NH that is close to 0 indicates that many false neighbors are preserved in the embedding space.
Distance Consistency (DSC) is the proportion of points which have the same class as their nearest class centroids. DSC is the best separation measure according to Sedlmair and Aupetit [68].
Silhouette Coefficient (SC) measures the difference between the between-class and within-class average distances, normalized by the larger of the two. Compared to DSC, which measures the point-class relationship, SC evaluates separation from a class-class perspective. It is also widely used in measuring visual class separation.
Except for SC, which ranges from −1 to 1, the other five measures are normalized to [0, 1]. For all six measures, higher values indicate higher quality. It is worth noting that we use class labels as the ground-truth clusters in kNN-CA, NH, DSC, and SC. Although this is a common practice in evaluating DR techniques [60], [92], one should be aware of the possible class-cluster mismatch [8]. In our experiments, we choose datasets with well-known cluster structures to address this issue.
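As a concrete illustration, several of these measures can be computed with sklearn as sketched below; the data, labels, and neighborhood size are placeholders, and Neighbor Hit and DSC are omitted for brevity.

```python
import numpy as np
from sklearn.manifold import trustworthiness
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier

def knn_classifier_accuracy(emb, labels, k=10):
    """kNN-CA: predict each point's label by majority vote of its neighbors.
    Note: fitting and scoring on the same points includes each point among
    its own neighbors, a small simplification of the leave-one-out variant."""
    return KNeighborsClassifier(n_neighbors=k).fit(emb, labels).score(emb, labels)

X = np.random.rand(500, 30)            # high-dimensional data (placeholder)
Y = np.random.rand(500, 2)             # its 2D embedding (placeholder)
labels = np.random.randint(0, 5, 500)  # ground-truth class labels

t = trustworthiness(X, Y, n_neighbors=10)    # neighborhood preservation in Y
c = trustworthiness(Y, X, n_neighbors=10)    # Continuity = Trustworthiness with roles swapped
sc = silhouette_score(Y, labels)             # visual cluster separation (SC)
acc = knn_classifier_accuracy(Y, labels, k=10)
```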
4.1.2 Comparison Results
Table 1 to Table 6 present the quantitative comparison of well-known existing techniques with CDR on 10 datasets with respect to each of the six measures. From Table 1 and Table 2, it is observed that CDR achieves the best average Trustworthiness among all techniques, but its average Continuity is slightly lower than that of t-SNE, UMAP, PtSNE, and PUMAP. The main reason is that the calculation of Continuity includes false neighbors. Removing false neighbors in the embedding space, although it improves the embedding quality, results in a decrease in Continuity. This is verified by the results on correcting false neighbors. According to Table 3 and Table 4, CDR achieves the best performance on most datasets in terms of kNN-CA and Neighbor Hit, demonstrating its improved neighborhood accuracy and ability to correct false neighbors. In terms of visual cluster separation, Table 5 and Table 6 show that CDR outperforms all other techniques in DSC and SC. Fig. 6 shows the embedding results of the selected techniques on three typical datasets. CDR presents a much clearer visual cluster separation in the embedding results. In conclusion, CDR achieves comparable quality to t-SNE and UMAP in terms of preserving neighborhood structures. Considering the false neighbors, CDR outperforms all other techniques in terms of preserving correct neighborhood structures. CDR also achieves the best performance in terms of visual cluster separation.
4.1.3 Results of Ablation Experiments
We compare the performance of CE, NX-CDR, and CDR to evaluate the effectiveness of the adapted NT-Xent and the proposed gradient redefinition. In terms of preserving neighborhood structures, NX-CDR performs better than CE. CDR achieves higher average Trustworthiness but lower average Continuity than NX-CDR (see Table 1 and Table 2). In terms of correcting false neighbors, NX-CDR also performs better than CE. CDR performs better than NX-CDR (see Table 3 and Table 4). Again, the results suggest that Continuity is distorted by false neighbors. In terms of visual cluster separation, CDR outperforms CE and NX-CDR in DSC and SC for all datasets (see Table 5 and Table 6). In conclusion, the adapted NT-Xent has positive effects on preserving correct neighborhood structures. The proposed gradient redefinition is effective in improving visual cluster separation.
4.1.4 Runtime Performance
Table 7 shows the running time of the evaluated techniques. Among them, AtSNE and UMAP are the fastest, at the cost of slightly lower quality than t-SNE. CDR runs faster than the original t-SNE, but much slower than the other non-parametric techniques. Among parametric techniques, CDR runs faster than PtSNE and PUMAP, the parametric versions of t-SNE and UMAP. In our experiments, we use the same backbone network in DRE and CDR; therefore, they have similar time performance. DrLIM achieves better time performance than CDR because it has a simpler loss function. Generally, parametric techniques run slower than non-parametric ones due to their training process. However, it is worthwhile to develop parametric techniques, not only because of the better embedding quality, but also because parametric techniques offer several opportunities for DR applications. We will discuss these opportunities in Section 5.
4.2 User Study
We have conducted a user study to evaluate the ability of CDR without user steering to present well-separated clusters in terms of visual perception. In particular, we focus on the cluster identification task to test whether projected clusters are well separated and easily recognized by users. We consider two state-of-the-art DR techniques, t-SNE and UMAP, as baseline techniques and compare them with CDR. We put forward two hypotheses to verify:
H1: CDR performs better in identifying clusters;
H2: CDR consumes less time for identifying clusters;
Tasks
We designed a controlled experiment to evaluate the performance of cluster identification. Participants were tested using the three DR techniques on 20 datasets. Ten of them are the same as those described in Section 4.1; the remaining datasets are described in the appendix. The controlled experiment was based on the cluster identification task (E1) described in Xia et al. [84]. We followed the same task design and asked each participant to complete
Participants, Apparatus, and Procedure
We recruited 20 participants (16 males and 4 females) for the user study, aged 23 to 27. None of them reported color blindness or color weakness. All the participants are graduate students with research experience in data visualization. To remove any bias from the analysis processes, we adopted the same testing application described in Xia et al. [84] and integrated CDR and the baseline techniques. All the participants took the experiments one by one on standard desktop computers in our research lab. Each computer is equipped with a standard keyboard, a mouse, and a DELL 27-inch U2720QM monitor. The computer has a
Results
The results of precision and recall values for the controlled experiment are displayed in Fig. 7(a) and Fig. 7(b). Among the three techniques, CDR has the best precision (80.68%) and recall (69.29%). There is a relatively small difference between t-SNE and UMAP on precision (65.64% and 66.25%, respectively) and recall (55.49% and 55.50%, respectively). The Friedman tests show statistical significance of precision
We conducted a post-interview to subjectively compare CDR with t-SNE and UMAP by reviewing their results side by side for each dataset. All participants agree that CDR performs better than t-SNE and UMAP in terms of presenting visually separated cluster patterns. They point out that the clearer visual separation makes them more confident in identifying clusters. However, they also feel somewhat uncertain about the results, as the ground truth is unknown. Some participants point out that providing the quality measures, such as Trustworthiness and Continuity, can strengthen their confidence.
4.3 Case Studies
In this section, we demonstrate how users can use the visual interface and the link-based interactions to steer the embeddings for visual cluster analysis. Using the visual interface, we carried out analyses on two datasets: an image dataset Animals [26] and a tabular dataset WiFi [12].
Animals
The Animals dataset contains 10000 animal images. Each image is encoded into a 512-dimensional feature vector as mentioned in Section 3.4. Using the default parameters, the data were embedded into the scatterplot view, as shown in Fig. 1(e). 11 clusters were observed. However, there was a much smaller cluster at the top left corner. Hovering over points in this cluster and its two neighboring clusters, we found that both the small cluster and its neighboring larger cluster contained butterfly images. Switching to the image mode, we found that the larger cluster contained images where butterflies dominate the foreground, while the smaller cluster contained images where the backgrounds dominate. That is why they formed two clusters. However, considering that the interest is in animals, we would like to merge the two clusters into one. Therefore, we added a must link between the two clusters. As merging two clusters essentially aligns the two clusters' centers, we chose the two ends of the must link to be butterfly images near the two cluster centers. After re-embedding, the two clusters were successfully merged (Fig. 1(g)). Some spider images were observed at the boundary of the merged cluster (e.g., Fig. 1(g) A). We thus further improved the results by adding more links. We first added a cannot link between one of the spider images at the boundary and a nearby butterfly image to push away the mis-clustered spider images. We then added a must link between one of the mis-clustered spider images and a spider image near the center of the spider cluster to attract them together. Fig. 1(h) shows the results. With only the three links added, our method successfully merged the two butterfly clusters and separated the spider images. This demonstrated the effectiveness of adding links to improve clustering results. However, added links are expected to assist the similarity measure in improving the clustering results, rather than being the dominant factor that determines them. We thus expect some resistance of our method to "bad" links. To verify this, we added a must link between two significantly different clusters: chicken and dog. As shown at the top of Fig. 8(b), the overall clustering pattern remained unchanged. It was not until three must links were added (Fig. 8(b)) that the chicken and dog clusters were merged (Fig. 8(c)), in comparison to the single must link needed to merge the two butterfly clusters. This demonstrated some degree of resistance of our method to "bad" links. Meanwhile, the Trustworthiness and Continuity measures also decreased significantly (from 0.910 to 0.757 and from 0.910 to 0.853, respectively).
WiFi
We then moved on to analyze the clustering results on the WiFi dataset. This dataset contains 2000 vectors recording the signal strengths at 2000 locations in four rooms (500 locations in each room). There are 7 routers in the space, from which the received signal strengths constitute the 7 dimensions of each vector. A good clustering of these vectors is expected to correspond to the four different rooms. Fig. 9(a) shows the embedding results using the default parameters. Five clusters were observed rather than the expected four. The problem seemed to lie in the three clusters (C1, C2, C3) that were closely positioned at the bottom right. To find the reasons, we had a closer examination of their parallel coordinates. It was observed that the data points in C1 and C2 did exhibit similar patterns in the parallel coordinates (Fig. 9(d), (e)). The distributions along the dimensions had large overlaps, especially along A0, A1, A2, A3, and A4. This means a large number of data points (locations) in C1 and C2 had similar distances to these routers, which we considered to be why C1 and C2 were closely positioned in the embedding space. To refine the CDR to better distinguish locations in the two clusters, we added a cannot link. The re-embedding results showed the successful separation of the two clusters (Fig. 9(b)). We then examined C3. C3 had a much smaller number of data points, which indicated that it was separated from its main cluster. Comparing the parallel coordinates between C3 (Fig. 9(f)) and C1/C2, it can be observed that C3 had very different distributions along A0 and A3 compared with C1 and C2. However, along the other 5 dimensions, the distributions of C3 were fully enclosed by the distributions of C2. Together with the fact that C3 and C2 were positioned closest to each other in the embedding space, these were strong indications that the locations in C3 should be in the same room as those in C2. Their differences along A0 and A3 explained their separation, but these were more likely due to obstacles in the room. Based on this reasoning, we would like to refine the CDR to merge C2 and C3. We thus added a must link between C2 and C3. After re-embedding, the results had four well-separated clusters as expected (Fig. 9(c)), demonstrating the improved dimensionality reduction for visual cluster analysis.
Lesson Learned
Our case studies show that the link-based interaction is very efficient. To merge similar clusters or separate dissimilar clusters, only one or two links are needed. Regarding "bad" links, such as must links between dissimilar clusters, our technique shows some degree of resistance; three or more links are required for such counterfactual interactions. Users should be aware of the power of the interactions when exploring cluster patterns. A suggestion is to use the embedding quality measures as an indication of whether an interaction is appropriate.
Discussions
Opportunities of Parametric Techniques
Although parametric techniques are generally slower than non-parametric ones, they provide an explicit representation of the learned manifold mapping, which can be reused in new datasets and shared among multiple users. This brings several application opportunities for parametric techniques, such as fast embedding of new out-of-sample data [67], [75], federated learning-based joint projection [83], and reusing edited embedding on similar datasets. These applications are non-trivial for non-parametric techniques.
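For example, embedding new out-of-sample data with a trained parametric model reduces to a single forward pass, as sketched below using the illustrative CDREncoder from the Section 3.4 sketch.

```python
import torch

# model is the illustrative CDREncoder trained earlier (Section 3.4 sketch)
new_points = torch.randn(5, 512)        # unseen high-dimensional samples
model.eval()
with torch.no_grad():
    new_embedding = model(new_points)   # (5, 2) positions in the existing map
```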
Comparison Between Parametric and Non-Parametric DR Techniques
On the one hand, parametric DR techniques run much slower than non-parametric ones, such as AtSNE and UMAP. The main reason is that the training process of parametric techniques is time-consuming. On the other hand, the proposed CDR achieves better embedding quality than all evaluated techniques in terms of both accuracy and visual cluster separation. Critical issues of non-parametric techniques include false neighbors and missing neighbors. With well-designed contrastive settings, we alleviate these issues and thus improve the embedding quality. We would also like to point out that a parametric framework alone cannot address the issues of false neighbors and missing neighbors. For example, PtSNE [75] and PUMAP [67] perform worse than their non-parametric counterparts, t-SNE and UMAP, respectively. The proposed CDR is the first parametric DR technique that achieves embedding quality comparable to state-of-the-art non-parametric techniques.
Conclusion
In this paper, we propose an interactive visual cluster analysis approach. We introduce contrastive learning into dimensionality reduction, propose a gradient redefinition technique, and enable link-based interactions on the embedding. The experiments and user study show that the proposed CDR outperforms existing DR techniques in preserving neighborhood structures, correcting false neighbors, and visually separating clusters. Case studies demonstrate the effectiveness of the link-based interactions.
ACKNOWLEDGMENTS
The authors would like to thank the helpful comments from the anonymous reviewers. This work is supported by the National Natural Science Foundation of China (No. 61872389, 62132017, 61872388), and the High Performance Computing Center of Central South University.