Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross-frame and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association in an implicit manner. On existing benchmarks, our method outperforms existing unsupervised methods using only limited help from the ReID head, and even provides higher accuracy than many fully supervised methods.
1. Introduction
As a basic task in computer vision, Multiple Object Tracking (MOT) is widely applied in a variety of fields, including robot navigation, intelligent surveillance, and other areas [36, 33]. Currently, one of the most popular tracking paradigms is joint detection and re-identification (ReID) embedding. In the supervised case, ReID is regarded as a classification task. To keep track of objects, many works [39, 45] utilize appearance features for object association, where the representation ability of the ReID head directly affects the accuracy of object association.
However, since existing labeled datasets are limited in various respects, there has been a growing need to annotate tracking datasets to meet the needs of researchers,
Figure 1. Supervised and unsupervised MOT. In the joint detection and ReID embedding framework, to obtain discriminative embeddings for tracking, the left branch shows the usual supervised MOT training, i.e., given labels, the model is trained on an object classification task. The middle branch shows a common unsupervised training method, i.e., objects are clustered, and targets with high similarity are regarded as the same class. The right branch is the proposed method, which uses contrastive similarity learning to improve the similarity of the same objects without label information.
which is costly and time-consuming. Therefore, unsupervised learning of visual representations has attracted great attention in tracking. Some works [35, 30] have demonstrated that the network can still be trained in the right direction even without ground truth. Other works [5, 6] directly use ReID features to cluster objects with high similarity
into the same class, then generate pseudo-labels to train the network, as shown in the middle branch of Figure 1. But these cluster-based methods tend to accumulate errors during training. In contrast, we consider using Unsupervised Contrastive Similarity Learning (UCSL) to train the ReID branch without generating pseudo-labels, as shown in the right branch of Figure 1.
As a video task, the objects in MOT keep changing over time, which leads to inevitable mutual occlusion between objects, and between objects and non-objects, as well as the disappearance of old objects and the appearance of new ones. Occluded objects are not represented consistently from frame to frame due to additional interfering features. Lost and emerging objects theoretically cannot be matched with other objects, and they are almost always negatives for the current association stage. Thus, it is difficult for the model to determine whether two arbitrary objects are the same or not. In the supervised case, ID labels can be used to give the training a more explicit direction, while in the unsupervised case, the dual limitation of missing labels and these inherent problems makes unsupervised MOT even more challenging. We therefore seek latent connections between objects to determine whether they are identical.
In this paper, we propose an Unsupervised Contrastive Similarity Learning (UCSL) method to solve the inherent object association problems of unsupervised MOT. Specifically, UCSL consists of three modules, self-contrast, cross-contrast, and ambiguity contrast, designed to address different issues. For self-contrast, we first match objects within a frame and objects across adjacent frames. Correspondingly, we obtain direct and indirect matching results for intra-frame objects. We then maximize the matching probability of each object to itself, thereby maximizing the similarity of the same objects. For cross-contrast, since the cross-frame matching results should theoretically be consistent with the final results of continuous matching, we improve the similarity of occluded objects by making these two matching results as close as possible. For ambiguity contrast, we match ambiguous objects, mainly occluded, lost, and emerging objects whose final similarity is generally low, against each other to further determine object identity. Our proposed method is simple but effective, achieving outstanding performance by utilizing only the ReID embeddings, without adding any additional branch such as occlusion handling or optical-flow-based cues to the detection branch.
We implement the method on the basis of FairMOT [45], using a model pre-trained on the COCO dataset [19]. Experiments on the MOT15 [15], MOT17 [24], and MOT20 [7] datasets are conducted to evaluate the effectiveness of the proposed method. The performance of our unsupervised approach is comparable with, or better than, that of some supervised methods relying on expensive annotations.
Overall, our contributions are summarized as follows:
We propose a contrastive similarity learning method for the unsupervised MOT task, which pursues latent object consistency based only on the sample features in the ReID module, without any ID information.
We design three useful modules to model associations between objects in different cases. Specifically, the self-contrast module matches intra-frame objects, the cross-contrast module associates cross-frame objects, and the ambiguity contrast module deals with hard/corner cases (e.g., occluded objects, lost objects, etc.).
Experiments on MOT15 [15], MOT17 [24], and MOT20 [7] demonstrate the effectiveness of the proposed UCSL method. As an unsupervised method, UCSL outperforms state-of-the-art unsupervised MOT methods and even achieves performance similar to that of fully supervised MOT methods.
2. Related Work
Multi-Object Tracking. Multi-object tracking is the task of localizing objects in consecutive frames and then associating them according to their identity. For a long time, the most classic tracking paradigm has been tracking-by-detection [27]: first, an object detector detects objects in every frame, and second, a tracker associates these objects across frames. A large number of works [3, 40, 32, 4] in this paradigm have achieved decent performance, but the paradigm relies heavily on the performance of the detector. In the past two years, the joint detection and tracking (or embedding) paradigm has grown stronger. Some transformer-based MOT architectures [31, 23, 41] design two decoders to perform detection and object propagation, respectively. JDE [39] and FairMOT [45] directly incorporate the appearance model into a one-stage detector, so the model can simultaneously output detection results and the corresponding embeddings. These simple but effective frameworks are exactly what we are looking for, so we take FairMOT [45] as our baseline.
Unsupervised Tracking. For some tasks, existing datasets or other resources cannot meet the needs of researchers. Under this condition, unsupervised learning has become a popular solution, and its efficiency has been demonstrated in related studies [12, 21, 35, 30]. SimpleReID [12] first used unlabeled videos and the corresponding detection sets, generated tracking results with SORT [3] to simulate labels, and trained the ReID network to predict the labels of the given images. It is the first demonstration of the effectiveness of a simple unsupervised ReID network
Figure 2. The overall pipeline of our proposed unsupervised contrastive similarity learning model (UCSL), which learns representations with self-contrast, cross-contrast, and ambiguity contrast.
for MOT. Liu et al. [21] proposed OUTrack, a model that combines an unsupervised ReID learning module with a supervised occlusion estimation module to improve tracking performance.
Re-Identification. In the field of re-identification, which is closely related to MOT, unsupervised learning has been widely used through various means, including domain adaptation, clustering, etc. Considering the visual similarity and cycle consistency of labels, MMCL [34] predicts pseudo-labels and regards each person as a class, transforming ReID into a multi-classification problem. Other works [5, 6, 20] also utilize clustering algorithms to generate pseudo-labels and take them as ground truth to train the network. However, error accumulation easily occurs during the iterative process. More recent methods adopt self-supervised learning: Wang et al. [38] proposed CycAs, inspired by the data association concept in multi-object tracking. By using the self-supervised signal as a constraint on the data, the network gradually strengthens its feature representation ability during training.
Cycle Consistency. Cycle consistency was originally proposed in Generative Adversarial Networks (GANs) and is widely used in segmentation, tracking, etc. Jabri et al. [10] constructed a space-time graph from video and cast correspondence as prediction of links. Through cycle consistency, a single path-level constraint implicitly supervises chains of intermediate comparisons. Wang et al. [37] used cycle consistency in time as a free supervisory signal for learning visual representations from scratch. They then used the acquired representation to find nearest neighbors across space and time in a range of visual correspondence tasks.
Contrastive Learning. Contrastive learning has shown great potential in self-supervised learning. Pang et al. [26] proposed QDTrack, which densely samples hundreds of region proposals on a pair of images for contrastive learning, and combined it directly with existing detection methods. Yu et al. [42] proposed multi-view trajectory contrastive learning and designed a trajectory-level contrastive loss to explore inter-frame information over whole trajectories. Bastani et al. [1] proposed constructing two different inputs for the same video sequence by hiding different information; they then computed the trajectory of that sequence by applying an RNN model independently to each input, and trained the model with contrastive learning to produce consistent tracks.
3. Method
In this section, we first introduce the overall pipeline, as illustrated in Figure 2, and then describe the corresponding concepts in detail in the subsequent parts. Finally, we present the full training and inference procedure.
3.1. Contrastive Similarity Learning
Given three consecutive images $\boldsymbol{I}_{1}, \boldsymbol{I}_{2}, \boldsymbol{I}_{3} \in \mathbb{R}^{H \times W \times 3}$, we first feed them to the backbone; then, through the detection branches and ReID heads, we obtain detection results and ReID feature maps, as shown in Figure 2. Based on the positions of the ground-truth bounding boxes, the feature embedding corresponding to each object is obtained from the corresponding feature map, forming the embedding matrices $\boldsymbol{X}_{1}=\left[\boldsymbol{x}_{1}^{0}, \boldsymbol{x}_{1}^{1}, \ldots, \boldsymbol{x}_{1}^{N-1}\right] \in \mathbb{R}^{D \times N}$, $\boldsymbol{X}_{2}=\left[\boldsymbol{x}_{2}^{0}, \boldsymbol{x}_{2}^{1}, \ldots, \boldsymbol{x}_{2}^{M-1}\right] \in \mathbb{R}^{D \times M}$, and $\boldsymbol{X}_{3}=\left[\boldsymbol{x}_{3}^{0}, \boldsymbol{x}_{3}^{1}, \ldots, \boldsymbol{x}_{3}^{K-1}\right] \in \mathbb{R}^{D \times K}$, where $N$, $M$, and $K$ are the numbers of objects in $\boldsymbol{I}_{1}$, $\boldsymbol{I}_{2}$, and $\boldsymbol{I}_{3}$, respectively, and $D$ is the embedding dimension.
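For concreteness, the following is a minimal PyTorch-style sketch of how such per-object embedding matrices could be gathered from a ReID feature map at the object centers; the helper name `gather_embeddings`, the center-sampling strategy, and all tensor sizes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def gather_embeddings(reid_map, centers):
    """Gather L2-normalized per-object embeddings from a ReID feature map.

    reid_map: (D, H, W) ReID feature map of one frame.
    centers:  (N, 2) integer (row, col) object centers on the feature grid.
    Returns:  (D, N) embedding matrix X with unit-norm columns, so that
              X.T @ X directly gives cosine similarities.
    """
    cols = reid_map[:, centers[:, 0], centers[:, 1]]   # (D, N)
    return F.normalize(cols, dim=0)

# Example with assumed sizes: D = 128, feature grid 152 x 272.
reid_map1 = torch.randn(128, 152, 272)
reid_map2 = torch.randn(128, 152, 272)
centers1 = torch.stack([torch.randint(0, 152, (5,)),
                        torch.randint(0, 272, (5,))], dim=1)   # N = 5 objects
centers2 = torch.stack([torch.randint(0, 152, (7,)),
                        torch.randint(0, 272, (7,))], dim=1)   # M = 7 objects
X1 = gather_embeddings(reid_map1, centers1)   # (D, N)
X2 = gather_embeddings(reid_map2, centers2)   # (D, M)
S_is = X1.t() @ X2                            # (N, M) cosine-similarity matrix
```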
The ReID branch is connected to three contrastive similarity learning branches: (1) Self-contrast uses intra-frame direct and inter-frame indirect self-matching to obtain discriminative representations, reducing feature interference from other objects by maximizing self-similarity. (2) Cross-contrast uses cross-frame and continuous-frame matching, then adjusts the similarity between objects to extract features more beneficial for object association. (3) Ambiguity contrast takes occluded, lost, and emerging objects into account simultaneously; these ambiguous objects are matched with each other again to further increase the certainty of subsequent object association. We describe the specific operations in Sections 3.1.1, 3.1.2, and 3.1.3, respectively.
3.1.1 Self-Contrast Module
Based on the latent knowledge that objects in the same frame must belong to different classes, we know that the self-to-self similarity should be large. The proposed self-contrast therefore ultimately reduces to a self-to-self comparison, which is a strong, deterministic self-supervised restriction. This strong restriction allows us to improve the similarity of the same targets and reduce interference from other objects through direct and indirect self-contrast learning, as shown in the first column of Figure 3.
Direct Self-Contrast. We use the current feature matrix $\boldsymbol{X}_{1}=\left[\boldsymbol{x}_{1}^{0}, \boldsymbol{x}_{1}^{1}, \ldots, \boldsymbol{x}_{1}^{N-1}\right] \in \mathbb{R}^{D \times N}$ to directly compute the self-similarity matrix $\boldsymbol{S}_{ds}=\boldsymbol{X}_{1}^{T} \boldsymbol{X}_{1} \in \mathbb{R}^{N \times N}$, where $T$ denotes the transpose operation. Then we compute the assignment matrix with a softmax operation, as
S_(dsc)=psi_("row ")(S_(ds))\boldsymbol{S}_{d s c}=\boldsymbol{\psi}_{\text {row }}\left(\boldsymbol{S}_{d s}\right)
where psi_("row ")\psi_{\text {row }} is row-wise softmax operation. psi_("row ")\psi_{\text {row }}
Indirect Self-Contrast. MOT itself operates on multiple frames, so we further perform self-contrast similarity learning by indirect self-to-self matching. To measure the similarity between objects, we compute cosine similarities to get a similarity matrix between objects of different frames, $\boldsymbol{S}_{is}=\boldsymbol{X}_{1}^{T} \boldsymbol{X}_{2} \in \mathbb{R}^{N \times M}$. Similar to Eq. 1, we calculate the association matrices $\boldsymbol{S}^{1 \rightarrow 2}=\psi_{\text{row}}\left(\boldsymbol{S}_{is}\right)$ and $\boldsymbol{S}^{2 \rightarrow 1}=\psi_{\text{row}}\left(\boldsymbol{S}_{is}^{T}\right)$. The results $\boldsymbol{S}^{1 \rightarrow 2}$ and $\boldsymbol{S}^{2 \rightarrow 1}$ are considered to match the targets in $\boldsymbol{I}_{1}$ to those in $\boldsymbol{I}_{2}$, and the targets in $\boldsymbol{I}_{2}$ to those in $\boldsymbol{I}_{1}$, respectively. The elements of $\boldsymbol{S}^{1 \rightarrow 2}$ and $\boldsymbol{S}^{2 \rightarrow 1}$ in the $i$-th row and $j$-th column are, respectively:

$S_{ij}^{1 \rightarrow 2}=\frac{\exp \left(\left(\boldsymbol{x}_{1}^{i}\right)^{T} \boldsymbol{x}_{2}^{j} / \tau\right)}{\sum_{k=0}^{M-1} \exp \left(\left(\boldsymbol{x}_{1}^{i}\right)^{T} \boldsymbol{x}_{2}^{k} / \tau\right)}, \qquad S_{ij}^{2 \rightarrow 1}=\frac{\exp \left(\left(\boldsymbol{x}_{2}^{i}\right)^{T} \boldsymbol{x}_{1}^{j} / \tau\right)}{\sum_{k=0}^{N-1} \exp \left(\left(\boldsymbol{x}_{2}^{i}\right)^{T} \boldsymbol{x}_{1}^{k} / \tau\right)},$
where $\tau$ is a temperature hyper-parameter [38].
According to cycle association consistency, after the forward association $\boldsymbol{S}^{1 \rightarrow 2}$ and the backward association $\boldsymbol{S}^{2 \rightarrow 1}$, each object should ideally match itself again,
$\boldsymbol{S}_{isc}=\boldsymbol{S}^{1 \rightarrow 2} \boldsymbol{S}^{2 \rightarrow 1}.$
Figure 3. Self-Contrast and Cross-Contrast. We use three sets of indirect self-contrast and two sets of cross-contrast with different inputs. For brevity, we show only one set of specific feature computations for each contrast.
The corresponding self-contrast loss can be formulated as:

$L_{sc}=-\frac{1}{N}\left(\sum \log \left(\operatorname{diag}\left(\boldsymbol{S}_{dsc}\right)\right)+\sum \log \left(\operatorname{diag}\left(\boldsymbol{S}_{isc}\right)\right)\right),$
where $\operatorname{diag}(\cdot)$ extracts the diagonal elements.
Due to the self-contrast, the similarity between the same targets should obviously be the largest, i.e., the diagonal elements of $\boldsymbol{S}_{dsc}$ and $\boldsymbol{S}_{isc}$ obtained above are the largest and should be as close to 1 as possible.
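Putting the two terms together, the following is a hedged sketch of the full self-contrast loss $L_{sc}$; the temperature value and tensor shapes are illustrative assumptions.

```python
import torch

def self_contrast_loss(X1, X2, tau=0.1):
    """Full self-contrast loss L_sc combining the direct and indirect terms.

    X1: (D, N), X2: (D, M) column-normalized embeddings of two frames;
    tau is the temperature (its value here is an assumption).
    """
    # Direct term: intra-frame assignment matrix (Eq. 1).
    S_dsc = torch.softmax(X1.t() @ X1, dim=1)          # (N, N)
    # Indirect term: forward/backward association, then cycle composition.
    S_is = X1.t() @ X2                                 # (N, M)
    S_12 = torch.softmax(S_is / tau, dim=1)            # (N, M)
    S_21 = torch.softmax(S_is.t() / tau, dim=1)        # (M, N)
    S_isc = S_12 @ S_21                                # (N, N) cycle matrix
    # Diagonals of both matrices should approach 1.
    diag = torch.diagonal
    return -(torch.log(diag(S_dsc)) + torch.log(diag(S_isc))).sum() / X1.shape[1]
```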
3.1.2 Cross-Contrast Module
In almost all MOT scenes, there is some degree of object occlusion, and the similarity of occluded objects is generally low. Since MOT operates on many consecutive frames, the negative impact of these occluded objects can last a long time. Since, theoretically, the cross-frame matching results should be the same as the final results of continuous matching, we use a weaker unsupervised restriction, i.e., a direct (cross-frame) vs. indirect
(continuous-frame) association similarity comparison, to alleviate the above issue.
Specifically, we take three frames $\boldsymbol{I}_{1}, \boldsymbol{I}_{2}, \boldsymbol{I}_{3} \in \mathbb{R}^{H \times W \times 3}$ as inputs. Similar to Section 3.1.1, we calculate the target matching matrices between different frames, i.e., $\boldsymbol{S}^{1 \rightarrow 2}, \boldsymbol{S}^{2 \rightarrow 1}, \boldsymbol{S}^{2 \rightarrow 3}, \boldsymbol{S}^{3 \rightarrow 2}, \boldsymbol{S}^{1 \rightarrow 3}, \boldsymbol{S}^{3 \rightarrow 1}$. As shown in the second column of Figure 3, we utilize $\boldsymbol{S}^{2 \rightarrow 1}$ and $\boldsymbol{S}^{3 \rightarrow 2}$ to compute the association matrix of $3 \rightarrow 1$, and similarly use $\boldsymbol{S}^{1 \rightarrow 2}$ and $\boldsymbol{S}^{2 \rightarrow 3}$ to compute the association matrix of $1 \rightarrow 3$, as

$\boldsymbol{S}_{*}^{1 \rightarrow 3}=\boldsymbol{S}^{1 \rightarrow 2} \boldsymbol{S}^{2 \rightarrow 3}, \qquad \boldsymbol{S}_{*}^{3 \rightarrow 1}=\boldsymbol{S}^{3 \rightarrow 2} \boldsymbol{S}^{2 \rightarrow 1}.$
These matching matrices, generated indirectly through a middle frame, should be the same as the directly generated matching results.
We use relative entropy to measure the difference between the two matching distributions. The KL divergence [14] is often used to compute the difference between two distributions $P$ and $Q$,

$KL(P \| Q)=\sum_{i} P(i) \log \frac{P(i)}{Q(i)},$

but it is asymmetric. We therefore utilize the JS divergence [18], which is symmetric:
$JSD(P \| Q)=\frac{1}{2} KL(P \| T)+\frac{1}{2} KL(Q \| T),$
where $T=(P+Q)/2$. The corresponding cross-contrast loss is as follows:
$L_{cc}=\frac{1}{N} JSD\left(\boldsymbol{S}_{*}^{1 \rightarrow 3} \| \boldsymbol{S}^{1 \rightarrow 3}\right)+\frac{1}{K} JSD\left(\boldsymbol{S}_{*}^{3 \rightarrow 1} \| \boldsymbol{S}^{3 \rightarrow 1}\right).$
By pulling the continuous- and cross-frame matching results close together, we use the different association results mainly to mitigate the differences in the same target caused by occlusion.
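Below is a hedged sketch of the cross-contrast loss under the assumptions above; the temperature-softmax association helper mirrors the self-contrast sketch, and the epsilon for numerical stability is our addition.

```python
import torch

def associate(Xa, Xb, tau=0.1):
    """Row-softmax association matrix from frame a to frame b (assumed helper)."""
    return torch.softmax((Xa.t() @ Xb) / tau, dim=1)

def js_divergence(P, Q, eps=1e-8):
    """Row-wise Jensen-Shannon divergence between two matching distributions."""
    T = 0.5 * (P + Q)
    kl = lambda A, B: (A * (torch.log(A + eps) - torch.log(B + eps))).sum(dim=1)
    return 0.5 * kl(P, T) + 0.5 * kl(Q, T)

def cross_contrast_loss(X1, X2, X3, tau=0.1):
    """L_cc: indirect (via the middle frame) vs. direct cross-frame matching."""
    S_12, S_23 = associate(X1, X2, tau), associate(X2, X3, tau)
    S_32, S_21 = associate(X3, X2, tau), associate(X2, X1, tau)
    S_13_star = S_12 @ S_23                    # indirect 1 -> 3, shape (N, K)
    S_31_star = S_32 @ S_21                    # indirect 3 -> 1, shape (K, N)
    S_13, S_31 = associate(X1, X3, tau), associate(X3, X1, tau)
    N, K = X1.shape[1], X3.shape[1]
    return js_divergence(S_13_star, S_13).sum() / N \
         + js_divergence(S_31_star, S_31).sum() / K
```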
3.1.3 Ambiguity Contrast
MOT involves occluded, lost, and emerging objects, which interfere with the whole learning process. We investigate this problem and propose the ambiguity contrast module.
Based on the similarity between objects, we assume that objects with similarity greater than $\theta$ are the same object. The remaining objects with lower similarity are defined as ambiguous objects. The low similarity is mainly due to occlusion or to disappearance and appearance. In the occlusion case, an object with the same ID does exist, but the similarity decreases due to the absence of original features and the involvement of unrelated features. In the latter case, the similarity between a lost object and a newly emerged
Figure 4. Ambiguity Contrast. For brevity, we only show the maximum similarity in each row; certain objects have low similarity to all other targets, i.e., even their maximum similarity is below the threshold, as indicated by the red circles in the figure. The corresponding feature embeddings are extracted and then matched again.
object is low because there is truly no target that can match it.
Our proposed method for handling ambiguous objects during unsupervised training is shown in Figure 4. We find the ambiguous objects in $\boldsymbol{I}_{1}$ according to the similarity-based matching matrix $\boldsymbol{S}^{1 \rightarrow 2}$. Similarly, we obtain the ambiguous objects in $\boldsymbol{I}_{2}$ based on the matching result $\boldsymbol{S}^{2 \rightarrow 1}$. These objects are then subjected to similarity calculation again to get the similarity matrices $\boldsymbol{S}_{r}^{1 \rightarrow 2} \in \mathbb{R}^{D \times N_{r}}$ and $\boldsymbol{S}_{r}^{2 \rightarrow 1} \in \mathbb{R}^{D \times M_{r}}$, where $N_{r}$ and $M_{r}$ are the numbers of ambiguous objects in $\boldsymbol{I}_{1}$ and $\boldsymbol{I}_{2}$, respectively. Finally, the loss of the module is calculated by minimum entropy:
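To illustrate the selection step, here is a minimal sketch of picking ambiguous objects (row maximum below $\theta$) and re-matching them. Since the exact loss equation is not reproduced above, the row-entropy term below is a hedged assumption consistent with the minimum-entropy description; the threshold, temperature, and epsilon values are ours.

```python
import torch

def ambiguity_contrast_loss(X1, X2, theta=0.5, tau=0.1, eps=1e-8):
    """Select ambiguous objects (max similarity < theta) and re-match them.

    X1: (D, N), X2: (D, M) column-normalized embeddings; theta is the
    similarity threshold; tau and eps are assumed values.
    """
    S = X1.t() @ X2                                      # (N, M) cosine similarities
    amb1 = S.max(dim=1).values < theta                   # ambiguous objects in I_1
    amb2 = S.max(dim=0).values < theta                   # ambiguous objects in I_2
    X1r, X2r = X1[:, amb1], X2[:, amb2]                  # (D, N_r), (D, M_r)
    if X1r.shape[1] == 0 or X2r.shape[1] == 0:
        return X1.new_zeros(())                          # no ambiguous objects
    S_r = torch.softmax((X1r.t() @ X2r) / tau, dim=1)    # re-matching distribution
    # Minimum-entropy objective (assumed form): sharpen each row's assignment.
    return -(S_r * torch.log(S_r + eps)).sum(dim=1).mean()
```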