[1]\fnmFabio \surBellavia
[1]\orgnameUniversità degli Studi di Palermo, \countryItaly
[2]\orgnameThe Chinese University of Hong Kong, \countryChina
[3]\orgnameBruno Kessler Foundation, \countryItaly
Image Matching Filtering and Refinement
by Planes and Beyond
Abstract
This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that the motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, and non-conforming correspondences are discarded. Moreover, the underlying planar structural design provides an explicit map between the local patches associated with the matches, enabling an optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piecewise planar approximation assumption, a further strategy is designed to minimize relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical case where camera intrinsics are unavailable. Experimental results demonstrate that the proposed non-deep learning, geometry-based approach achieves performance that is superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still development potential for actual image matching solutions in the considered research direction, which could in the future be incorporated into novel deep image matching architectures.
keywords:
image matching, keypoint refinement, planar homography, normalized cross-correlation, SIFT, SuperGlue, LoFTR
1 Introduction
1.1 Background
Image matching constitutes the backbone of most higher-level computer vision tasks and applications that require image registration to recover the 3D scene structure, including the camera poses. Image stitching [1, 2], Structure-from-Motion (SfM) [2, 3, 4, 5], Simultaneous Localization and Mapping (SLAM) [6, 2], and, more recently, Neural Radiance Fields (NeRF) [7, 8] and Gaussian Splatting (GS) [9, 10] are probably the most common and widespread tasks critically relying on image matching nowadays.
The traditional image matching paradigm can be organized into a modular pipeline comprising keypoint extraction, feature description, and the proper correspondence matching [11]. This representation is neglected in modern deep end-to-end image matching methods due to their intrinsic design, which guarantees a better global optimization at the expense of the interpretability of the system. Notwithstanding that end-to-end deep networks represent in many respects the State-Of-The-Art (SOTA) in image matching, for instance in terms of the robustness and accuracy of the estimated poses in complex scenes [12, 13, 14, 15, 16, 17, 18], handcrafted image matching pipelines such as the popular Scale Invariant Feature Transform (SIFT) [19] are still employed today in practical applications [4, 2] due to their scalability, adaptability, and understandability. Furthermore, handcrafted or deep modules remain present to post-process the output of modern monolithic deep architectures and further filter the final matches, most notably through the RAndom SAmpling Consensus (RANSAC) [20] based on geometric constraints.
Image matching filters to prune correspondences are not limited to RANSAC, which in any of its forms [21, 22, 23, 24] still remains mandatory nowadays as the final step, and have evolved from handcrafted methods [25, 26, 27, 28, 29, 30, 31, 32, 33] to deep ones [34, 35, 36, 37, 38, 39, 40, 41]. In addition, methods inspired by the coarse-to-fine paradigm, such as the Local Feature TRansformer (LoFTR) [14], have more recently been developed to refine correspondences [42, 43, 44], which can be considered as the introduction of a feedback system in the pipelined interpretation of the matching process.
1.2 Main contribution
Within the above background, this paper contributes a modular approach to filter and refine matches as a post-process before RANSAC. The key idea is that the motion flow within the images, i.e. the correspondences, can be approximated by local overlapping homography transformations [45]. This is the core concept of traditional image matching pipelines, where local patches are approximated by less constrained transformations going from a fine to a coarse scale level. Specifically, besides the similarity and affine transformations, the planar homography is introduced at a further level. The whole approach proceeds as follows:
• The Multiple Overlapping Planes (MOP) module aggregates matches into soft clusters, each one related to a planar homography roughly representing the transformation of the included matches. On one hand, matches unable to fit inside any cluster are discarded as outliers; on the other hand, the associated planar map can be used to normalize the patches of the related keypoint correspondences. As more matches are involved in the homography estimation, the patch normalization process becomes more robust than the canonical ones used for instance in SIFT. Furthermore, homographies better adapt the warp, as they are more general than similarity or affine transformations. MOP is implemented by iterating RANSAC to greedily estimate the next best plane according to a multi-model fitting strategy. The main difference from previous similar approaches [46, 47] is in maintaining matches during the iterations in terms of strong and weak inliers, according to two different reprojection errors, to guide the cluster partitioning.
• The Middle Homography (MiHo) module additionally improves the patch normalization by further minimizing the relative patch distortion. More in detail, instead of reprojecting the patch of one image into the other image, taken as reference, according to the homography associated with the cluster, both corresponding patches are reprojected into a virtual plane chosen to be in the middle of the base homography transformation. The deformation is thus globally distributed between the two patches, reducing the patch distortion with respect to the original images due to interpolation, but also to the planar approximation.
• The Normalized Cross-Correlation (NCC) is employed following traditional template matching [48, 2] to improve the keypoint localization in the reference middle homography virtual plane, using one patch as a template to be localized on the other patch. The relative shift adjustment found on the virtual plane is finally back-projected into the original images. To improve the process, the template patch is perturbed by small rotations and anisotropic scaling changes for a better search in the solution space.
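As an illustration of the template matching step, the following minimal numpy sketch (integer shifts only, without the rotation and anisotropic scaling perturbations described above; function names are illustrative) localizes a template patch inside a larger search window by maximizing the NCC:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two same-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    den = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / den) if den > 0 else 0.0

def refine_shift(template, search, radius):
    """Find the integer (dy, dx) shift of `template` inside `search`
    maximizing the NCC, scanning offsets in [-radius, radius].
    `search` must be of size (h + 2*radius, w + 2*radius)."""
    h, w = template.shape
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            win = search[radius + dy: radius + dy + h,
                         radius + dx: radius + dx + w]
            s = ncc(template, win)
            if s > best:
                best, best_shift = s, (dy, dx)
    return best_shift
```

In the actual pipeline the shift found on the virtual plane would then be back-projected into the original images through the cluster homographies.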
MOP and MiHo are purely geometry-based, requiring only keypoint coordinates, while NCC relies on the image intensity in the local area of the patches. The early idea of MOP+MiHo+NCC can be found in [49]. Furthermore, the MOP overlapping plane exploitation concept was introduced in [50], while the MiHo definition was drafted in [51]. The proposed approach is handcrafted and hence more general, as it does not require re-training to adapt to a particular kind of scene, and its behavior is more interpretable, as it is not hidden by deep architectures. As reported and discussed later in this paper, RANSAC takes noticeable benefit from MOP+MiHo as a pre-filter in most configurations and scene kinds, and in any case does not deteriorate the matches. Concerning MOP+MiHo+NCC instead, keypoint localization accuracy seems to strongly depend on the kind of extracted keypoint. In particular, for corner-like keypoints, which are predominant in detectors such as the Keypoint Network (Key.Net) [52] or SuperPoint [53], NCC greatly improves the keypoint position, while for blob-like keypoints, predominant for instance in SIFT, NCC application can degrade the keypoint localization, probably due to the different nature of their surrounding areas. Nevertheless, the proposed approach indicates that there are still margins for improvement in current image matching solutions, even the deep-based ones, exploiting the analyzed methodology.
1.3 Additional contribution
As a further contribution, an extensive comparative evaluation is provided against eleven more and less recent SOTA deep and handcrafted image matching filtering methods, on both non-planar indoor and outdoor standard benchmark datasets [54, 55, 11] and planar scenes [56, 57]. Each filtering approach, including several combinations of the three proposed modules, has been tested to post-process matches obtained with seven different image matching pipelines and end-to-end networks, also considering distinct RANSAC configurations. Moreover, unlike the mainstream evaluation taken as the current standard [14], the proposed benchmark assumes that no camera intrinsics are available, reflecting a more general, realistic, and practical scenario. On this premise, the pose error accuracy is defined, analyzed, and employed with different protocols intended to highlight the specific complexity of the pose estimation working with fundamental and essential matrices [45], respectively. This analysis reveals the many-sided nature of the image matching problem and of its solutions.
1.4 Paper organization
The rest of the paper is organized as follows. Related work is presented in Sec. 2, the design of MOP, MiHo, and NCC is discussed in Sec. 3, and the evaluation setup and results are analyzed in Sec. 4. Finally, conclusions and future work are outlined in Sec. 5.
1.5 Additional resources
Code and data are freely available online at https://github.com/fb82/MiHo. Data also include high-resolution versions of the images, plots, and tables presented in this paper, available at https://github.com/fb82/MiHo/tree/main/data/paper.
2 Related work
2.1 Disclaimer and further references
The following discussion is far from exhaustive, due to more than three decades of active research and progress in the field, which is still ongoing. The reader can refer to [58, 59, 60, 61] for a more comprehensive analysis than that reported below, which covers only the methods discussed in the rest of the paper.
2.2 Non-deep image matching
Traditionally, image matching is divided into three main steps: keypoint detection, description, and proper matching.
2.2.1 Keypoint detectors
Keypoint detection is aimed at finding salient local image regions whose feature descriptor vectors are at the same time invariant to image transformations, i.e. repeatable, and distinguishable from other non-corresponding descriptors, i.e. discriminant. Clearly, discriminability decreases as invariance increases and vice versa. Keypoints are usually categorized as corner-like or blob-like, even if there is no clear separation between them outside ad-hoc ideal images. Corners are predominantly extracted by the Harris detector [62], while blobs by SIFT [19]. Several other handcrafted detectors exist besides the Harris and SIFT ones [63, 64, 65], but, excluding deep image matching, nowadays the former and their extensions are the most commonly adopted. The Harris and SIFT keypoint extractors rely respectively on the covariance matrix of the first-order derivatives and on the Hessian matrix of the second-order derivatives. Blobs and corners can be considered analogues of each other for different derivative orders since, in the end, a keypoint in both cases corresponds to a local maximum response to a filter which is a function of the determinant of the respective kind of matrix. Blob keypoint centers tend to be pushed away from the edges characterizing the region more than corners, due to the usage of higher-order derivatives. More recently, the joint extraction and refinement of robust keypoints has been investigated by the Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring (GMM-IKRS) [66]. The idea is to extract image keypoints after several homography warps and cluster them after back-projection into the input images, so as to select the best clusters in terms of robustness and deviation, as another perspective on the Affine SIFT (ASIFT) [67] strategy.
2.2.2 Keypoint descriptors
Feature descriptors aim at embedding the keypoint characterization into numerical vectors. There has been long-standing research interest in this area, both in terms of handcrafted methods and non-deep machine learning ones, which contributed actively to the current development and the progress made by current SOTA deep image matching. Among non-deep keypoint descriptors, nowadays only the SIFT descriptor is practically still used and widespread, due to its balance between computational efficiency, scaling, and matching robustness. The SIFT descriptor is defined as the gradient orientation histogram of the local neighborhood of the keypoint, also referred to as the patch. Many high-level computer vision applications, like COLMAP [4] in the case of SfM, are based on SIFT. Besides the SIFT keypoint detector & descriptor, the Oriented FAST and Rotated BRIEF (ORB) [68] is a pipeline worth mentioning that still survives today in the deep era. ORB is the core of the popular ORB-SLAM [6] due to its computational efficiency, even if it is less robust than SIFT. Nevertheless, it seems destined to be replaced by SIFT with the increased computational capability of current hardware [69].
2.2.3 Patch normalization
Patch normalization actually takes place in between the detection and the description of the keypoints, and it is critical for the effective computation of the descriptor to be associated with the keypoint. Patch normalization is implicitly hidden inside both keypoint extraction and description, and its task roughly consists of aligning and registering patches before descriptor computation [19]. The key idea is that complex image deformations can be locally approximated by simpler transformations as long as the considered neighborhood is small. Under these assumptions, SIFT normalizes the patch by a similarity transformation, where the scale is inferred by the detector and the dominant rotation is computed from the gradient orientation histogram. Affine covariant patch normalization [70, 67, 71] was indeed the natural evolution, which can improve the normalization in case of high degrees of perspective distortion. Notice that, in the case of only small scene rotations, up-right patches computed by neglecting orientation alignment can lead to better matches, since a strong geometric constraint limiting the solution space is imposed [72]. Concerning instead patch photometric consistency, affine normalization, i.e. by mean and standard deviation, is usually the preferred choice, being sufficiently robust in the general case [19].
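As a minimal sketch, the affine photometric normalization mentioned above amounts to mapping the patch to zero mean and unit standard deviation, which cancels any affine intensity change (the function name and the stabilizing `eps` term are illustrative):

```python
import numpy as np

def normalize_patch(p, eps=1e-8):
    """Affine photometric normalization: zero mean, unit standard
    deviation, so that patches related by p -> a*p + b (a > 0)
    become (nearly) identical."""
    return (p - p.mean()) / (p.std() + eps)
```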
2.2.4 Descriptor similarity and global geometric constraints
The last step is the proper matching, which mainly pairs keypoints by considering the similarity distance between the associated descriptors. Nearest Neighbor (NN) and Nearest Neighbor Ratio (NNR) [19] are the common and simple approaches, even if additional extensions have been developed [73, 32]. For many tasks, descriptor similarity alone does not provide satisfactory results, which can instead be achieved by including a further filtering post-process that constrains matches geometrically. RANSAC provides an effective method to discard outliers under the more general epipolar geometry constraints of the scene or in the case of planar homographies. The standard RANSAC continues to be extended in several ways [21, 22]. Worth mentioning are the Degenerate SAmpling Consensus (DegenSAC) [24], to avoid degenerate configurations, and the MArGinalized SAmpling Consensus (MAGSAC) [23], for defining a robust inlier threshold, which remains the most critical parameter.
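A minimal brute-force sketch of NNR matching (Lowe's ratio test), assuming Euclidean descriptor distances, can be written as follows (function and parameter names are illustrative):

```python
import numpy as np

def match_nnr(desc1, desc2, ratio=0.8):
    """Nearest Neighbor Ratio matching, brute force.

    desc1: (n1, d) descriptors of the first image;
    desc2: (n2, d) descriptors of the second image.
    Returns the (i, j) index pairs passing the ratio test."""
    # Pairwise Euclidean distances between all descriptors.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Keep the pair only if the best distance is sufficiently
        # smaller than the second best, i.e. the match is unambiguous.
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, int(best)))
    return matches
```

Plain NN matching corresponds to dropping the ratio condition and always keeping the nearest descriptor.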
2.2.5 Local geometric constraints
RANSAC exploits global geometric constraints. A more general and relaxed geometric assumption than RANSAC's is to consider only local neighborhood consistency across the images, as done in Grid-based Motion Statistics (GMS) [26] using square neighborhoods, but circular [25] and affine-based [27, 28] neighborhoods or the Delaunay triangulation [32, 28] have also been employed. In most cases, these approaches require an initial set of robust seed matches for initialization, whose selection may become critical [74], while other approaches explicitly exploit descriptor similarities to rank matches [32]. In this respect, the vanilla RANSAC and GMS are pure geometric image filters. Notice that descriptor similarities have also been exploited successfully within RANSAC [75]. Closely connected to local filters, another class of methods is designed instead to estimate and interpolate the motion field upon which to check match consistency [29, 31, 30]. Local neighborhood consistency can also be framed within RANSAC, as done by Adaptive Locally-Affine Matching (AdaLAM) [33], which runs multiple local affine RANSACs inside the circular neighborhoods of seed matches. The AdaLAM design can be related to GroupSAC [76], which draws hypothesis model sample correspondences from different clusters, and to the more general multi-model fitting problem, for which several solutions have been investigated [77, 78]. Among these approaches, the recent Consensus Clustering (CC) [79] has been developed for the specific case of multiple planar homographies.
2.3 Deep image matching
2.3.1 Standalone pipeline modules
The deep learning evolution has provided a boost to image matching, as well as to other computer vision research fields, thanks to the evolution of deep architectures, of hardware and software computational capability and design, and to the huge amount of datasets now available for training. Hybrid pipelines arose first, replacing handcrafted descriptors with deep ones, as the feature descriptor module is the most natural and immediate to be replaced in the pipeline evolution. SOTA in this sense is represented by the Hard Network (HardNet) [80] and similar networks [81, 82], trained by contrastive learning with hard mining. Concerning keypoint extraction, better results with respect to the handcrafted counterparts were achieved by driving the network to mimic handcrafted detector design, as for Key.Net [52]. Deep learning has been successfully applied to design standalone patch normalization modules to estimate the reference patch orientation [83, 84] or more complete affine transformations, as in the Affine Network [84]. The mentioned deep modules have been employed successfully in many hybrid matching pipelines, especially in conjunction with SIFT [11] or Harris corners [85].
2.4 Joint detector & descriptor networks
A relevant characteristic of deep image matching, which made it able to exceed handcrafted image matching pipelines, was gained by allowing the concurrent optimization of both the detector & descriptor. SuperPoint [53] integrates both of them in a single network and does not require a separate yet consecutive training of the two as in [86]. A further design feature that enabled SuperPoint to become a reference in the field is the use of homographic adaptation, which consists of the use of planar homographies to generate corresponding images during the training, allowing self-supervised training. Further solutions were proposed by architectures similar to SuperPoint, as in [87]. Moreover, the Accurate and Lightweight Keypoint Detection and Descriptor Extraction (ALIKE) [88] network implements a differentiable Non-Maximum Suppression (NMS) for the keypoint selection on the score map, and the DIScrete Keypoints (DISK) [89] network employs reinforcement learning. The ALIKE design is further improved by A LIghter Keypoint and descriptor Extraction network with Deformable transformation (ALIKED) [13], using deformable local features for each sparse keypoint instead of dense descriptor maps, resulting in computational savings. Alternative rearrangements of the pipeline structure, using the same sub-network for both detector & descriptor (detect-and-describe) [90], or extracting descriptors and then keypoints (describe-then-detect) [91] instead of the common approach (detect-then-describe), have also been investigated. More recently, the idea of decoupling the two blocks by formulating the matching problem in terms of 3D tracks in large-scale SfM has been applied in Detect, Don't Describe–Describe, Don't Detect (DeDoDe) [92] and DeDoDe v2 [93] with interesting results.
2.4.1 Full end-to-end architectures
Fully end-to-end deep image matching architectures arose with SuperGlue [94], which incorporates a proper matching network exploiting self and cross attention through transformers to jointly encode keypoint positions and visual data into a graph structure, and a differentiable form of the Sinkhorn algorithm to provide the final assignments. SuperGlue represents a breakthrough in the area and has been extended by the recent LightGlue [12], improving its efficiency, accuracy, and training process. Detector-free end-to-end image matching architectures have also been proposed, which avoid explicitly computing a sparse keypoint map, considering instead a semi-dense strategy. The Local Feature TRansformer (LoFTR) [14] is a competitive realization of these matching architectures, which deploys a coarse-to-fine strategy using self and cross attention layers on dense feature maps, and has been employed as a base for further design extensions [15, 16]. Notice that end-to-end image matching networks are not fully rotationally invariant and can handle only relatively small scene rotations, except for specific architectures designed with this purpose [95, 96, 97]. The coarse-to-fine strategy following the LoFTR design has also been applied to hybrid canonical pipelines with only deep patch and descriptor modules, leading to comparable matching results in challenging scenarios at an increased computational cost [50]. More recently, dense image matching networks that directly estimate the scene as dense point maps [17] or employ Gaussian processes [18] achieve SOTA results in many benchmarks, but are more computationally expensive than sparse or semi-dense methods.
2.4.2 Outlier rejection with geometric constraints
The first effective deep matching filter module was achieved with the introduction of context normalization [34], i.e. the normalization by mean and standard deviation, which guarantees the preservation of permutation equivariance. The Attentive Context Network (ACNe) [35] further includes local and global attention to exclude outliers from context normalization, while the Order-Aware Network (OANet) [36] adds additional layers to learn how to cluster unordered sets of correspondences so as to incorporate the data context and the spatial correlation. The Consensus Learning Network (CLNet) [37] prunes matches by filtering data according to local-to-global dynamic neighborhood consensus graphs in consecutive rounds. The Neighbor Consistency Mining Network (NCMNet) [38] improves the CLNet architecture by considering different neighborhoods in the feature space, in the keypoint coordinate space, and in a further global space that merges both the previous ones through cross-attention. More recently, the Multiple Sparse Semantics Dynamic Graph Network (MS2DG-Net) [39] designs the filtering in terms of a neighborhood graph through transformers [98], to capture semantically similar structures in the local topology among correspondences. Unlike previous deep approaches, ConvMatch [40] builds a smooth motion field by making use of Convolutional Neural Network (CNN) layers to verify the consistency of the matches. DeMatch [41] refines the ConvMatch motion field estimation for a better accommodation of scene disparities by decomposing the rough motion field into several sub-fields. Finally, notice that deep differentiable RANSAC modules have been investigated as well [99, 100].
2.4.3 Keypoint refinement
The coarse-to-fine strategy trend has pushed the research towards solutions focusing on the refinement of keypoint localization as a standalone module after the coarse assignment of the matches. Worth mentioning are Pixel-Perfect SfM [101], which learns to refine multi-view tracks based on their photometric appearance and the SfM structure, and Patch2Pix [42], which refines and filters patches exploiting the image dense features extracted by the network backbone. Similar to Patch2Pix, Keypt2Subpx [43] is a recent lightweight module to refine matches only, requiring as input, besides the keypoint positions, the descriptors of the local heat map of the related correspondences. Finally, the Filtering and Calibrating Graph Neural Network (FC-GNN) [44] is an attention-based Graph Neural Network (GNN) jointly leveraging contextual and local information to both filter and refine matches. Unlike Patch2Pix and Keypt2Subpx, it does not require feature maps or descriptor confidences as input, and neither does the proposed MOP+MiHo+NCC. Similarly to FC-GNN, the proposed approach uses both global context and local information to refine matches, respectively to warp patches and to check photometric consistency, but differently from FC-GNN it currently employs only global contextual data for filtering, since it does not implement any feedback system to discard matches based on patch similarity after the alignment refinement.
粗到细策略的趋势推动了将关键点定位细化作为粗匹配分配后独立模块的研究。值得一提的是 Pixel-Perfect SfM [101],它学习根据光度外观和 SfM 结构来细化多视图轨迹,而 Patch2Pix [42]则利用网络主干提取的图像密集特征来细化并过滤补丁。与 Patch2Pix 类似,Keypt2Subpx [43]是一个近期轻量级模块,仅对需要细化的匹配进行处理,输入包括相关对应关系的局部热图描述符以及关键点位置。最后,过滤与校准图神经网络(FC-GNN)[44]是一种基于注意力的图神经网络(GNN),联合利用上下文和局部信息来过滤和细化匹配。与 Patch2Pix 和 Keypt2Subpx 不同,FC-GNN 不要求输入特征图或描述符置信度,正如所提出的 MOP+MiHo+NCC 方法一样。 与 FC-GNN 类似,所提出的方法同时利用全局上下文和局部信息来优化匹配,分别用于扭曲和对光度一致性进行检查,但与 FC-GNN 不同的是,它目前仅采用全局上下文数据,因为尚未实现任何反馈系统在对齐优化后基于块相似性丢弃匹配。
3 Proposed approach 3 提出的方法
3.1 Multiple Overlapping Planes (MOP)
3.1.1 Preliminaries
MOP assumes that the motion flow within the images can be piecewise approximated by local planar homographies.
Representing, with an abuse of notation, the reprojection of a point $\mathbf{x}$ through a non-singular planar homography $H$ as $H\mathbf{x}$, a match $m = (\mathbf{x}_1, \mathbf{x}_2)$ within two keypoints $\mathbf{x}_1$ and $\mathbf{x}_2$ is considered an inlier based on the maximum reprojection error
$\varepsilon_{H}(m) = \max\left( \lVert \mathbf{x}_2 - H\mathbf{x}_1 \rVert, \lVert \mathbf{x}_1 - H^{-1}\mathbf{x}_2 \rVert \right)$ (1)
For a set of matches $\mathcal{M}$, the inlier subset is provided according to a threshold $t$ as
$\mathcal{M}_{H}^{t} = \{\, m \in \mathcal{M} : \varepsilon_{H}(m) \leq t \,\}$ (2)
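As a concrete illustration, the symmetric maximum reprojection error of Eq. 1 and the thresholded inlier selection of Eq. 2 can be sketched as follows (a minimal NumPy sketch; the function names are ours, not part of the method's reference implementation):

```python
import numpy as np

def reproject(H, pts):
    """Reproject 2D points of shape (N, 2) through a 3x3 homography."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    q = ph @ H.T
    return q[:, :2] / q[:, 2:3]

def max_reproj_error(H, x1, x2):
    """Symmetric maximum reprojection error (Eq. 1) for matches (x1[i], x2[i])."""
    e12 = np.linalg.norm(reproject(H, x1) - x2, axis=1)
    e21 = np.linalg.norm(reproject(np.linalg.inv(H), x2) - x1, axis=1)
    return np.maximum(e12, e21)

def inlier_mask(H, x1, x2, t):
    """Boolean mask of matches with maximum reprojection error within t (Eq. 2)."""
    return max_reproj_error(H, x1, x2) <= t
```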
A proper scene planar transformation must be quasi-affine, in order to preserve the convex hull [45] with respect to the minimum sampled model generating $H$ in RANSAC, so the inlier set effectively employed is
$\widehat{\mathcal{M}}_{H}^{t} = \mathcal{M}_{H}^{t} \cap \mathcal{Q}_{H}$ (3)
where $\mathcal{Q}_{H}$ is the set of matches in $\mathcal{M}$ satisfying the quasi-affinity property with respect to the RANSAC minimum model of $H$, described later in Eq. 12.
3.1.2 Main loop
MOP iterates through a loop where a new homography is discovered and the match set is reduced for the next iteration. Iterations halt when the counter of sequential failures at the end of a cycle reaches the maximum allowable value. If not incremented at the end of the cycle, the failure counter is reset to 0. As output, MOP returns the set of found homographies, which covers the image visual flow.
Starting from the initial set of input matches provided by the image matching pipeline, at each iteration MOP looks for the best planar homography compatible with the current match set according to a relaxed inlier threshold. When the number of inliers provided by the candidate homography is less than the minimum required amount, i.e.
(4)
the current homography is discarded and the failure counter is incremented, without updating the iteration counter, the next match set, or the output list. Otherwise, MOP removes the inlier matches of the found homography according to a removal threshold for the next iteration
(5)
where the removal threshold adapts as described next, and updates the output list
(6)
The removal threshold adapts to the evolution of the homography search: it is equal to a strict threshold in case the cardinality of the inlier set under the strict threshold is greater than half of the relaxed inlier set, i.e. in the worst case no more than half of the added matches should belong to previous overlapping planes. If this condition does not hold, the removal threshold turns into the relaxed one and the failure counter is incremented
(7)
Using the relaxed threshold to select the best homography is required since the real motion flow is only locally approximated by planes, and removing only strong inliers for the next iteration guarantees smooth changes within overlapping planes and more robustness to noise. Nevertheless, in case of slow convergence, or when the homography search gets stuck around a wide overlapping or noisy planar configuration, limiting the search to matches excluded by previous planes can provide a way out. The final set of homographies returned is
(8)
found by the last iteration, reached when the failure counter hits its maximum allowable value.
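The main loop just described can be summarized by the following skeleton (a simplified sketch with illustrative parameter names and default values; the RANSAC fitting is abstracted away as a callback returning a homography and the per-match reprojection errors):

```python
import numpy as np

def mop_loop(matches, fit_homography, t_relaxed=2.0, t_strict=1.0,
             min_inliers=12, max_failures=3):
    """Greedy multi-plane extraction: repeatedly fit a homography to the
    remaining matches, keep it if it gathers enough inliers, and remove
    its (strong) inliers before the next round."""
    homographies, failures = [], 0
    current = matches
    while failures < max_failures and len(current) >= min_inliers:
        H, err = fit_homography(current)     # RANSAC on the current set
        relaxed = err <= t_relaxed
        if relaxed.sum() < min_inliers:
            failures += 1                    # discard H, count a failure
            continue
        strict = err <= t_strict
        if strict.sum() > relaxed.sum() / 2:
            remove = strict                  # adaptive removal threshold
            failures = 0                     # reset on a clean success
        else:
            remove = relaxed                 # slow convergence detected
            failures += 1
        homographies.append(H)
        current = current[~remove]
    return homographies
```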
3.1.3 Inside RANSAC
Within each MOP iteration, a homography is extracted by RANSAC on the current match set. The vanilla RANSAC is modified according to the following heuristics to improve robustness and avoid degenerate cases.
A minimum number of RANSAC loop iterations is required besides the maximum number of iterations. In addition, a minimum hypothesis sampled model
(9)
is discarded if the distance within the keypoints in one of the images is less than the relaxed threshold, i.e. it must hold that
(10)
Furthermore, with the corresponding homography derived by the normalized 8-point algorithm [45], the smallest singular value found by solving the associated homogeneous system through SVD must be sufficiently greater than zero for a stable solution not close to degeneracy. According to these observations, the minimum singular value is constrained to be greater than 0.05. Note that an absolute threshold is feasible since the data are normalized before the SVD.
Next, matches in the sampled model must satisfy quasi-affinity with respect to the derived homography. This can be verified by checking that the signs of the last coordinate of the reprojections in non-normalized homogeneous coordinates are the same for all the four keypoints in the first image. This can be expressed as
(11)
where the subscript denotes the third and last element of the point vector in non-normalized homogeneous coordinates. Following the above discussion, the quasi-affine set of matches in Eq. 3 is formally defined as
(12)
For better numerical stability, the analogous checks are also executed in the reverse direction through the inverse homography.
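The quasi-affinity sign test can be sketched as follows (our own minimal formulation: the sign of the last homogeneous coordinate after reprojection must agree with the consistent sign of the sampled model points):

```python
import numpy as np

def last_coord_sign(H, pts):
    """Sign of the third homogeneous coordinate of H @ [x, y, 1]^T."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return np.sign((ph @ H.T)[:, 2])

def quasi_affine_mask(H, pts, sample_pts):
    """Matches in pts pass if their reprojection sign agrees with the
    (consistent) sign of the RANSAC minimal sample points."""
    ref = last_coord_sign(H, sample_pts)
    if not np.all(ref == ref[0]):
        return np.zeros(len(pts), dtype=bool)  # degenerate sample: reject all
    return last_coord_sign(H, pts) == ref[0]
```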
Lastly, to speed up the RANSAC search, a buffer is globally maintained to contain the top discarded sub-optimal homographies encountered within a RANSAC run. For the current match set, with the current best RANSAC model maximizing the number of inliers, the buffered homographies are ordered by their number of inliers, excluding the inliers compatible with the previous homographies in the buffer, i.e. it must hold that
(13)
where
(14)
The buffer can be updated when the number of inliers of the homography associated with the current sampled hypothesis is greater than the buffer minimum, by inserting the new homography, removing the lowest-scoring one, and re-sorting the buffer. Moreover, at the beginning of a RANSAC run inside a MOP iteration, the minimum hypothesis model sampling sets corresponding to the homographies in the buffer are evaluated before proceeding with true random samples, to provide a bootstrap for the solution search. This also guarantees a global sequential inspection of the overall solution search space within MOP.
3.1.4 Homography assignment
After the final set of homographies is computed, the filtered match set is obtained by removing all matches for which there is no homography in the set having them as inliers, i.e.
(15)
Next, a homography must be associated with each surviving match, which will be used for patch normalization.
A possible choice is to assign the homography which gives the minimum reprojection error, i.e.
(16)
However, this choice is likely to select homographies corresponding to narrow planes in the image with small consensus sets, which leads to a low reprojection error but a more unstable and noise-prone assignment.
Another choice is to assign to the match the compatible homography with the widest consensus set, i.e.
(17)
which points to more stable and robust homographies, but can lead to incorrect patches when the corresponding reprojection error is not among the minimum ones.
The homography actually chosen for the match association provides a compromise between the above. Specifically, for a match, defining the reference count as the median of the inlier counts of its top 5 compatible homographies ranked by number of inliers, the selected homography is the compatible one with the minimum reprojection error among those with an inlier number at least equal to this median
(18)
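The selection rule can be sketched as follows (illustrative NumPy sketch of the heuristic described above; `errs` and `n_inliers` are the per-homography reprojection errors and inlier counts for a single match, and the function name is ours):

```python
import numpy as np

def assign_homography(errs, n_inliers, top=5):
    """For one match, pick the compatible homography with minimum
    reprojection error among those whose inlier count is at least the
    median inlier count of the top-`top` compatible homographies."""
    errs = np.asarray(errs, dtype=float)
    n = np.asarray(n_inliers, dtype=float)
    top_counts = np.sort(n)[::-1][:top]      # largest inlier counts first
    thresh = np.median(top_counts)
    candidates = np.where(n >= thresh)[0]    # stable-enough homographies
    return candidates[np.argmin(errs[candidates])]
```

Note how a homography with the globally minimum error but a tiny consensus set is rejected in favor of a slightly worse but far more supported one.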
Figure 1 shows an example of the achieved solution directly in combination with the MiHo homography estimation described hereafter in Sec. 3.2, since MOP and MOP+MiHo do not present appreciable differences in the visual cluster representation. Keypoints belonging to discarded matches are marked with black diamonds, while the clusters highlighted by other combinations of markers and colors indicate the resulting filtered matches with the selected planar homography. Notice that clusters are not designed to highlight scene planes, but points that move according to the same homography transformation within the image pair.
3.2 Middle Homography (MiHo)
3.2.1 Rationale
The homography estimated by Eq. 18 for a match allows reprojecting the local patch neighborhood of one keypoint onto the corresponding one centered on the other keypoint. Nevertheless, this homography only approximates the true transformation within the images, and becomes less accurate as the distance from the keypoint center increases. This implies that patch alignment could be invalid for wider, and theoretically more discriminant, patches.
The idea behind MiHo is to break the homography in the middle into two homographies whose composition gives the original one, so that the patch in the first image gets deformed less than by the full homography, and likewise for the patch in the second image. This means that, visually, a unit-area square on the reprojected patch should remain almost similar to the original one. As this must hold for both images, the deformation error must be distributed almost equally between the two homographies. Moreover, since interpolation degrades with up-sampling, MiHo aims to balance the down-sampling of the patch at the finer resolution and the up-sampling of the patch at the coarser resolution.
3.2.2 Implementation
While a strict analytical formulation of the above problem leading to a practical satisfying solution is not easy to derive, the heuristic found in [51] can be exploited to modify RANSAC within MOP to account for the required constraints. Specifically, each match $(\mathbf{x}_1, \mathbf{x}_2)$ is replaced by the two corresponding matches $(\mathbf{x}_1, \mathbf{x}_m)$ and $(\mathbf{x}_2, \mathbf{x}_m)$, where $\mathbf{x}_m$ is the midpoint of the two keypoints
$\mathbf{x}_m = \tfrac{1}{2}(\mathbf{x}_1 + \mathbf{x}_2)$ (19)
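The match splitting can be sketched as follows (minimal sketch, with our own function name):

```python
import numpy as np

def split_by_midpoint(x1, x2):
    """Replace each match (x1, x2) by the two half-matches (x1, xm) and
    (x2, xm) through the midpoint xm = (x1 + x2) / 2, so that RANSAC can
    fit the two MiHo homographies toward the common middle plane."""
    xm = (np.asarray(x1) + np.asarray(x2)) / 2.0
    return (x1, xm), (x2, xm)
```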
Hence, the RANSAC input match set for the MOP iteration is split into two match sets, one per image. Within a RANSAC iteration, a sample defined by Eq. 9 is likewise split into
(20)
leading to two concurrent homographies to be verified simultaneously with the inlier set, in an analogous way to Eq. 3
(21)
where, for two generic match sets, the join operator rejoins split matches according to
(22)
All the other MOP steps defined in Sec. 3.1, including the RANSAC degeneracy checks, the threshold adaptation, and the final homography assignment, follow straightforwardly in an analogous way. Overall, the principal difference is that a pair of homographies is obtained instead of a single one, operating simultaneously on the two split match sets instead of the original one.
Figure 2 shows the differences when applying MiHo to align planar scenes with respect to directly reprojecting onto any of the two images considered as reference. It can be noted that the distortion differences are overall reduced, as highlighted by the corresponding grid deformation. Moreover, the MiHo homography strongly resembles an affine transformation, or even a similarity, i.e. a rotation and scale change only, imposing in some sense a tighter constraint on the transformation than the base homography. To balance this additional restraint with respect to the original MOP, in MOP+MiHo the minimum required amount of inliers for incrementing the failure counter defined by Eq. 4 has been relaxed after experimental validation.
3.2.3 Translation invariance
MiHo is invariant to translations as long as the algorithm used to extract the homography provides the same invariance. This can be easily verified: when translation vectors are respectively added to all the keypoints of the two images, the corresponding midpoints are of the form
(23)
where a fixed translation vector is added to all the corresponding original midpoints. Overall, translating the keypoints in the respective images has the effect of adding a fixed translation to the corresponding midpoints. Hence, MiHo is invariant to translations, since the normalization employed by the normalized 8-point algorithm is invariant to similarity transformations [45], including translations.
3.2.4 Fixing rotations
MiHo is not rotation invariant. Figure 3 illustrates how midpoints change with rotations. In the ideal MiHo configuration, where the images are upright, the area of the image formed in the midpoint plane lies within the areas of the original ones, forming a single cone when the images are considered as stacked. The area of the midpoint plane is lower than the minimum area between the two images in the presence of a relative rotation, and could degenerate to the apex of the corresponding double cone. A simple strategy to get an almost-optimal configuration was experimentally verified to work well in practice, under the observation that images acquired by cameras tend to have a relative rotation close to a multiple of 90°.
In detail, given an input match, consider the midpoint obtained after rotating the second image by a given multiple of 90°. One has to choose the rotation maximizing the number of configurations for which the midpoint inter-distances lie between the corresponding keypoint inter-distances
(24)
where
(25)
and [·] is the Iverson bracket. The global orientation estimation can be computed efficiently on the initial input matches before running MOP, adjusting the keypoints accordingly. The final homographies can be adjusted in turn by removing the rotation previously applied. Figure 4 shows the MOP+MiHo result obtained on a set of matches without and with the orientation pre-processing, highlighting the better matches obtained in the latter case.
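The orientation selection heuristic of Sec. 3.2.4 can be sketched as follows (our own simplification: the index pairing used to sample inter-distances is illustrative, and the function name is not from the reference implementation):

```python
import numpy as np

def best_right_angle_rotation(x1, x2):
    """Pick the rotation of the second image among multiples of 90 degrees
    that maximizes how often the midpoint inter-distances fall between the
    corresponding keypoint inter-distances of the two images."""
    n = len(x1)
    i = np.arange(n)
    j = np.roll(i, 1)                      # compare consecutive keypoints
    rots = {0: np.eye(2),
            90: np.array([[0.0, -1.0], [1.0, 0.0]]),
            180: -np.eye(2),
            270: np.array([[0.0, 1.0], [-1.0, 0.0]])}
    d1 = np.linalg.norm(x1[i] - x1[j], axis=1)
    best_ang, best_score = 0, -1
    for ang, R in rots.items():
        y2 = x2 @ R.T                      # rotate second image keypoints
        xm = (x1 + y2) / 2.0               # MiHo midpoints
        dm = np.linalg.norm(xm[i] - xm[j], axis=1)
        d2 = np.linalg.norm(y2[i] - y2[j], axis=1)
        lo, hi = np.minimum(d1, d2), np.maximum(d1, d2)
        score = int(np.sum((lo <= dm) & (dm <= hi)))
        if score > best_score:
            best_ang, best_score = ang, score
    return best_ang
```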
Figure 3: MiHo midpoints and rotations. The blue and red quadrilaterals are related by the homography defined by the four corner correspondences, indicated by dashed gray lines. The midpoints of the corresponding corners determine the reference green quadrilateral and the two MiHo planar homographies relating it to the original quadrilaterals. In the optimal MiHo configuration (a), the area of the quadrilateral defined by the midpoints is maximized. With an incremental relative rotation within the two original quadrilaterals, this area decreases from (b) to its minimum in (c). Rotating further from the minimum, the area increases again through (d) up to the maximum. As a heuristic, in the optimal configuration the distance between any two midpoint corners should lie within the range of the distances between the corresponding original image corners, see Sec. 3.2.4. Note that, for specific examples, in the worst case the plane orientation constraint could also be violated [45]. Best viewed in color and zoomed in.
(a) MOP+MiHo without rotation fixing
(b) MOP+MiHo with rotation fixed
Figure 4: MOP+MiHo visual clusters (4(a)) without handling rotations and (4(b)) when handling a worst-case relative rotation as described in Sec. 3.2.4. The same notation and image pair of Fig. 1 are used. Although the results after handling rotations are clearly better, they are not identical to Fig. 1, since the input matches provided by Key.Net differ, as OriNet [84] has been further included in the pipeline to compute the patch orientations. Notice that the base MOP achieves results similar to (4(b)) without requiring any special handling. Best viewed in color and zoomed in.
3.3 Normalized Cross Correlation (NCC)
3.3.1 Local image warping
NCC is employed to effectively refine the corresponding keypoint localization. The approach assumes that corresponding keypoints have been roughly aligned locally by the transformations provided as a homography pair. In the warped aligning space, template matching [48] is employed to refine the translation offsets so as to maximize the NCC peak response when the patch of one keypoint is employed as a filter on the image of the corresponding keypoint.
For a match within the two images there can be multiple potential choices of the warping homography pair, denoted by the set of associated homography pairs. In detail, according to the previous steps of the matching pipeline, one candidate is the base homography pair, defined as
• the identity matrix pair, when no local neighborhood data are provided, e.g. for SuperPoint;
• the pair of affine or similarity transformations, in the case of an affine covariant detector & descriptor, as for AffNet or SIFT respectively;
and the other is the extended pair, defined as
• the homography of Eq. 18 paired with the identity, in case MOP is used without MiHo;
• the MiHo homography pair, as described in Sec. 3.2.1, in the case of MOP+MiHo;
• the base pair, otherwise.
Since the found transformations are approximated, the extended homography pair is perturbed in practice by small shear factors and rotation changes through the affine transformations
(26)
with
(27)
(28)
so that
(29)
3.3.2 Computation
Let us denote a generic image by $I$, the intensity value of $I$ at a generic pixel $\mathbf{p}$ as $I(\mathbf{p})$, and the window radius extension as $r$, defined by
(30)
such that the squared window set of offsets is
$\mathcal{W} = \{-r, \ldots, r\} \times \{-r, \ldots, r\}$ (31)
with cardinality equal to
$|\mathcal{W}| = (2r+1)^2$ (32)
The squared patch of $I$ centered at $\mathbf{p}$ with radius $r$ is the pixel set
$P_I(\mathbf{p}) = \{\, \mathbf{p} + \mathbf{d} : \mathbf{d} \in \mathcal{W} \,\}$ (33)
while the mean intensity value of the patch is
$\mu_I(\mathbf{p}) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} I(\mathbf{p} + \mathbf{d})$ (34)
and the variance is
$\sigma^2_I(\mathbf{p}) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} \left( I(\mathbf{p} + \mathbf{d}) - \mu_I(\mathbf{p}) \right)^2$ (35)
The intensity value normalized by the mean and standard deviation of the patch, for a pixel $\mathbf{p} + \mathbf{d}$ in $P_I(\mathbf{p})$, is
$\hat{I}_{\mathbf{p}}(\mathbf{d}) = \frac{I(\mathbf{p} + \mathbf{d}) - \mu_I(\mathbf{p})}{\sigma_I(\mathbf{p})}$ (36)
which is ideally robust to affine illumination changes [48]. Lastly, the similarity between two patches $P_{I_1}(\mathbf{p}_1)$ and $P_{I_2}(\mathbf{p}_2)$ is
$\mathrm{NCC}(\mathbf{p}_1, \mathbf{p}_2) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} \hat{I}_{1,\mathbf{p}_1}(\mathbf{d})\, \hat{I}_{2,\mathbf{p}_2}(\mathbf{d})$ (37)
so that, for a match and the set of associated homography pairs derived in Sec. 3.3.1, the refined keypoint offsets and the best aligning homography pair are given by
(38)
where the images are warped by the respective homographies, the pixel coordinates are transformed accordingly, and
(39)
is the set of offset translations to check, considering in turn one of the images as template, since this is not a symmetric process. Notice that the two offsets cannot both be non-zero, according to Eq. 39. The final refined match
(40)
is obtained by reprojecting back onto the original images, i.e.
(41)
The normalized cross-correlation can be computed efficiently by convolution, where bilinear interpolation is employed to warp the images. The patch radius, which is also the offset search radius, has been set experimentally. According to preliminary experiments, a wider radius extension can break the planar neighborhood assumption and the patch alignment.
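A toy single-match sketch of the template matching step follows, operating on already-warped grayscale images; the actual implementation runs by convolution with bilinear warping, as stated above, and the parameter values here are illustrative only:

```python
import numpy as np

def ncc_refine(img1, img2, p1, p2, r=5, search=5):
    """Translate p2 within +/- search px so as to maximize the normalized
    cross-correlation between the (2r+1)^2 patch around p1 in img1 (the
    template) and the candidate patch in img2. Returns the best integer
    offset (dx, dy) and the corresponding NCC score."""
    def patch(img, p):
        x, y = int(p[0]), int(p[1])
        return img[y - r:y + r + 1, x - r:x + r + 1].astype(float)

    def normalize(P):                        # Eq.-36-style normalization
        return (P - P.mean()) / (P.std() + 1e-12)

    T = normalize(patch(img1, p1))
    best_score, best_off = -np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            P = normalize(patch(img2, (p2[0] + dx, p2[1] + dy)))
            score = float(np.mean(T * P))    # Eq.-37-style similarity
            if score > best_score:
                best_score, best_off = score, (dx, dy)
    return best_off, best_score
```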
3.3.3 Sub-pixel precision
The refinement offset can be further enhanced by parabolic interpolation, adding sub-pixel precision [102]. Notice that other forms of interpolation have been investigated, but they did not provide any sub-pixel accuracy improvements according to previous experiments [49]. The NCC response map in the best warped space, with origin at the peak value, can be written as
(42)
computed as in Sec. 3.3.2. The sub-pixel refinement offset in the horizontal direction is computed as the vertex of a parabola fitted on the horizontal neighborhood centered at the peak of the maximized NCC response map, i.e.
(43)
and analogously for the vertical offset. Explicitly, denoting by $s(\cdot)$ the response map restricted to the horizontal neighborhood of the peak, the sub-pixel refinement offset is
$\delta = \frac{s(-1) - s(1)}{2\left(s(-1) - 2\,s(0) + s(1)\right)}$ (44)
and, following Sec. 3.3.2, the final refined match
(45)
becomes
(46)
where the Iverson bracket zeroes the sub-pixel offset increment in the reference image, according to Eq. 39.
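The one-dimensional parabolic refinement can be sketched as follows (the function name is ours; the same computation is applied independently to the horizontal and vertical directions):

```python
def parabola_peak_offset(s_minus, s_0, s_plus):
    """Sub-pixel offset of a discrete peak from the three response values
    at -1, 0 and +1: the vertex of the interpolating parabola."""
    denom = s_minus - 2.0 * s_0 + s_plus
    if denom == 0.0:            # flat neighborhood: no refinement
        return 0.0
    return 0.5 * (s_minus - s_plus) / denom
```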
4 Evaluation
4.1 Setup
To ease the reading, each pipeline module or error measure included in the results is highlighted in bold.
4.1.1 Base pipelines
Seven base pipelines have been tested to provide the input matches to be filtered. For each of these, the maximum number of keypoints was set to 8000 to retain as many matches as possible. This leads to including more outliers, better assessing the filtering potential. Moreover, nowadays benchmarks and deep matching architecture designs assume upright image pairs, where the relative orientation between images is roughly bounded. This additional constraint allows retrieving better initial matches in common user acquisition scenarios. Current general standard benchmark datasets, including those employed in this evaluation, are built so that the image pairs do not violate this constraint. Notice that many SOTA joint and end-to-end deep architectures do not tolerate strong rotations within images by design.
SIFT+NNR [19] is included as the standard and reference handcrafted pipeline. The OpenCV [103] implementation was used for SIFT, exploiting RootSIFT [104] for descriptor normalization, while the NNR implementation is provided by Kornia [105]. To stress the match filtering robustness of the successive steps of the matching pipeline, which is the goal of this evaluation, the NNR ratio threshold was set rather high, i.e. to 0.95, while commonly adopted values are lower and depend on the scene complexity, with higher values yielding less discriminative matches and thus possible outliers. Moreover, upright patches are employed by zeroing the dominant orientation, for a fair comparison with other recent deep approaches.
Key.Net++NNR [11], a modular pipeline that achieves good results in common benchmarks, is also taken into account. Excluding the NNR stage, it can be considered a modular deep pipeline. The Kornia implementation is used for the evaluation. As for SIFT, the NNR threshold is set very high, i.e. to 0.99, while more common values are lower. The deep orientation estimation module OriNet [84], generally siding AffNet, is not included, so as to provide upright matches.
The other base pipelines considered are SOTA fully end-to-end matching networks. In detail, these are SuperPoint+LightGlue, DISK+LightGlue and ALIKED+LightGlue, as implemented in [12]. The input images are rescaled so that the smaller size is 1024 px, following the LightGlue default. For ALIKED, the aliked-n16rot model weights are employed, which according to the authors [13] handle slight image rotations better and are more stable under viewpoint changes.
The last base pipelines added to the evaluation are DeDoDe v2 [93], which provides an alternative end-to-end deep architecture different from the above, and LoFTR, a semi-dense detector-free end-to-end network. These latter pipelines also achieve SOTA results in popular benchmarks. The authors' implementation is employed for DeDoDe v2, setting the matching threshold to 0.01, while the LoFTR implementation available through Kornia is used in the latter case.
4.1.2 Match filtering and post-processing
Eleven match filters are applied after the base pipelines and included in the evaluation besides MOP+MiHo+NCC. For the proposed filtering sub-pipeline, all the five available combinations have been tested, i.e. MOP, MOP+NCC, NCC, MOP+MiHo, and MOP+MiHo+NCC, to analyze the behavior of each module. In particular, applying NCC after SIFT+NNR or Key.Net++NNR can highlight the improvement introduced by MOP, which uses homography-based patch normalization, against similarity or affine patches, respectively. Likewise, applying NCC to the remaining end-to-end architectures should remark the importance of patch normalization. The setup parameters indicated in Sec. 3 are employed, which worked well in preliminary experiments. Notice that the MiHo orientation adjustment described in Sec. 3.2.4 is applied even though it is not required in the case of upright images. This allows us to indirectly assess the correctness of the base assumptions and the overall robustness of the method, since in the case of a wrong orientation estimation the consequent failure can only decrease the match quality.
To provide a fair and general comparison, the analysis was restricted to filters that require as input only the keypoint positions of the matches and the image pair. This excludes approaches additionally requiring descriptor similarities [25, 28, 27], related patch transformations [71], intermediate network representations [42, 43], or a SfM framework [101]. The compared methods are GMS [26], AdaLAM [33] and CC [79] as handcrafted filters, and ACNe [35], CLNet [37], ConvMatch [40], DeMatch [41], FC-GNN [44], MS2DG-Net [39], NCMNet [38] and OANet [36] as deep modules. The implementations from the respective authors have been employed for all filters, with the default setup parameters if not stated otherwise. Except for FC-GNN and OANet, the deep filters require as input the intrinsic camera matrices of the two images, which are not commonly available in the real practical scenarios the proposed evaluation was designed for. To bypass this restriction, the approach presented in ACNe was employed, for which the intrinsic camera matrix
$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$ (47)
for an image with a resolution of $w \times h$ pixels is estimated by setting the focal length to
$f = \max(w, h)$ (48)
and the camera centre as
$(c_x, c_y) = (w/2,\, h/2)$ (49)
The above focal length estimation is quite reasonable according to the statistics reported in Fig. 5, which have been extracted from the evaluation data and are discussed in Sec. 4.1.3. Notice also that most of the deep filters have been trained on SIFT matches; nevertheless, to be robust and generalizable, they must mainly depend only on the scene and not on the kind of feature extracted by the pipeline.
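The intrinsic approximation can be sketched as follows (a minimal sketch assuming, per the focal statistics above, that the focal length roughly matches the larger image dimension; this is a heuristic, not calibrated data):

```python
import numpy as np

def approx_intrinsics(w, h):
    """Heuristic pinhole intrinsics when calibration is unavailable:
    the focal length is set to the larger image dimension (assumption
    supported by the dataset statistics discussed in the text) and the
    principal point to the image centre."""
    f = float(max(w, h))
    return np.array([[f, 0.0, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])
```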
For AdaLAM, no initial seeds or local similarity or affine clues were used as additional input, as discussed above. Nevertheless, this kind of data could have been used only for SIFT+NNR or Key.Net++NNR. For CC, the inlier threshold was experimentally set as in Sec. 3.1.2, instead of the value of 1.5 px suggested by the authors; otherwise, CC would not be able to work in general scenes with no planes. Note that AdaLAM and CC are the approaches closest to the proposed MOP+MiHo filtering pipeline. Moreover, among the compared methods, only FC-GNN is able to refine matches as MOP+MiHo+NCC does.
RANSAC is also optionally evaluated as the final post-processing step, filtering matches according to epipolar or planar constraints. Three different RANSAC implementations were initially considered, i.e. the RANSAC implementation in PoseLib (https://github.com/PoseLib/PoseLib), DegenSAC (https://github.com/ducha-aiki/pydegensac) [24] and MAGSAC (https://github.com/danini/magsac) [23]. The maximum number of iterations for all RANSACs is set uncommonly high, to provide a more robust pose estimation and to make the whole pipeline depend only on the previous filtering and refinement stages. Five different inlier threshold values
(50)
are considered. After a preliminary RANSAC ablation study666https://github.com/fb82/MiHo/tree/main/data/results according to proposed benchmark, the best general choice found is MAGSAC with 0.75 px. or 1 px. as inlier threshold , indicated by MAGSAC↓ and MAGSAC↑ respectively. On one hand, the former and more strict threshold is generally slightly better in terms of accuracy. On the other hand, the latter provides more inliers, so both results will be presented. Matching outputs without RANSAC are also reported for a complete analysis.
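The epipolar inlier test applied by these RANSAC variants can be illustrated with a small sketch. The snippet below is not the MAGSAC implementation itself; it only shows, under a known fundamental matrix, how a pixel inlier threshold is applied to the Sampson epipolar error of each match (`sampson_inliers` is a hypothetical helper).

```python
import numpy as np

def sampson_inliers(pts1, pts2, F, thr_px=1.0):
    """Boolean mask of matches whose Sampson epipolar error w.r.t. the
    fundamental matrix F falls below thr_px, mimicking the inlier test
    of RANSAC-style post-filters."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    Fx1 = x1 @ F.T        # epipolar lines in image 2
    Ftx2 = x2 @ F         # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return np.sqrt(num / den) < thr_px

# Pure horizontal camera translation: F = [e]_x with epipole e = (1, 0, 0),
# so corresponding points must share the same y coordinate.
F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
pts1 = np.array([[0., 0.], [1., 0.], [2., 5.]])
pts2 = np.array([[3., 0.], [4., 0.1], [5., 9.]])
mask = sampson_inliers(pts1, pts2, F, thr_px=1.0)  # → [True, True, False]
```

In practice this test is run inside the robust estimation loop, so tightening the threshold (e.g. 0.75 px instead of 1 px) trades inlier count for accuracy, as observed for MAGSAC↓ versus MAGSAC↑.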
Fig. 5: (5(a)) Probability distribution of the ratio between the GT focal length and the maximum image size for the non-planar outdoor MegaDepth and IMC PhotoTourism datasets, detailed in Sec. 4.1.3. (5(b)) The analogous comparison between the outdoor data and the indoor ScanNet dataset, also presented in Sec. 4.1.3, shown on a logarithmic scale for a more effective visual comparison. For the indoor scenes the ratio is fixed to 0.9, since the images were acquired with the same iPad Air2 device [55]. For the outdoor data, whose images were acquired by different cameras, the ratio is roughly concentrated around 1. A heuristic conclusion drawn by inspecting the images is that this ratio roughly corresponds to the zoom factor used when acquiring the image. (5(c)) A similar representation as a scatter plot involving the GT camera centers, confirming the previous histograms as well as the intrinsic camera approximation heuristic used in the experiments of Sec. 4.1.2. (5(d)) The 2D distributions of the ratios for the image pairs used in the evaluation, overlaid as RGB color channels. Notice that the overlap between the outdoor datasets and the corresponding indoor one is almost zero, as can be seen in (5(b)) and verified by inspecting the cross-like shape in the bottom-left corner of (5(d)). Best viewed in color and zoomed in.
4.1.3 Datasets
Four different datasets have been employed in the evaluation. These include the MegaDepth [54] and ScanNet [55] datasets, respectively outdoor and indoor. The same 1500 image pairs for each dataset and the Ground-Truth (GT) data indicated by the protocol of [14] are used. These datasets are the de-facto standard in current image matching benchmarking; sample image pairs for each dataset are shown in Fig. 1. MegaDepth test image pairs belong to only 2 different scenes and are resized proportionally so that the maximum size is equal to 1200 px, while ScanNet image pairs belong to 100 different scenes and are resized to px. Although according to a previous evaluation [12] LightGlue provides better results when the images are rescaled so that the maximum dimension is 1600 px, in this evaluation the original 1200 px resizing was maintained, since it yields more outliers and lower keypoint precision, providing a configuration in which the filtering and refinement of the matches can be better revealed. For deep methods, the best weights fitting the kind of scene, i.e. indoor or outdoor, are used when available.
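The proportional resizing applied to the MegaDepth pairs can be sketched as follows; `resize_to_max` is an illustrative helper, not part of the benchmark code.

```python
def resize_to_max(width: int, height: int, max_size: int = 1200):
    """Proportionally rescale image dimensions so that the larger side
    equals max_size, preserving the aspect ratio."""
    scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)

# A 1600x1200 image is rescaled to 1200x900 before matching.
dims = resize_to_max(1600, 1200)  # → (1200, 900)
```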
To provide further insight into the case of outdoor scenes, the Image Matching Challenge (IMC) PhotoTourism dataset (https://www.kaggle.com/competitions/image-matching-challenge-2022/data) is also employed. The IMC PhotoTourism (IMC-PT) dataset is a curated collection of sixteen scenes derived from the SfM models of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [106]. For each scene, 800 image pairs have been randomly chosen, resulting in a total of 12800 image pairs. These scenes also provide the 3D model scale as GT data, so that metric evaluations are possible. Note that the MegaDepth data are roughly a subset of IMC-PT. Furthermore, YFCC100M has often been exploited to train deep image matching methods, so it can be assumed that some of the compared deep filters are advantaged on this dataset, although this information cannot be retrieved or verified. Nevertheless, the proposed modules are handcrafted and not positively biased, so the comparison remains valid for the main aim of the evaluation.
Lastly, the Planar dataset contains 135 image pairs from 35 scenes collected from HPatches [56], the Extreme View Dataset (EVD) [57], and further datasets aggregated from [70, 32, 50]. Each scene usually includes five image pairs, except those belonging to EVD, which consist of a single image pair. The Planar dataset includes scenes with challenging viewpoint changes, possibly paired with strong illumination variations. All image pairs are adjusted to be roughly non-upright. Outdoor model weights are preferred for deep modules in the case of planar scenes. A thumbnail gallery of the Planar dataset scenes is shown in Fig. 6.
Fig. 6: Sample image pairs for each scene of the Planar dataset, detailed in Sec. 4.1.3. Frame colors indicate, in order, the original dataset of the scene: HPatches [56], EVD [57], Oxford [70], [32] and [50]. Best viewed in color and zoomed in.
4.1.4 Error metrics
For non-planar datasets, unlike most common benchmarks