[1]\fnmFabio \surBellavia
[1]\orgnameUniversità degli Studi di Palermo, \countryItaly
[2]\orgnameThe Chinese University of Hong Kong, \countryChina
[3]\orgnameBruno Kessler Foundation, \countryItaly
Image Matching Filtering and Refinement
by Planes and Beyond
Abstract
This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that the motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, and non-conforming correspondences are discarded. Moreover, the underlying planar structural design provides an explicit map between the local patches associated with the matches, enabling an optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piecewise planar approximation assumption, a further strategy is designed to minimize relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical case where camera intrinsics are unavailable. Experimental results demonstrate that the proposed non-deep learning, geometry-based approach achieves performance that is superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still development potential for actual image matching solutions in the considered research direction, which could in the future be incorporated into novel deep image matching architectures.
keywords:
image matching, keypoint refinement, planar homography, normalized cross-correlation, SIFT, SuperGlue, LoFTR
1 Introduction
1.1 Background
Image matching constitutes the backbone of most higher-level computer vision tasks and applications that require image registration to recover the 3D scene structure, including the camera poses. Image stitching [1, 2], Structure-from-Motion (SfM) [2, 3, 4, 5], Simultaneous Localization and Mapping (SLAM) [6, 2], and, more recently, Neural Radiance Fields (NeRF) [7, 8] and Gaussian Splatting (GS) [9, 10] are probably the most common and widespread tasks critically relying on image matching nowadays.
The traditional image matching paradigm can be organized into a modular pipeline comprising keypoint extraction, feature description, and the proper correspondence matching [11]. This representation is neglected in modern deep end-to-end image matching methods due to their intrinsic design, which guarantees a better global optimization at the expense of the interpretability of the system. Notwithstanding that end-to-end deep networks represent in many respects the State-Of-The-Art (SOTA) in image matching, for instance in terms of the robustness and accuracy of the estimated poses in complex scenes [12, 13, 14, 15, 16, 17, 18], handcrafted image matching pipelines such as the popular Scale Invariant Feature Transform (SIFT) [19] are still employed today in practical applications [4, 2] due to their scalability, adaptability, and understandability. Furthermore, handcrafted or deep modules remain present to post-process the output of modern monolithic deep architectures and further filter the final matches, most notably through the RAndom SAmpling Consensus (RANSAC) [20] based on geometric constraints.
Image matching filters to prune correspondences are not limited to RANSAC, which in any of its forms [21, 22, 23, 24] still remains mandatory nowadays as the final step, and have evolved from handcrafted methods [25, 26, 27, 28, 29, 30, 31, 32, 33] to deep ones [34, 35, 36, 37, 38, 39, 40, 41]. In addition, methods inspired by the coarse-to-fine paradigm, such as the Local Feature TRansformer (LoFTR) [14], have more recently been developed to refine correspondences [42, 43, 44], which can be considered as the introduction of a feedback system in the pipelined interpretation of the matching process.
1.2 Main contribution
Within the above background, this paper contributes a modular approach to filter and refine matches as a post-process before RANSAC. The key idea is that the motion flow within the images, i.e. the correspondences, can be approximated by local overlapping homography transformations [45]. This is the core concept of traditional image matching pipelines, where local patches are approximated by less constrained transformations going from a fine to a coarse scale level. Specifically, besides the similarity and affine transformations, the planar homography is introduced at a further level. The whole approach proceeds as follows:
• The Multiple Overlapping Planes (MOP) module aggregates matches into soft clusters, each one related to a planar homography roughly representing the transformation of the included matches. On one hand, matches unable to fit inside any cluster are discarded as outliers; on the other hand, the associated planar map can be used to normalize the patches of the related keypoint correspondences. As more matches are involved in the homography estimation, the patch normalization process becomes more robust than the canonical ones used for instance in SIFT. Furthermore, homographies better adapt the warp, as they are more general than similarity or affine transformations. MOP is implemented by iterating RANSAC to greedily estimate the next best plane according to a multi-model fitting strategy. The main difference from previous similar approaches [46, 47] is in maintaining matches during the iterations in terms of strong and weak inliers, according to two different reprojection errors, to guide the cluster partitioning.
• The Middle Homography (MiHo) module additionally improves the patch normalization by further minimizing the relative patch distortion. More in detail, instead of reprojecting the patch of one image into the other image, taken as reference, according to the homography associated with the cluster, both corresponding patches are reprojected into a virtual plane chosen to be in the middle of the base homography transformation. The deformation is thus globally distributed between the two patches, reducing the patch distortion with respect to the original images due to interpolation, but also to the planar approximation.
• The Normalized Cross-Correlation (NCC) is employed following traditional template matching [48, 2] to improve the keypoint localization in the reference middle homography virtual plane, using one patch as a template to be localized on the other patch. The relative shift adjustment found on the virtual plane is finally back-projected into the original images. To improve the process, the template patch is perturbed by small rotations and anisotropic scaling changes for a better search in the solution space.
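As an illustration of the template matching step, the following minimal numpy sketch (integer shifts only, without the rotation and anisotropic scaling perturbations described above; function names are illustrative) localizes a template patch inside a larger search window by maximizing the NCC:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two same-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    den = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / den) if den > 0 else 0.0

def refine_shift(template, search, radius):
    """Find the integer (dy, dx) shift of `template` inside `search`
    maximizing the NCC, scanning offsets in [-radius, radius].
    `search` must be of size (h + 2*radius, w + 2*radius)."""
    h, w = template.shape
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            win = search[radius + dy: radius + dy + h,
                         radius + dx: radius + dx + w]
            s = ncc(template, win)
            if s > best:
                best, best_shift = s, (dy, dx)
    return best_shift
```

In the actual pipeline the shift found on the virtual plane would then be back-projected into the original images through the cluster homographies.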
MOP and MiHo are purely geometry-based, requiring only keypoint coordinates, while NCC relies on the image intensity in the local area of the patches. The early idea of MOP+MiHo+NCC can be found in [49]. Furthermore, the MOP overlapping plane exploitation concept was introduced in [50], while the MiHo definition was drafted in [51]. The proposed approach is handcrafted and hence more general, as it does not require re-training to adapt to a particular kind of scene, and its behavior is more interpretable, as it is not hidden by deep architectures. As reported and discussed later in this paper, RANSAC takes noticeable benefit from MOP+MiHo as a pre-filter in most configurations and scene kinds, and in any case does not deteriorate the matches. Concerning MOP+MiHo+NCC instead, keypoint localization accuracy seems to strongly depend on the kind of extracted keypoint. In particular, for corner-like keypoints, which are predominant in detectors such as the Keypoint Network (Key.Net) [52] or SuperPoint [53], NCC greatly improves the keypoint position, while for blob-like keypoints, predominant for instance in SIFT, NCC application can degrade the keypoint localization, probably due to the different nature of their surrounding areas. Nevertheless, the proposed approach indicates that there are still margins for improvement in current image matching solutions, even the deep-based ones, exploiting the analyzed methodology.
1.3 Additional contribution
As a further contribution, an extensive comparative evaluation is provided against eleven more and less recent SOTA deep and handcrafted image matching filtering methods, on both non-planar indoor and outdoor standard benchmark datasets [54, 55, 11] and planar scenes [56, 57]. Each filtering approach, including several combinations of the three proposed modules, has been tested to post-process matches obtained with seven different image matching pipelines and end-to-end networks, also considering distinct RANSAC configurations. Moreover, unlike the mainstream evaluation taken as the current standard [14], the proposed benchmark assumes that no camera intrinsics are available, reflecting a more general, realistic, and practical scenario. On this premise, the pose error accuracy is defined, analyzed, and employed with different protocols intended to highlight the specific complexity of the pose estimation working with fundamental and essential matrices [45], respectively. This analysis reveals the many-sided nature of the image matching problem and of its solutions.
1.4 Paper organization
The rest of the paper is organized as follows. Related work is presented in Sec. 2, the design of MOP, MiHo, and NCC is discussed in Sec. 3, and the evaluation setup and results are analyzed in Sec. 4. Finally, conclusions and future work are outlined in Sec. 5.
1.5 Additional resources
Code and data are freely available online at https://github.com/fb82/MiHo. Data also include high-resolution versions of the images, plots, and tables presented in this paper, available at https://github.com/fb82/MiHo/tree/main/data/paper.
2 Related work
2.1 Disclaimer and further references
The following discussion is far from exhaustive, due to more than three decades of active research and progress in the field, which is still ongoing. The reader can refer to [58, 59, 60, 61] for a more comprehensive analysis than that reported below, which covers only the methods discussed in the rest of the paper.
2.2 Non-deep image matching
Traditionally, image matching is divided into three main steps: keypoint detection, description, and proper matching.
2.2.1 Keypoint detectors
Keypoint detection is aimed at finding salient local image regions whose feature descriptor vectors are at the same time invariant to image transformations, i.e. repeatable, and distinguishable from other non-corresponding descriptors, i.e. discriminant. Clearly, discriminability decreases as invariance increases and vice versa. Keypoints are usually categorized as corner-like or blob-like, even if there is no clear separation between them outside ad-hoc ideal images. Corners are predominantly extracted by the Harris detector [62], while blobs by SIFT [19]. Several other handcrafted detectors exist besides the Harris and SIFT ones [63, 64, 65], but, excluding deep image matching, nowadays the former and their extensions are the most commonly adopted. The Harris and SIFT keypoint extractors rely respectively on the covariance matrix of the first-order derivatives and on the Hessian matrix of the second-order derivatives. Blobs and corners can be considered analogues of each other for different derivative orders since, in the end, a keypoint in both cases corresponds to a local maximum response to a filter which is a function of the determinant of the respective kind of matrix. Blob keypoint centers tend to be pushed away from the edges characterizing the region more than corners, due to the usage of higher-order derivatives. More recently, the joint extraction and refinement of robust keypoints has been investigated by the Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring (GMM-IKRS) [66]. The idea is to extract image keypoints after several homography warps and cluster them after back-projection into the input images, so as to select the best clusters in terms of robustness and deviation, as another perspective on the Affine SIFT (ASIFT) [67] strategy.
2.2.2 Keypoint descriptors
Feature descriptors aim at embedding the keypoint characterization into numerical vectors. There has been long-standing research interest in this area, both in terms of handcrafted methods and non-deep machine learning ones, which contributed actively to the current development and the progress made by current SOTA deep image matching. Among non-deep keypoint descriptors, nowadays only the SIFT descriptor is practically still used and widespread, due to its balance between computational efficiency, scaling, and matching robustness. The SIFT descriptor is defined as the gradient orientation histogram of the local neighborhood of the keypoint, also referred to as the patch. Many high-level computer vision applications, like COLMAP [4] in the case of SfM, are based on SIFT. Besides the SIFT keypoint detector & descriptor, the Oriented FAST and Rotated BRIEF (ORB) [68] is a pipeline worth mentioning that still survives today in the deep era. ORB is the core of the popular ORB-SLAM [6] due to its computational efficiency, even if it is less robust than SIFT. Nevertheless, it seems destined to be replaced by SIFT with the increased computational capability of current hardware [69].
2.2.3 Patch normalization
Patch normalization actually takes place in between the detection and the description of the keypoints, and it is critical for the effective computation of the descriptor to be associated with the keypoint. Patch normalization is implicitly hidden inside both keypoint extraction and description, and its task roughly consists of aligning and registering patches before descriptor computation [19]. The key idea is that complex image deformations can be locally approximated by simpler transformations as long as the considered neighborhood is small. Under these assumptions, SIFT normalizes the patch by a similarity transformation, where the scale is inferred by the detector and the dominant rotation is computed from the gradient orientation histogram. Affine covariant patch normalization [70, 67, 71] was indeed the natural evolution, which can improve the normalization in case of high degrees of perspective distortion. Notice that, in the case of only small scene rotations, up-right patches computed by neglecting orientation alignment can lead to better matches, since a strong geometric constraint limiting the solution space is imposed [72]. Concerning instead patch photometric consistency, affine normalization, i.e. by mean and standard deviation, is usually the preferred choice, being sufficiently robust in the general case [19].
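As a minimal sketch, the affine photometric normalization mentioned above amounts to mapping the patch to zero mean and unit standard deviation, which cancels any affine intensity change (the function name and the stabilizing `eps` term are illustrative):

```python
import numpy as np

def normalize_patch(p, eps=1e-8):
    """Affine photometric normalization: zero mean, unit standard
    deviation, so that patches related by p -> a*p + b (a > 0)
    become (nearly) identical."""
    return (p - p.mean()) / (p.std() + eps)
```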
2.2.4 Descriptor similarity and global geometric constraints
The last step is the proper matching, which mainly pairs keypoints by considering the similarity distance between the associated descriptors. Nearest Neighbor (NN) and Nearest Neighbor Ratio (NNR) [19] are the common and simple approaches, even if additional extensions have been developed [73, 32]. For many tasks, descriptor similarity alone does not provide satisfactory results, which can instead be achieved by including a further filtering post-process that constrains matches geometrically. RANSAC provides an effective method to discard outliers under the more general epipolar geometry constraints of the scene or in the case of planar homographies. The standard RANSAC continues to be extended in several ways [21, 22]. Worth mentioning are the Degenerate SAmpling Consensus (DegenSAC) [24], to avoid degenerate configurations, and the MArGinalized SAmpling Consensus (MAGSAC) [23], for defining a robust inlier threshold, which remains the most critical parameter.
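A minimal brute-force sketch of NNR matching (Lowe's ratio test), assuming Euclidean descriptor distances, can be written as follows (function and parameter names are illustrative):

```python
import numpy as np

def match_nnr(desc1, desc2, ratio=0.8):
    """Nearest Neighbor Ratio matching, brute force.

    desc1: (n1, d) descriptors of the first image;
    desc2: (n2, d) descriptors of the second image.
    Returns the (i, j) index pairs passing the ratio test."""
    # Pairwise Euclidean distances between all descriptors.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Keep the pair only if the best distance is sufficiently
        # smaller than the second best, i.e. the match is unambiguous.
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, int(best)))
    return matches
```

Plain NN matching corresponds to dropping the ratio condition and always keeping the nearest descriptor.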
2.2.5 Local geometric constraints
RANSAC exploits global geometric constraints. A more general and relaxed geometric assumption than RANSAC's is to consider only local neighborhood consistency across the images, as done in Grid-based Motion Statistics (GMS) [26] using square neighborhoods, but circular [25] and affine-based [27, 28] neighborhoods or the Delaunay triangulation [32, 28] have also been employed. In most cases, these approaches require an initial set of robust seed matches for initialization, whose selection may become critical [74], while other approaches explicitly exploit descriptor similarities to rank matches [32]. In this respect, the vanilla RANSAC and GMS are pure geometric image filters. Notice that descriptor similarities have also been exploited successfully within RANSAC [75]. Closely connected to local filters, another class of methods is designed instead to estimate and interpolate the motion field upon which to check match consistency [29, 31, 30]. Local neighborhood consistency can also be framed within RANSAC, as done by Adaptive Locally-Affine Matching (AdaLAM) [33], which runs multiple local affine RANSACs inside the circular neighborhoods of seed matches. The AdaLAM design can be related to GroupSAC [76], which draws hypothesis model sample correspondences from different clusters, and to the more general multi-model fitting problem, for which several solutions have been investigated [77, 78]. Among these approaches, the recent Consensus Clustering (CC) [79] has been developed for the specific case of multiple planar homographies.
2.3 Deep image matching
2.3.1 Standalone pipeline modules
The deep learning evolution has provided a boost to image matching, as well as to other computer vision research fields, thanks to the evolution of deep architectures, of hardware and software computational capability and design, and to the huge amount of datasets now available for training. Hybrid pipelines arose first, replacing handcrafted descriptors with deep ones, as the feature descriptor module is the most natural and immediate to be replaced in the pipeline evolution. SOTA in this sense is represented by the Hard Network (HardNet) [80] and similar networks [81, 82], trained by contrastive learning with hard mining. Concerning keypoint extraction, better results with respect to the handcrafted counterparts were achieved by driving the network to mimic handcrafted detector design, as for Key.Net [52]. Deep learning has been successfully applied to design standalone patch normalization modules to estimate the reference patch orientation [83, 84] or more complete affine transformations, as in the Affine Network [84]. The mentioned deep modules have been employed successfully in many hybrid matching pipelines, especially in conjunction with SIFT [11] or Harris corners [85].
2.4 Joint detector & descriptor networks
A relevant characteristic of deep image matching, which made it able to exceed handcrafted image matching pipelines, was gained by allowing the concurrent optimization of both the detector & descriptor. SuperPoint [53] integrates both of them in a single network and does not require a separate yet consecutive training of the two as in [86]. A further design feature that enabled SuperPoint to become a reference in the field is the use of homographic adaptation, which consists of the use of planar homographies to generate corresponding images during the training, allowing self-supervised training. Further solutions were proposed by architectures similar to SuperPoint, as in [87]. Moreover, the Accurate and Lightweight Keypoint Detection and Descriptor Extraction (ALIKE) [88] network implements a differentiable Non-Maximum Suppression (NMS) for the keypoint selection on the score map, and the DIScrete Keypoints (DISK) [89] network employs reinforcement learning. The ALIKE design is further improved by A LIghter Keypoint and descriptor Extraction network with Deformable transformation (ALIKED) [13], using deformable local features for each sparse keypoint instead of dense descriptor maps, resulting in computational savings. Alternative rearrangements of the pipeline structure, using the same sub-network for both detector & descriptor (detect-and-describe) [90], or extracting descriptors and then keypoints (describe-then-detect) [91] instead of the common approach (detect-then-describe), have also been investigated. More recently, the idea of decoupling the two blocks by formulating the matching problem in terms of 3D tracks in large-scale SfM has been applied in Detect, Don't Describe–Describe, Don't Detect (DeDoDe) [92] and DeDoDe v2 [93] with interesting results.
2.4.1 Full end-to-end architectures
Fully end-to-end deep image matching architectures arose with SuperGlue [94], which incorporates a proper matching network exploiting self and cross attention through transformers to jointly encode keypoint positions and visual data into a graph structure, and a differentiable form of the Sinkhorn algorithm to provide the final assignments. SuperGlue represents a breakthrough in the area and has been extended by the recent LightGlue [12], improving its efficiency, accuracy, and training process. Detector-free end-to-end image matching architectures have also been proposed, which avoid explicitly computing a sparse keypoint map, considering instead a semi-dense strategy. The Local Feature TRansformer (LoFTR) [14] is a competitive realization of these matching architectures, which deploys a coarse-to-fine strategy using self and cross attention layers on dense feature maps, and has been employed as a base for further design extensions [15, 16]. Notice that end-to-end image matching networks are not fully rotationally invariant and can handle only relatively small scene rotations, except for specific architectures designed with this purpose [95, 96, 97]. The coarse-to-fine strategy following the LoFTR design has also been applied to hybrid canonical pipelines with only deep patch and descriptor modules, leading to comparable matching results in challenging scenarios at an increased computational cost [50]. More recently, dense image matching networks that directly estimate the scene as dense point maps [17] or employ Gaussian processes [18] achieve SOTA results in many benchmarks, but are more computationally expensive than sparse or semi-dense methods.
2.4.2 Outlier rejection with geometric constraints
The first effective deep matching filter module was achieved with the introduction of context normalization [34], i.e. the normalization by mean and standard deviation, which guarantees the preservation of permutation equivariance. The Attentive Context Network (ACNe) [35] further includes local and global attention to exclude outliers from context normalization, while the Order-Aware Network (OANet) [36] adds additional layers to learn how to cluster unordered sets of correspondences so as to incorporate the data context and the spatial correlation. The Consensus Learning Network (CLNet) [37] prunes matches by filtering data according to local-to-global dynamic neighborhood consensus graphs in consecutive rounds. The Neighbor Consistency Mining Network (NCMNet) [38] improves the CLNet architecture by considering different neighborhoods in the feature space, in the keypoint coordinate space, and in a further global space that merges both the previous ones through cross-attention. More recently, the Multiple Sparse Semantics Dynamic Graph Network (MS2DG-Net) [39] designs the filtering in terms of a neighborhood graph through transformers [98], to capture semantically similar structures in the local topology among correspondences. Unlike previous deep approaches, ConvMatch [40] builds a smooth motion field by making use of Convolutional Neural Network (CNN) layers to verify the consistency of the matches. DeMatch [41] refines the ConvMatch motion field estimation for a better accommodation of scene disparities by decomposing the rough motion field into several sub-fields. Finally, notice that deep differentiable RANSAC modules have been investigated as well [99, 100].
2.4.3 Keypoint refinement
The coarse-to-fine strategy trend has pushed the research towards solutions focusing on the refinement of keypoint localization as a standalone module after the coarse assignment of the matches. Worth mentioning are Pixel-Perfect SfM [101], which learns to refine multi-view tracks based on their photometric appearance and the SfM structure, and Patch2Pix [42], which refines and filters patches exploiting the image dense features extracted by the network backbone. Similar to Patch2Pix, Keypt2Subpx [43] is a recent lightweight module to refine matches only, requiring as input, besides the keypoint positions, the descriptors of the local heat map of the related correspondences. Finally, the Filtering and Calibrating Graph Neural Network (FC-GNN) [44] is an attention-based Graph Neural Network (GNN) jointly leveraging contextual and local information to both filter and refine matches. Unlike Patch2Pix and Keypt2Subpx, it does not require feature maps or descriptor confidences as input, and neither does the proposed MOP+MiHo+NCC. Similarly to FC-GNN, the proposed approach uses both global context and local information to refine matches, respectively to warp patches and to check photometric consistency, but differently from FC-GNN it currently employs only global contextual data for filtering, since it does not implement any feedback system to discard matches based on patch similarity after the alignment refinement.
粗到细策略的趋势推动了将关键点定位细化作为粗匹配分配后独立模块的研究。值得一提的是 Pixel-Perfect SfM [101],它学习根据光度外观和 SfM 结构来细化多视图轨迹,而 Patch2Pix [42]则利用网络主干提取的图像密集特征来细化并过滤补丁。与 Patch2Pix 类似,Keypt2Subpx [43]是一个近期轻量级模块,仅对需要细化的匹配进行处理,输入包括相关对应关系的局部热图描述符以及关键点位置。最后,过滤与校准图神经网络(FC-GNN)[44]是一种基于注意力的图神经网络(GNN),联合利用上下文和局部信息来过滤和细化匹配。与 Patch2Pix 和 Keypt2Subpx 不同,FC-GNN 不要求输入特征图或描述符置信度,正如所提出的 MOP+MiHo+NCC 方法一样。 与 FC-GNN 类似,所提出的方法同时利用全局上下文和局部信息来优化匹配,分别用于扭曲和对光度一致性进行检查,但与 FC-GNN 不同的是,它目前仅采用全局上下文数据,因为尚未实现任何反馈系统在对齐优化后基于块相似性丢弃匹配。
3 Proposed approach 3 提出的方法
3.1 Multiple Overlapping Planes (MOP)
3.1.1 Preliminaries
MOP assumes that the motion flow within the images can be piecewise approximated by local planar homographies.
Representing, with an abuse of notation, the reprojection of a point $\mathbf{x}$ through a non-singular planar homography $H$ as $H\mathbf{x}$, a match $m = (\mathbf{x}_1, \mathbf{x}_2)$ within two keypoints $\mathbf{x}_1$ and $\mathbf{x}_2$ is considered an inlier based on the maximum reprojection error
$\varepsilon_{H}(m) = \max\left( \lVert \mathbf{x}_2 - H\mathbf{x}_1 \rVert, \lVert \mathbf{x}_1 - H^{-1}\mathbf{x}_2 \rVert \right)$ (1)
For a set of matches $\mathcal{M}$, the inlier subset is provided according to a threshold $t$ as
$\mathcal{M}_{H}^{t} = \{\, m \in \mathcal{M} : \varepsilon_{H}(m) \leq t \,\}$ (2)
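As a concrete illustration, the symmetric maximum reprojection error of Eq. 1 and the thresholded inlier selection of Eq. 2 can be sketched as follows (a minimal NumPy sketch; the function names are ours, not part of the method's reference implementation):

```python
import numpy as np

def reproject(H, pts):
    """Reproject 2D points of shape (N, 2) through a 3x3 homography."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    q = ph @ H.T
    return q[:, :2] / q[:, 2:3]

def max_reproj_error(H, x1, x2):
    """Symmetric maximum reprojection error (Eq. 1) for matches (x1[i], x2[i])."""
    e12 = np.linalg.norm(reproject(H, x1) - x2, axis=1)
    e21 = np.linalg.norm(reproject(np.linalg.inv(H), x2) - x1, axis=1)
    return np.maximum(e12, e21)

def inlier_mask(H, x1, x2, t):
    """Boolean mask of matches with maximum reprojection error within t (Eq. 2)."""
    return max_reproj_error(H, x1, x2) <= t
```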
A proper scene planar transformation must be quasi-affine, in order to preserve the convex hull [45] with respect to the minimum sampled model generating $H$ in RANSAC, so the inlier set effectively employed is
$\widehat{\mathcal{M}}_{H}^{t} = \mathcal{M}_{H}^{t} \cap \mathcal{Q}_{H}$ (3)
where $\mathcal{Q}_{H}$ is the set of matches in $\mathcal{M}$ satisfying the quasi-affinity property with respect to the RANSAC minimum model of $H$, described later in Eq. 12.
3.1.2 Main loop
MOP iterates through a loop where a new homography is discovered and the match set is reduced for the next iteration. Iterations halt when the counter of sequential failures at the end of a cycle reaches the maximum allowable value. If not incremented at the end of the cycle, the failure counter is reset to 0. As output, MOP returns the set of found homographies, which covers the image visual flow.
Starting from the initial set of input matches provided by the image matching pipeline, at each iteration MOP looks for the best planar homography compatible with the current match set according to a relaxed inlier threshold. When the number of inliers provided by the candidate homography is less than the minimum required amount, i.e.
(4)
the current homography is discarded and the failure counter is incremented, without updating the iteration counter, the next match set, or the output list. Otherwise, MOP removes the inlier matches of the found homography according to a removal threshold for the next iteration
(5)
where the removal threshold adapts as described next, and updates the output list
(6)
The removal threshold adapts to the evolution of the homography search: it is equal to a strict threshold in case the cardinality of the inlier set under the strict threshold is greater than half of the relaxed inlier set, i.e. in the worst case no more than half of the added matches should belong to previous overlapping planes. If this condition does not hold, the removal threshold turns into the relaxed one and the failure counter is incremented
(7)
Using the relaxed threshold to select the best homography is required since the real motion flow is only locally approximated by planes, and removing only strong inliers for the next iteration guarantees smooth changes within overlapping planes and more robustness to noise. Nevertheless, in case of slow convergence, or when the homography search gets stuck around a wide overlapping or noisy planar configuration, limiting the search to matches excluded by previous planes can provide a way out. The final set of homographies returned is
(8)
found by the last iteration, reached when the failure counter hits its maximum allowable value.
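The main loop just described can be summarized by the following skeleton (a simplified sketch with illustrative parameter names and default values; the RANSAC fitting is abstracted away as a callback returning a homography and the per-match reprojection errors):

```python
import numpy as np

def mop_loop(matches, fit_homography, t_relaxed=2.0, t_strict=1.0,
             min_inliers=12, max_failures=3):
    """Greedy multi-plane extraction: repeatedly fit a homography to the
    remaining matches, keep it if it gathers enough inliers, and remove
    its (strong) inliers before the next round."""
    homographies, failures = [], 0
    current = matches
    while failures < max_failures and len(current) >= min_inliers:
        H, err = fit_homography(current)     # RANSAC on the current set
        relaxed = err <= t_relaxed
        if relaxed.sum() < min_inliers:
            failures += 1                    # discard H, count a failure
            continue
        strict = err <= t_strict
        if strict.sum() > relaxed.sum() / 2:
            remove = strict                  # adaptive removal threshold
            failures = 0                     # reset on a clean success
        else:
            remove = relaxed                 # slow convergence detected
            failures += 1
        homographies.append(H)
        current = current[~remove]
    return homographies
```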
3.1.3 Inside RANSAC
Within each MOP iteration, a homography is extracted by RANSAC on the current match set. The vanilla RANSAC is modified according to the following heuristics to improve robustness and avoid degenerate cases.
A minimum number of RANSAC loop iterations is required besides the maximum number of iterations. In addition, a minimum hypothesis sampled model
(9)
is discarded if the distance within the keypoints in one of the images is less than the relaxed threshold, i.e. it must hold that
(10)
Furthermore, with the corresponding homography derived by the normalized 8-point algorithm [45], the smallest singular value found by solving the associated homogeneous system through SVD must be sufficiently greater than zero for a stable solution not close to degeneracy. According to these observations, the minimum singular value is constrained to be greater than 0.05. Note that an absolute threshold is feasible since the data are normalized before the SVD.
Next, matches in the sampled model must satisfy quasi-affinity with respect to the derived homography. This can be verified by checking that the signs of the last coordinate of the reprojections in non-normalized homogeneous coordinates are the same for all the four keypoints in the first image. This can be expressed as
(11)
where the subscript denotes the third and last element of the point vector in non-normalized homogeneous coordinates. Following the above discussion, the quasi-affine set of matches in Eq. 3 is formally defined as
(12)
For better numerical stability, the analogous checks are also executed in the reverse direction through the inverse homography.
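The quasi-affinity sign test can be sketched as follows (our own minimal formulation: the sign of the last homogeneous coordinate after reprojection must agree with the consistent sign of the sampled model points):

```python
import numpy as np

def last_coord_sign(H, pts):
    """Sign of the third homogeneous coordinate of H @ [x, y, 1]^T."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return np.sign((ph @ H.T)[:, 2])

def quasi_affine_mask(H, pts, sample_pts):
    """Matches in pts pass if their reprojection sign agrees with the
    (consistent) sign of the RANSAC minimal sample points."""
    ref = last_coord_sign(H, sample_pts)
    if not np.all(ref == ref[0]):
        return np.zeros(len(pts), dtype=bool)  # degenerate sample: reject all
    return last_coord_sign(H, pts) == ref[0]
```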
Lastly, to speed up the RANSAC search, a buffer is globally maintained to contain the top discarded sub-optimal homographies encountered within a RANSAC run. For the current match set, with the current best RANSAC model maximizing the number of inliers, the buffered homographies are ordered by their number of inliers, excluding the inliers compatible with the previous homographies in the buffer, i.e. it must hold that
(13)
where
(14)
The buffer can be updated when the number of inliers of the homography associated with the current sampled hypothesis is greater than the buffer minimum, by inserting the new homography, removing the lowest-scoring one, and re-sorting the buffer. Moreover, at the beginning of a RANSAC run inside a MOP iteration, the minimum hypothesis model sampling sets corresponding to the homographies in the buffer are evaluated before proceeding with true random samples, to provide a bootstrap for the solution search. This also guarantees a global sequential inspection of the overall solution search space within MOP.
3.1.4 Homography assignment
After the final set of homographies is computed, the filtered match set is obtained by removing all matches for which there is no homography in the set having them as inliers, i.e.
(15)
Next, a homography must be associated with each surviving match, which will be used for patch normalization.
A possible choice is to assign the homography which gives the minimum reprojection error, i.e.
(16)
However, this choice is likely to select homographies corresponding to narrow planes in the image with small consensus sets, which leads to a low reprojection error but a more unstable and noise-prone assignment.
Another choice is to assign to the match the compatible homography with the widest consensus set, i.e.
(17)
which points to more stable and robust homographies, but can lead to incorrect patches when the corresponding reprojection error is not among the minimum ones.
The homography actually chosen for the match association provides a compromise between the above. Specifically, for a match, defining the reference count as the median of the inlier counts of its top 5 compatible homographies ranked by number of inliers, the selected homography is the compatible one with the minimum reprojection error among those with an inlier number at least equal to this median
(18)
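The selection rule can be sketched as follows (illustrative NumPy sketch of the heuristic described above; `errs` and `n_inliers` are the per-homography reprojection errors and inlier counts for a single match, and the function name is ours):

```python
import numpy as np

def assign_homography(errs, n_inliers, top=5):
    """For one match, pick the compatible homography with minimum
    reprojection error among those whose inlier count is at least the
    median inlier count of the top-`top` compatible homographies."""
    errs = np.asarray(errs, dtype=float)
    n = np.asarray(n_inliers, dtype=float)
    top_counts = np.sort(n)[::-1][:top]      # largest inlier counts first
    thresh = np.median(top_counts)
    candidates = np.where(n >= thresh)[0]    # stable-enough homographies
    return candidates[np.argmin(errs[candidates])]
```

Note how a homography with the globally minimum error but a tiny consensus set is rejected in favor of a slightly worse but far more supported one.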
Figure 1 shows an example of the achieved solution directly in combination with the MiHo homography estimation described hereafter in Sec. 3.2, since MOP and MOP+MiHo do not present appreciable differences in the visual cluster representation. Keypoints belonging to discarded matches are marked with black diamonds, while the clusters highlighted by other combinations of markers and colors indicate the resulting filtered matches with the selected planar homography. Notice that clusters are not designed to highlight scene planes, but points that move according to the same homography transformation within the image pair.
3.2 Middle Homography (MiHo)
3.2.1 Rationale
The homography estimated by Eq. 18 for a match allows reprojecting the local patch neighborhood of one keypoint onto the corresponding one centered on the other keypoint. Nevertheless, this homography only approximates the true transformation within the images, and becomes less accurate as the distance from the keypoint center increases. This implies that patch alignment could be invalid for wider, and theoretically more discriminant, patches.
The idea behind MiHo is to break the homography in the middle into two homographies whose composition gives the original one, so that the patch in the first image gets deformed less than by the full homography, and likewise for the patch in the second image. This means that, visually, a unit-area square on the reprojected patch should remain almost similar to the original one. As this must hold for both images, the deformation error must be distributed almost equally between the two homographies. Moreover, since interpolation degrades with up-sampling, MiHo aims to balance the down-sampling of the patch at the finer resolution and the up-sampling of the patch at the coarser resolution.
3.2.2 Implementation
While a strict analytical formulation of the above problem leading to a practical satisfying solution is not easy to derive, the heuristic found in [51] can be exploited to modify RANSAC within MOP to account for the required constraints. Specifically, each match $(\mathbf{x}_1, \mathbf{x}_2)$ is replaced by the two corresponding matches $(\mathbf{x}_1, \mathbf{x}_m)$ and $(\mathbf{x}_2, \mathbf{x}_m)$, where $\mathbf{x}_m$ is the midpoint of the two keypoints
$\mathbf{x}_m = \tfrac{1}{2}(\mathbf{x}_1 + \mathbf{x}_2)$ (19)
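The match splitting can be sketched as follows (minimal sketch, with our own function name):

```python
import numpy as np

def split_by_midpoint(x1, x2):
    """Replace each match (x1, x2) by the two half-matches (x1, xm) and
    (x2, xm) through the midpoint xm = (x1 + x2) / 2, so that RANSAC can
    fit the two MiHo homographies toward the common middle plane."""
    xm = (np.asarray(x1) + np.asarray(x2)) / 2.0
    return (x1, xm), (x2, xm)
```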
Hence, the RANSAC input match set for the MOP iteration is split into two match sets, one per image. Within a RANSAC iteration, a sample defined by Eq. 9 is likewise split into
(20)
leading to two concurrent homographies to be verified simultaneously with the inlier set, in an analogous way to Eq. 3
(21)
where, for two generic match sets, the join operator rejoins split matches according to
(22)
All the other MOP steps defined in Sec. 3.1, including the RANSAC degeneracy checks, the threshold adaptation, and the final homography assignment, follow straightforwardly in an analogous way. Overall, the principal difference is that a pair of homographies is obtained instead of a single one, operating simultaneously on the two split match sets instead of the original one.
Figure 2 shows the differences when applying MiHo to align planar scenes with respect to directly reprojecting onto any of the two images considered as reference. It can be noted that the distortion differences are overall reduced, as highlighted by the corresponding grid deformation. Moreover, the MiHo homography strongly resembles an affine transformation, or even a similarity, i.e. a rotation and scale change only, imposing in some sense a tighter constraint on the transformation than the base homography. To balance this additional restraint with respect to the original MOP, in MOP+MiHo the minimum required amount of inliers for incrementing the failure counter defined by Eq. 4 has been relaxed after experimental validation.
3.2.3 Translation invariance
MiHo is invariant to translations as long as the algorithm used to extract the homography provides the same invariance. This can be easily verified: when translation vectors are respectively added to all the keypoints of the two images, the corresponding midpoints are of the form
(23)
where a fixed translation vector is added to all the corresponding original midpoints. Overall, translating the keypoints in the respective images has the effect of adding a fixed translation to the corresponding midpoints. Hence, MiHo is invariant to translations, since the normalization employed by the normalized 8-point algorithm is invariant to similarity transformations [45], including translations.
3.2.4 Fixing rotations
MiHo is not rotation invariant. Figure 3 illustrates how midpoints change with rotations. In the ideal MiHo configuration, where the images are upright, the area of the image formed in the midpoint plane lies within the areas of the original ones, forming a single cone when the images are considered as stacked. The area of the midpoint plane is lower than the minimum area between the two images in the presence of a relative rotation, and could degenerate to the apex of the corresponding double cone. A simple strategy to get an almost-optimal configuration was experimentally verified to work well in practice, under the observation that images acquired by cameras tend to have a relative rotation close to a multiple of 90°.
In detail, given an input match, consider the midpoint obtained after rotating the second image by a given multiple of 90°. One has to choose the rotation maximizing the number of configurations for which the midpoint inter-distances lie between the corresponding keypoint inter-distances
(24)
where
(25)
and [·] is the Iverson bracket. The global orientation estimation can be computed efficiently on the initial input matches before running MOP, adjusting the keypoints accordingly. The final homographies can be adjusted in turn by removing the rotation previously applied. Figure 4 shows the MOP+MiHo result obtained on a set of matches without and with the orientation pre-processing, highlighting the better matches obtained in the latter case.
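The orientation selection heuristic of Sec. 3.2.4 can be sketched as follows (our own simplification: the index pairing used to sample inter-distances is illustrative, and the function name is not from the reference implementation):

```python
import numpy as np

def best_right_angle_rotation(x1, x2):
    """Pick the rotation of the second image among multiples of 90 degrees
    that maximizes how often the midpoint inter-distances fall between the
    corresponding keypoint inter-distances of the two images."""
    n = len(x1)
    i = np.arange(n)
    j = np.roll(i, 1)                      # compare consecutive keypoints
    rots = {0: np.eye(2),
            90: np.array([[0.0, -1.0], [1.0, 0.0]]),
            180: -np.eye(2),
            270: np.array([[0.0, 1.0], [-1.0, 0.0]])}
    d1 = np.linalg.norm(x1[i] - x1[j], axis=1)
    best_ang, best_score = 0, -1
    for ang, R in rots.items():
        y2 = x2 @ R.T                      # rotate second image keypoints
        xm = (x1 + y2) / 2.0               # MiHo midpoints
        dm = np.linalg.norm(xm[i] - xm[j], axis=1)
        d2 = np.linalg.norm(y2[i] - y2[j], axis=1)
        lo, hi = np.minimum(d1, d2), np.maximum(d1, d2)
        score = int(np.sum((lo <= dm) & (dm <= hi)))
        if score > best_score:
            best_ang, best_score = ang, score
    return best_ang
```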
Figure 3: MiHo midpoints and rotations. The blue and red quadrilaterals are related by the homography defined by the four corner correspondences, indicated by dashed gray lines. The midpoints of the corresponding corners determine the reference green quadrilateral and the two MiHo planar homographies relating it to the original quadrilaterals. In the optimal MiHo configuration (a), the area of the quadrilateral defined by the midpoints is maximized. With an incremental relative rotation within the two original quadrilaterals, this area decreases from (b) to its minimum in (c). Rotating further from the minimum, the area increases again through (d) up to the maximum. As a heuristic, in the optimal configuration the distance between any two midpoint corners should lie within the range of the distances between the corresponding original image corners, see Sec. 3.2.4. Note that, for specific examples, in the worst case the plane orientation constraint could also be violated [45]. Best viewed in color and zoomed in.
(a) MOP+MiHo without rotation fixing
(b) MOP+MiHo with rotation fixed
Figure 4: MOP+MiHo visual clusters (4(a)) without handling rotations and (4(b)) when handling a worst-case relative rotation as described in Sec. 3.2.4. The same notation and image pair of Fig. 1 are used. Although the results after handling rotations are clearly better, they are not identical to Fig. 1, since the input matches provided by Key.Net differ, as OriNet [84] has been further included in the pipeline to compute the patch orientations. Notice that the base MOP achieves results similar to (4(b)) without requiring any special handling. Best viewed in color and zoomed in.
3.3 Normalized Cross Correlation (NCC)
3.3.1 Local image warping
NCC is employed to effectively refine the corresponding keypoint localization. The approach assumes that corresponding keypoints have been roughly aligned locally by the transformations provided as a homography pair. In the warped aligning space, template matching [48] is employed to refine the translation offsets so as to maximize the NCC peak response when the patch of one keypoint is employed as a filter on the image of the corresponding keypoint.
For a match within the two images there can be multiple potential choices of the warping homography pair, denoted by the set of associated homography pairs. In detail, according to the previous steps of the matching pipeline, one candidate is the base homography pair, defined as
• the identity matrix pair, when no local neighborhood data are provided, e.g. for SuperPoint;
• the pair of affine or similarity transformations, in the case of an affine covariant detector & descriptor, as for AffNet or SIFT respectively;
and the other is the extended pair, defined as
• the homography of Eq. 18 paired with the identity, in case MOP is used without MiHo;
• the MiHo homography pair, as described in Sec. 3.2.1, in the case of MOP+MiHo;
• the base pair, otherwise.
Since the found transformations are approximated, the extended homography pair is perturbed in practice by small shear factors and rotation changes through the affine transformations
(26)
with
(27)
(28)
so that
(29)
3.3.2 Computation
Let us denote a generic image by $I$, the intensity value of $I$ at a generic pixel $\mathbf{p}$ as $I(\mathbf{p})$, and the window radius extension as $r$, defined by
(30)
such that the squared window set of offsets is
$\mathcal{W} = \{-r, \ldots, r\} \times \{-r, \ldots, r\}$ (31)
with cardinality equal to
$|\mathcal{W}| = (2r+1)^2$ (32)
The squared patch of $I$ centered at $\mathbf{p}$ with radius $r$ is the pixel set
$P_I(\mathbf{p}) = \{\, \mathbf{p} + \mathbf{d} : \mathbf{d} \in \mathcal{W} \,\}$ (33)
while the mean intensity value of the patch is
$\mu_I(\mathbf{p}) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} I(\mathbf{p} + \mathbf{d})$ (34)
and the variance is
$\sigma^2_I(\mathbf{p}) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} \left( I(\mathbf{p} + \mathbf{d}) - \mu_I(\mathbf{p}) \right)^2$ (35)
The intensity value normalized by the mean and standard deviation of the patch, for a pixel $\mathbf{p} + \mathbf{d}$ in $P_I(\mathbf{p})$, is
$\hat{I}_{\mathbf{p}}(\mathbf{d}) = \frac{I(\mathbf{p} + \mathbf{d}) - \mu_I(\mathbf{p})}{\sigma_I(\mathbf{p})}$ (36)
which is ideally robust to affine illumination changes [48]. Lastly, the similarity between two patches $P_{I_1}(\mathbf{p}_1)$ and $P_{I_2}(\mathbf{p}_2)$ is
$\mathrm{NCC}(\mathbf{p}_1, \mathbf{p}_2) = \frac{1}{|\mathcal{W}|} \sum_{\mathbf{d} \in \mathcal{W}} \hat{I}_{1,\mathbf{p}_1}(\mathbf{d})\, \hat{I}_{2,\mathbf{p}_2}(\mathbf{d})$ (37)
so that, for a match and the set of associated homography pairs derived in Sec. 3.3.1, the refined keypoint offsets and the best aligning homography pair are given by
(38)
where the images are warped by the respective homographies, the pixel coordinates are transformed accordingly, and
(39)
is the set of offset translations to check, considering in turn one of the images as template, since this is not a symmetric process. Notice that the two offsets cannot both be non-zero, according to Eq. 39. The final refined match
(40)
is obtained by reprojecting back onto the original images, i.e.
(41)
The normalized cross-correlation can be computed efficiently by convolution, where bilinear interpolation is employed to warp the images. The patch radius, which is also the offset search radius, has been set experimentally. According to preliminary experiments, a wider radius extension can break the planar neighborhood assumption and the patch alignment.
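A toy single-match sketch of the template matching step follows, operating on already-warped grayscale images; the actual implementation runs by convolution with bilinear warping, as stated above, and the parameter values here are illustrative only:

```python
import numpy as np

def ncc_refine(img1, img2, p1, p2, r=5, search=5):
    """Translate p2 within +/- search px so as to maximize the normalized
    cross-correlation between the (2r+1)^2 patch around p1 in img1 (the
    template) and the candidate patch in img2. Returns the best integer
    offset (dx, dy) and the corresponding NCC score."""
    def patch(img, p):
        x, y = int(p[0]), int(p[1])
        return img[y - r:y + r + 1, x - r:x + r + 1].astype(float)

    def normalize(P):                        # Eq.-36-style normalization
        return (P - P.mean()) / (P.std() + 1e-12)

    T = normalize(patch(img1, p1))
    best_score, best_off = -np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            P = normalize(patch(img2, (p2[0] + dx, p2[1] + dy)))
            score = float(np.mean(T * P))    # Eq.-37-style similarity
            if score > best_score:
                best_score, best_off = score, (dx, dy)
    return best_off, best_score
```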
3.3.3 Sub-pixel precision
The refinement offset can be further enhanced by parabolic interpolation, adding sub-pixel precision [102]. Notice that other forms of interpolation have been investigated, but they did not provide any sub-pixel accuracy improvements according to previous experiments [49]. The NCC response map in the best warped space, with origin at the peak value, can be written as
(42)
computed as in Sec. 3.3.2. The sub-pixel refinement offset in the horizontal direction is computed as the vertex of a parabola fitted on the horizontal neighborhood centered at the peak of the maximized NCC response map, i.e.
(43)
and analogously for the vertical offset. Explicitly, denoting by $s(\cdot)$ the response map restricted to the horizontal neighborhood of the peak, the sub-pixel refinement offset is
$\delta = \frac{s(-1) - s(1)}{2\left(s(-1) - 2\,s(0) + s(1)\right)}$ (44)
and, following Sec. 3.3.2, the final refined match
(45)
becomes
(46)
where the Iverson bracket zeroes the sub-pixel offset increment in the reference image, according to Eq. 39.
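The one-dimensional parabolic refinement can be sketched as follows (the function name is ours; the same computation is applied independently to the horizontal and vertical directions):

```python
def parabola_peak_offset(s_minus, s_0, s_plus):
    """Sub-pixel offset of a discrete peak from the three response values
    at -1, 0 and +1: the vertex of the interpolating parabola."""
    denom = s_minus - 2.0 * s_0 + s_plus
    if denom == 0.0:            # flat neighborhood: no refinement
        return 0.0
    return 0.5 * (s_minus - s_plus) / denom
```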
4 Evaluation
4.1 Setup
To ease the reading, each pipeline module or error measure included in the results is highlighted in bold.
4.1.1 Base pipelines
Seven base pipelines have been tested to provide the input matches to be filtered. For each of these, the maximum number of keypoints was set to 8000 to retain as many matches as possible. This leads to including more outliers, better assessing the filtering potential. Moreover, nowadays benchmarks and deep matching architecture designs assume upright image pairs, where the relative orientation between images is roughly bounded. This additional constraint allows retrieving better initial matches in common user acquisition scenarios. Current general standard benchmark datasets, including those employed in this evaluation, are built so that the image pairs do not violate this constraint. Notice that many SOTA joint and end-to-end deep architectures do not tolerate strong rotations within images by design.
SIFT+NNR [19] is included as the standard and reference handcrafted pipeline. The OpenCV [103] implementation was used for SIFT, exploiting RootSIFT [104] for descriptor normalization, while the NNR implementation is provided by Kornia [105]. To stress the match filtering robustness of the successive steps of the matching pipeline, which is the goal of this evaluation, the NNR ratio threshold was set rather high, i.e. to 0.95, while commonly adopted values are lower and depend on the scene complexity, with higher values yielding less discriminative matches and thus possible outliers. Moreover, upright patches are employed by zeroing the dominant orientation, for a fair comparison with other recent deep approaches.
Key.Net++NNR [11], a modular pipeline that achieves good results in common benchmarks, is also taken into account. Excluding the NNR stage, it can be considered a modular deep pipeline. The Kornia implementation is used for the evaluation. As for SIFT, the NNR threshold is set very high, i.e. to 0.99, while more common values are lower. The deep orientation estimation module OriNet [84], generally siding AffNet, is not included, so as to provide upright matches.
The other base pipelines considered are SOTA fully end-to-end matching networks. In detail, these are SuperPoint+LightGlue, DISK+LightGlue and ALIKED+LightGlue, as implemented in [12]. The input images are rescaled so that the smaller size is 1024 px, following the LightGlue default. For ALIKED, the aliked-n16rot model weights are employed, which according to the authors [13] handle slight image rotations better and are more stable under viewpoint changes.
The last base pipelines added to the evaluation are DeDoDe v2 [93], which provides an alternative end-to-end deep architecture different from the above, and LoFTR, a semi-dense detector-free end-to-end network. These latter pipelines also achieve SOTA results in popular benchmarks. The authors' implementation is employed for DeDoDe v2, setting the matching threshold to 0.01, while the LoFTR implementation available through Kornia is used in the latter case.
4.1.2 Match filtering and post-processing
Eleven match filters are applied after the base pipelines and included in the evaluation besides MOP+MiHo+NCC. For the proposed filtering sub-pipeline, all the five available combinations have been tested, i.e. MOP, MOP+NCC, NCC, MOP+MiHo, and MOP+MiHo+NCC, to analyze the behavior of each module. In particular, applying NCC after SIFT+NNR or Key.Net++NNR can highlight the improvement introduced by MOP, which uses homography-based patch normalization, against similarity or affine patches, respectively. Likewise, applying NCC to the remaining end-to-end architectures should remark the importance of patch normalization. The setup parameters indicated in Sec. 3 are employed, which worked well in preliminary experiments. Notice that the MiHo orientation adjustment described in Sec. 3.2.4 is applied even though it is not required in the case of upright images. This allows us to indirectly assess the correctness of the base assumptions and the overall robustness of the method, since in the case of a wrong orientation estimation the consequent failure can only decrease the match quality.
To provide a fair and general comparison, the analysis was restricted to filters that require as input only the keypoint positions of the matches and the image pair. This excludes approaches additionally requiring descriptor similarities [25, 28, 27], related patch transformations [71], intermediate network representations [42, 43], or a SfM framework [101]. The compared methods are GMS [26], AdaLAM [33] and CC [79] as handcrafted filters, and ACNe [35], CLNet [37], ConvMatch [40], DeMatch [41], FC-GNN [44], MS2DG-Net [39], NCMNet [38] and OANet [36] as deep modules. The implementations from the respective authors have been employed for all filters, with the default setup parameters if not stated otherwise. Except for FC-GNN and OANet, the deep filters require as input the intrinsic camera matrices of the two images, which are not commonly available in the real practical scenarios the proposed evaluation was designed for. To bypass this restriction, the approach presented in ACNe was employed, for which the intrinsic camera matrix
$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$ (47)
for an image with a resolution of $w \times h$ pixels is estimated by setting the focal length to
$f = \max(w, h)$ (48)
and the camera centre as
$(c_x, c_y) = (w/2,\, h/2)$ (49)
The above focal length estimation is quite reasonable according to the statistics reported in Fig. 5, which have been extracted from the evaluation data and are discussed in Sec. 4.1.3. Notice also that most of the deep filters have been trained on SIFT matches; nevertheless, to be robust and generalizable, they must mainly depend only on the scene and not on the kind of feature extracted by the pipeline.
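The intrinsic approximation can be sketched as follows (a minimal sketch assuming, per the focal statistics above, that the focal length roughly matches the larger image dimension; this is a heuristic, not calibrated data):

```python
import numpy as np

def approx_intrinsics(w, h):
    """Heuristic pinhole intrinsics when calibration is unavailable:
    the focal length is set to the larger image dimension (assumption
    supported by the dataset statistics discussed in the text) and the
    principal point to the image centre."""
    f = float(max(w, h))
    return np.array([[f, 0.0, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])
```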
For AdaLAM, no initial seeds or local similarity or affine clues were used as additional input, as discussed above. Nevertheless, this kind of data could have been used only for SIFT+NNR or Key.Net++NNR. For CC, the inlier threshold was experimentally set as in Sec. 3.1.2, instead of the value of 1.5 px suggested by the authors; otherwise, CC would not be able to work in general scenes with no planes. Note that AdaLAM and CC are the approaches closest to the proposed MOP+MiHo filtering pipeline. Moreover, among the compared methods, only FC-GNN is able to refine matches as MOP+MiHo+NCC does.
RANSAC is also optionally evaluated as the final post-processing step, filtering matches according to epipolar or planar constraints. Three different RANSAC implementations were initially considered, i.e. the RANSAC implementation in PoseLib (https://github.com/PoseLib/PoseLib), DegenSAC (https://github.com/ducha-aiki/pydegensac) [24] and MAGSAC (https://github.com/danini/magsac) [23]. The maximum number of iterations for all RANSACs is set uncommonly high, to provide a more robust pose estimation and to make the whole pipeline depend only on the previous filtering and refinement stages. Five different inlier threshold values
(50)
are considered. After a preliminary RANSAC ablation study666https://github.com/fb82/MiHo/tree/main/data/results according to proposed benchmark, the best general choice found is MAGSAC with 0.75 px. or 1 px. as inlier threshold , indicated by MAGSAC↓ and MAGSAC↑ respectively. On one hand, the former and more strict threshold is generally slightly better in terms of accuracy. On the other hand, the latter provides more inliers, so both results will be presented. Matching outputs without RANSAC are also reported for a complete analysis.
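The epipolar inlier test applied by these RANSAC variants can be illustrated with a small sketch. The snippet below is not the MAGSAC implementation itself; it only shows, under a known fundamental matrix, how a pixel inlier threshold is applied to the Sampson epipolar error of each match (`sampson_inliers` is a hypothetical helper).

```python
import numpy as np

def sampson_inliers(pts1, pts2, F, thr_px=1.0):
    """Boolean mask of matches whose Sampson epipolar error w.r.t. the
    fundamental matrix F falls below thr_px, mimicking the inlier test
    of RANSAC-style post-filters."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    Fx1 = x1 @ F.T        # epipolar lines in image 2
    Ftx2 = x2 @ F         # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return np.sqrt(num / den) < thr_px

# Pure horizontal camera translation: F = [e]_x with epipole e = (1, 0, 0),
# so corresponding points must share the same y coordinate.
F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
pts1 = np.array([[0., 0.], [1., 0.], [2., 5.]])
pts2 = np.array([[3., 0.], [4., 0.1], [5., 9.]])
mask = sampson_inliers(pts1, pts2, F, thr_px=1.0)  # → [True, True, False]
```

In practice this test is run inside the robust estimation loop, so tightening the threshold (e.g. 0.75 px instead of 1 px) trades inlier count for accuracy, as observed for MAGSAC↓ versus MAGSAC↑.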
Fig. 5: (5(a)) Probability distribution of the ratio between the GT focal length and the maximum image size for the non-planar outdoor MegaDepth and IMC PhotoTourism datasets, detailed in Sec. 4.1.3. (5(b)) The analogous comparison between the outdoor data and the indoor ScanNet dataset, also presented in Sec. 4.1.3, shown on a logarithmic scale for a more effective visual comparison. For the indoor scenes the ratio is fixed to 0.9, since the images were acquired with the same iPad Air2 device [55]. For the outdoor data, whose images were acquired by different cameras, the ratio is roughly concentrated around 1. A heuristic conclusion drawn by inspecting the images is that this ratio roughly corresponds to the zoom factor used when acquiring the image. (5(c)) A similar representation as a scatter plot involving the GT camera centers, confirming the previous histograms as well as the intrinsic camera approximation heuristic used in the experiments of Sec. 4.1.2. (5(d)) The 2D distributions of the ratios for the image pairs used in the evaluation, overlaid as RGB color channels. Notice that the overlap between the outdoor datasets and the corresponding indoor one is almost zero, as can be seen in (5(b)) and verified by inspecting the cross-like shape in the bottom-left corner of (5(d)). Best viewed in color and zoomed in.
4.1.3 Datasets
Four different datasets have been employed in the evaluation. These include the MegaDepth [54] and ScanNet [55] datasets, respectively outdoor and indoor. The same 1500 image pairs for each dataset and the Ground-Truth (GT) data indicated by the protocol of [14] are used. These datasets are the de-facto standard in current image matching benchmarking; sample image pairs for each dataset are shown in Fig. 1. MegaDepth test image pairs belong to only 2 different scenes and are resized proportionally so that the maximum size is equal to 1200 px, while ScanNet image pairs belong to 100 different scenes and are resized to px. Although according to a previous evaluation [12] LightGlue provides better results when the images are rescaled so that the maximum dimension is 1600 px, in this evaluation the original 1200 px resizing was maintained, since it yields more outliers and lower keypoint precision, providing a configuration in which the filtering and refinement of the matches can be better revealed. For deep methods, the best weights fitting the kind of scene, i.e. indoor or outdoor, are used when available.
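The proportional resizing applied to the MegaDepth pairs can be sketched as follows; `resize_to_max` is an illustrative helper, not part of the benchmark code.

```python
def resize_to_max(width: int, height: int, max_size: int = 1200):
    """Proportionally rescale image dimensions so that the larger side
    equals max_size, preserving the aspect ratio."""
    scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)

# A 1600x1200 image is rescaled to 1200x900 before matching.
dims = resize_to_max(1600, 1200)  # → (1200, 900)
```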
To provide further insight into the case of outdoor scenes, the Image Matching Challenge (IMC) PhotoTourism dataset (https://www.kaggle.com/competitions/image-matching-challenge-2022/data) is also employed. The IMC PhotoTourism (IMC-PT) dataset is a curated collection of sixteen scenes derived from the SfM models of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [106]. For each scene, 800 image pairs have been randomly chosen, resulting in a total of 12800 image pairs. These scenes also provide the 3D model scale as GT data, so that metric evaluations are possible. Note that the MegaDepth data are roughly a subset of IMC-PT. Furthermore, YFCC100M has often been exploited to train deep image matching methods, so it can be assumed that some of the compared deep filters are advantaged on this dataset, although this information cannot be retrieved or verified. Nevertheless, the proposed modules are handcrafted and not positively biased, so the comparison remains valid for the main aim of the evaluation.
Lastly, the Planar dataset contains 135 image pairs from 35 scenes collected from HPatches [56], the Extreme View Dataset (EVD) [57], and further datasets aggregated from [70, 32, 50]. Each scene usually includes five image pairs, except those belonging to EVD, which consist of a single image pair. The Planar dataset includes scenes with challenging viewpoint changes, possibly paired with strong illumination variations. All image pairs are adjusted to be roughly non-upright. Outdoor model weights are preferred for deep modules in the case of planar scenes. A thumbnail gallery of the Planar dataset scenes is shown in Fig. 6.
Fig. 6: Sample image pairs for each scene of the Planar dataset, detailed in Sec. 4.1.3. Frame colors indicate, in order, the original dataset of the scene: HPatches [56], EVD [57], Oxford [70], [32] and [50]. Best viewed in color and zoomed in.
4.1.4 Error metrics
For non-planar datasets, unlike most common benchmarks