Image Matching Filtering and Refinement by Planes and Beyond

Fabio Bellavia $^{1*}$, Zhenjun Zhao $^{2}$, Luca Morelli $^{3}$, Fabio Remondino $^{3}$
$^{1*}$ Università degli Studi di Palermo, Italy.
$^{2}$ The Chinese University of Hong Kong, China.
$^{3}$ Bruno Kessler Foundation, Italy.

*Corresponding author(s). E-mail(s): fabio.bellavia@unipa.it;
Contributing authors: ericzzj89@gmail.com; lmorelli@fbk.eu; remondino@fbk.eu;

Abstract

This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, with non-conforming correspondences discarded. Moreover, the underlying planar structural design provides an explicit map between local patches associated with the matches, enabling optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed for minimizing relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that the proposed non-deep learning, geometry-based approach achieves performance that is either superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still room for development in current image matching solutions in the considered research direction, which could be incorporated into novel deep image matching architectures in the future.

Keywords: image matching, keypoint refinement, planar homography, normalized cross-correlation, SIFT, SuperGlue, LoFTR

1 Introduction

1.1 Background

Image matching constitutes the backbone of most higher-level computer vision tasks and applications that require image registration to recover the 3D scene structure, including the camera poses. Image stitching [1, 2], Structure-from-Motion (SfM) [2-5], Simultaneous Localization and Mapping (SLAM) [2, 6], and, more recently, Neural Radiance Fields (NeRF) [7, 8] and Gaussian Splatting (GS) [9, 10] are probably the most common and widespread tasks that critically rely on image matching today.

The traditional image matching paradigm can be organized into a modular pipeline comprising keypoint extraction, feature description, and the actual correspondence matching [11]. This representation is neglected in modern deep end-to-end image matching methods, whose intrinsic design guarantees a better global optimization at the expense of the interpretability of the system. Although end-to-end deep networks represent in many respects the State-Of-The-Art (SOTA) in image matching, for instance in terms of the robustness and accuracy of the estimated poses in complex scenes [12-18], handcrafted image matching pipelines such as the popular Scale Invariant Feature Transform (SIFT) [19] are still employed today in practical applications [2, 4] due to their scalability, adaptability, and understandability. Furthermore, handcrafted or deep modules are still used to post-process the output of modern monolithic deep architectures and further filter the final matches, as notably done by the RAndom SAmple Consensus (RANSAC) [20] based on geometric constraints.

Image matching filters that prune correspondences are not limited to RANSAC, which in any of its forms [21-24] remains mandatory today as a final step, and have evolved from handcrafted methods [25-33] to deep ones [34-41]. In addition, methods inspired by the coarse-to-fine paradigm, such as the Local Feature TRansformer (LoFTR) [14], have more recently been developed to refine correspondences [42-44], which can be regarded as introducing a feedback system into the pipelined interpretation of the matching process.

1.2 Main contribution

As its main contribution, within the above background, this paper introduces a modular approach to filter and refine matches as a post-process before RANSAC. The key idea is that the motion flow within the images, i.e. the correspondences, can be approximated by local overlapping homography transformations [45]. This is the core concept of traditional image matching pipelines, where local patches are approximated by less constrained transformations going from a fine to a coarse scale level. Specifically, besides the similarity and affine transformations, the planar homography is introduced at a further level. The whole approach proceeds as follows:

The Multiple Overlapping Planes (MOP) module aggregates matches into soft clusters, each one related to a planar homography roughly representing the transformation of the included matches. On the one hand, matches unable to fit inside any cluster are discarded as outliers; on the other hand, the associated planar map can be used to normalize the patches of the related keypoint correspondences. As more matches are involved in the homography estimation, the patch normalization process becomes more robust than the canonical ones used for instance in SIFT. Furthermore, homographies better adapt the warp, as they are more general than similarity or affine transformations. MOP is implemented by iterating RANSAC to greedily estimate the next best plane according to a multi-model fitting strategy. The main difference from previous similar approaches [46, 47] lies in maintaining matches during the iterations as strong and weak inliers, according to two different reprojection errors, to guide the cluster partitioning.
The Middle Homography (MiHo) module additionally improves the patch normalization by further minimizing the relative patch distortion. In more detail, instead of reprojecting the patch of one image into the other image, taken as reference, according to the homography associated with the cluster, both corresponding patches are reprojected into a virtual plane chosen to lie in the middle of the base homography transformation. In this way the deformation is globally distributed over both patches, reducing the patch distortions with respect to the original images that are due to interpolation as well as to the planar approximation.
The Normalized Cross-Correlation (NCC) is employed following traditional template matching $[2, 48]$ to improve the keypoint localization in the reference middle homography virtual plane, using one patch as a template to be localized on the other patch. The relative shift adjustment found on the virtual plane is finally back-projected into the original images. In order to improve the process, the template patch is perturbed by small rotations and anisotropic scaling changes for a better search in the solution space.
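The greedy plane extraction described for MOP above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the plain DLT estimator without coordinate normalization, the `strong` and `weak` pixel thresholds, `min_size`, and the stopping rule are all assumptions made for illustration only. The sketch repeatedly runs RANSAC on the matches that are not yet strong inliers of any plane, and attaches weak inliers to every plane they agree with, so that clusters may overlap.

```python
import numpy as np

def fit_homography(p1, p2):
    """DLT least-squares homography H with p2 ~ H p1 (needs at least 4 matches)."""
    x, y, u, v = p1[:, 0], p1[:, 1], p2[:, 0], p2[:, 1]
    o, z = np.ones_like(x), np.zeros_like(x)
    rows_u = np.stack([x, y, o, z, z, z, -u * x, -u * y, -u], axis=1)
    rows_v = np.stack([z, z, z, x, y, o, -v * x, -v * y, -v], axis=1)
    # The right singular vector of the smallest singular value solves A h = 0.
    return np.linalg.svd(np.vstack([rows_u, rows_v]))[2][-1].reshape(3, 3)

def transfer_error(h, p1, p2):
    """One-way reprojection error |H p1 - p2| in pixels (inf where undefined)."""
    q = np.column_stack([p1, np.ones(len(p1))]) @ h.T
    with np.errstate(divide="ignore", invalid="ignore"):
        err = np.linalg.norm(q[:, :2] / q[:, 2:3] - p2, axis=1)
    return np.where(np.isfinite(err), err, np.inf)

def greedy_planes(p1, p2, strong=3.0, weak=9.0, min_size=8, trials=500, seed=0):
    """Return (homographies, soft cluster masks, keep mask) for matches p1 <-> p2.

    All thresholds are hypothetical defaults, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    free = np.ones(len(p1), dtype=bool)  # not yet a strong inlier of any plane
    planes, clusters = [], []
    while free.sum() >= min_size:
        pool = np.flatnonzero(free)
        best_h, best_count = None, 0
        for _ in range(trials):  # plain RANSAC on the still unexplained matches
            sample = rng.choice(pool, size=4, replace=False)
            h = fit_homography(p1[sample], p2[sample])
            count = int(np.sum(transfer_error(h, p1[pool], p2[pool]) < strong))
            if count > best_count:
                best_h, best_count = h, count
        if best_count < min_size:
            break  # no further plane is supported by enough matches
        core = pool[transfer_error(best_h, p1[pool], p2[pool]) < strong]
        h = fit_homography(p1[core], p2[core])  # refit on the strong inliers
        err = transfer_error(h, p1, p2)
        newly = free & (err < strong)
        if newly.sum() < min_size:
            break
        planes.append(h)
        clusters.append(err < weak)  # weak inliers may belong to several planes
        free &= ~newly  # only strong inliers leave the sampling pool
    keep = np.any(clusters, axis=0) if clusters else np.zeros(len(p1), dtype=bool)
    return planes, clusters, keep
```

Calling `greedy_planes(p1, p2)` on two `(N, 2)` arrays of matched keypoint coordinates returns one homography per detected virtual plane, and matches with `keep == False` would be the ones discarded as outliers. In practice the inner loop could be swapped for a library RANSAC such as OpenCV's `cv2.findHomography` with the `cv2.RANSAC` flag.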

MOP and MiHo are purely geometry-based, requiring only keypoint coordinates, while NCC relies on the image intensity in the local area of the patches. The early idea of MOP + MiHo + NCC can be found in [49]. Furthermore, the MOP concept of exploiting overlapping planes was introduced in [50], while the MiHo definition was drafted in [51]. The proposed approach is handcrafted and hence more general, as it does not require re-training to adapt to a particular kind of scene, and its behavior is more interpretable, as it is not hidden by deep architectures. As reported and discussed later in this paper, RANSAC benefits noticeably from MOP + MiHo as a pre-filter in most configurations and scene kinds, and in any case the matches are not deteriorated. Concerning MOP + MiHo + NCC instead, keypoint localization accuracy seems to strongly depend on the kind of extracted keypoint. In particular, for corner-like keypoints, which are predominant in detectors such as the Keypoint Network (Key.Net) [52] or SuperPoint [53], NCC greatly improves the keypoint position, while for blob-like keypoints, predominant for instance in SIFT, NCC can degrade the keypoint localization, probably due to the different nature of their surrounding areas. Nevertheless, the proposed approach indicates that there is still room for improvement in current image matching solutions, even the deep ones, by exploiting the analyzed methodology.
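The different inputs of the geometric and the intensity-based steps can be made concrete with a small NumPy sketch. Both functions are illustrative assumptions rather than the authors' code. `middle_homographies` uses one simple, assumed way of placing a virtual plane "in the middle", namely fitting two homographies that send the keypoints of each image onto the midpoints of the matched pairs, and it needs coordinates only. `ncc_shift` is a brute-force normalized cross-correlation search over integer shifts, and it needs only the intensities of two patches that are assumed to be already warped onto the common plane. The rotation and anisotropic scaling perturbations of the template, as well as any sub-pixel refinement, are omitted.

```python
import numpy as np

def fit_homography(src, dst):
    """DLT least-squares homography H with dst ~ H src (needs at least 4 matches)."""
    x, y, u, v = src[:, 0], src[:, 1], dst[:, 0], dst[:, 1]
    o, z = np.ones_like(x), np.zeros_like(x)
    rows_u = np.stack([x, y, o, z, z, z, -u * x, -u * y, -u], axis=1)
    rows_v = np.stack([z, z, z, x, y, o, -v * x, -v * y, -v], axis=1)
    return np.linalg.svd(np.vstack([rows_u, rows_v]))[2][-1].reshape(3, 3)

def middle_homographies(p1, p2):
    """Assumed construction: map both keypoint sets onto their midpoints."""
    mid = 0.5 * (p1 + p2)
    return fit_homography(p1, mid), fit_homography(p2, mid)

def ncc_shift(template, patch):
    """Best NCC score and integer offset (dy, dx) of template inside patch."""
    th, tw = template.shape
    t = template - template.mean()
    best = (-np.inf, 0, 0)
    for dy in range(patch.shape[0] - th + 1):
        for dx in range(patch.shape[1] - tw + 1):
            w = patch[dy:dy + th, dx:dx + tw]
            w = w - w.mean()
            denom = np.sqrt((t * t).sum() * (w * w).sum())
            score = (t * w).sum() / denom if denom > 0 else -1.0
            if score > best[0]:
                best = (score, dy, dx)
    return best
```

With this construction, a keypoint of the first image warped by the first homography and its match warped by the second one land on approximately the same point of the virtual plane, so the offset returned by `ncc_shift` can be read as a residual localization error to be back-projected into the original images.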

1.3 Additional contribution

As a further contribution, an extensive comparative evaluation is provided against eleven older and more recent SOTA deep and handcrafted image matching filtering methods on both non-planar indoor and outdoor standard benchmark datasets [11, 54, 55], as well as planar scenes [56, 57]. Each filtering approach, including several combinations of the three proposed modules, has been tested to post-process matches obtained with seven different image matching pipelines and end-to-end networks, also considering distinct RANSAC configurations. Moreover, unlike the mainstream evaluation that has been taken as the current standard [14], the proposed benchmark assumes that no camera intrinsics are available, reflecting a more general, realistic, and practical scenario. On this premise, the pose error accuracy is defined, analyzed, and employed with different protocols intended to highlight the specific complexity of pose estimation when working with fundamental and essential matrices [45], respectively. This analysis reveals the many-sided nature of the image matching problem and of its solutions.
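To make the role of the intrinsics concrete, the following NumPy sketch recalls the textbook relation between the two matrices, $E = K_2^{\top} F K_1$, together with two angular pose errors of the kind commonly used in this type of evaluation. The function names, the camera convention ($x_1 \sim K_1 X$, $x_2 \sim K_2 (R X + t)$), and the sign-agnostic translation error are assumptions for illustration and are not the error definitions adopted later in the paper.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def fundamental_from_pose(r, t, k1, k2):
    """F for cameras with x1 ~ K1 X and x2 ~ K2 (R X + t)."""
    e = skew(t) @ r  # essential matrix, defined on normalized coordinates
    return np.linalg.inv(k2).T @ e @ np.linalg.inv(k1)

def essential_from_fundamental(f, k1, k2):
    """Recover E from F; this step is only possible when intrinsics are known."""
    return k2.T @ f @ k1

def rotation_error_deg(r_est, r_gt):
    """Angle in degrees of the residual rotation between two rotation matrices."""
    c = (np.trace(r_est.T @ r_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error_deg(t_est, t_gt):
    """Angle in degrees between translation directions, ignoring scale and sign."""
    c = abs(t_est @ t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))
```

Without intrinsics only $F$ can be estimated from pixel matches, which has more degrees of freedom than $E$; this is one reason why the two settings can behave differently in an evaluation.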

1.4 Paper organization

The rest of the paper is organized as follows. The related work is presented in Sec. 2, the design of MOP, MiHo, and NCC is discussed in Sec. 3, and the evaluation setup and the results are analyzed in Sec. 4. Finally, conclusions and future work are outlined in Sec. 5.

1.5 Additional resources

Code and data are freely available online $^{1}$. Data also include high-resolution versions of the images, plots, and tables presented in this paper $^{2}$.

2.1 Disclaimer and further references

The following discussion is far from exhaustive, given more than three decades of active and still ongoing research and progress in the field. The reader can refer to [58-61] for a more comprehensive analysis than the one reported below, which covers only the methods related to and discussed in the rest of the paper.

2.2 Non-deep image matching

Traditionally, image matching is divided into three main steps: keypoint detection, description, and proper matching.

2.2.1 Keypoint detectors

Keypoint detection is aimed at finding salient local image regions whose feature descriptor vectors are at the same time invariant to image transformations, i.e. repeatable, and distinguishable from other non-corresponding descriptors, i.e. discriminant. Clearly, discriminability decreases as long

$^{1}$ https://github.com/fb82/MiHo
$^{2}$ https://github.com/fb82/MiHo/tree/main/data/paper
