License: CC BY-NC-ND 4.0
arXiv:2411.09484v2 [cs.CV] 15 Nov 2024

Image Matching Filtering and Refinement by Planes and Beyond

Fabio Bellavia [1] (fabio.bellavia@unipa.it), Zhenjun Zhao [2] (ericzzj89@gmail.com), Luca Morelli [3] (lmorelli@fbk.eu), Fabio Remondino [3] (remondino@fbk.eu)

[1] Università degli Studi di Palermo, Italy; [2] The Chinese University of Hong Kong, China; [3] Bruno Kessler Foundation, Italy
Abstract

This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, with non-conforming correspondences discarded. Moreover, the underlying planar structural design provides an explicit map between the local patches associated with the matches, enabling optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed to minimize relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches onto a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that our proposed non-deep learning, geometry-based approach achieves performances that are superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still development potential in actual image matching solutions along the considered research direction, which could in the future be incorporated into novel deep image matching architectures.

keywords:
image matching, keypoint refinement, planar homography, normalized cross-correlation, SIFT, SuperGlue, LoFTR

1 Introduction

1.1 Background

Image matching constitutes the backbone of most higher-level computer vision tasks and applications that require image registration to recover the 3D scene structure, including the camera poses. Image stitching [1, 2], Structure-from-Motion (SfM) [2, 3, 4, 5], Simultaneous Localization and Mapping (SLAM) [6, 2] and, more recently, Neural Radiance Fields (NeRF) [7, 8] and Gaussian Splatting (GS) [9, 10] are probably nowadays the most common and widespread major tasks critically relying on image matching.

The traditional image matching paradigm can be organized into a modular pipeline comprising keypoint extraction, feature description, and the proper correspondence matching [11]. This representation is being neglected in modern deep end-to-end image matching methods due to their intrinsic design, which guarantees a better global optimization at the expense of the interpretability of the system. Notwithstanding that end-to-end deep networks represent in many aspects the State-Of-The-Art (SOTA) in image matching, such as in terms of the robustness and accuracy of the estimated poses in complex scenes [12, 13, 14, 15, 16, 17, 18], handcrafted image matching pipelines such as the popular Scale Invariant Feature Transform (SIFT) [19] are still employed today in practical applications [4, 2] due to their scalability, adaptability, and understandability. Furthermore, handcrafted or deep modules remain present to post-process modern monolithic deep architectures, to further filter the final output matches, as done notably by the RAndom SAmpling Consensus (RANSAC) [20] based on geometric constraints.

Image matching filters to prune correspondences are not limited to RANSAC, which in any of its forms [21, 22, 23, 24] still remains mandatory nowadays as the final step, and have evolved from handcrafted methods [25, 26, 27, 28, 29, 30, 31, 32, 33] to deep ones [34, 35, 36, 37, 38, 39, 40, 41]. In addition, more recently, methods inspired by the coarse-to-fine paradigm, such as the Local Feature TRansformer (LoFTR) [14], have been developed to refine correspondences [42, 43, 44], which can be considered as the introduction of a feedback system in the pipelined interpretation of the matching process.

1.2 Main contribution

As a contribution within the former background, this paper introduces a modular approach to filter and refine matches as a post-process before RANSAC. The key idea is that the motion flow within the images, i.e. the correspondences, can be approximated by local overlapping homography transformations [45]. This is the core concept of traditional image matching pipelines, where local patches are approximated by less constrained transformations going from a fine to a coarse scale level. Specifically, besides the similarity and affine transformations, the planar homography is introduced at a further level. The whole approach proceeds as follows:

  • The Multiple Overlapping Planes (MOP) module aggregates matches into soft clusters, each one related to a planar homography roughly representing the transformation of the included matches. On one hand, matches unable to fit inside any cluster are discarded as outliers; on the other hand, the associated planar map can be used to normalize the patches of the related keypoint correspondences. As more matches are involved in the homography estimation, the patch normalization process becomes more robust than the canonical ones used, for instance, in SIFT. Furthermore, homographies better adapt the warp as they are more general than similarity or affine transformations. MOP is implemented by iterating RANSAC to greedily estimate the next best plane according to a multi-model fitting strategy. The main difference from previous similar approaches [46, 47] is in maintaining matches during the iterations in terms of strong and weak inliers, according to two different reprojection errors, to guide the cluster partitioning.


  • The Middle Homography (MiHo) module additionally improves the patch normalization by further minimizing the relative patch distortion. More in detail, instead of reprojecting the patch of one image into the other image, taken as reference, according to the homography associated with the cluster, both corresponding patches are reprojected into a virtual plane chosen to be in the middle of the base homography transformations, so that the deformation is globally distributed between the two patches, reducing the patch distortions with respect to the original images due both to interpolation and to the planar approximation.


  • The Normalized Cross-Correlation (NCC) is employed following traditional template matching [48, 2] to improve the keypoint localization in the reference middle homography virtual plane, using one patch as a template to be localized on the other patch. The relative shift adjustment found on the virtual plane is finally back-projected into the original images. In order to improve the process, the template patch is perturbed by small rotations and anisotropic scaling changes for a better search in the solution space; a minimal sketch of this template-matching step is given after this list.


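To make the template-matching step concrete, the following minimal sketch (pure NumPy, hypothetical helper names, no perturbation search) scans a template patch over a slightly larger search window and returns the integer shift maximizing the normalized cross-correlation; in the actual module the template is additionally perturbed by small rotations and anisotropic scalings, and the winning shift is back-projected into the original images.

    import numpy as np

    def ncc(a, b, eps=1e-8):
        # Normalized cross-correlation between two equally-sized patches.
        a = (a - a.mean()) / (a.std() + eps)
        b = (b - b.mean()) / (b.std() + eps)
        return float((a * b).mean())

    def refine_by_ncc(template, search, radius=3):
        # Slide `template` over `search` within a +/- radius pixel window and
        # return the (dy, dx) shift with the highest NCC score; `search` is
        # assumed to be larger than `template` by 2 * radius pixels per side.
        th, tw = template.shape
        best_score, best_shift = -np.inf, (0, 0)
        for dy in range(2 * radius + 1):
            for dx in range(2 * radius + 1):
                window = search[dy:dy + th, dx:dx + tw]
                score = ncc(template, window)
                if score > best_score:
                    best_score, best_shift = score, (dy - radius, dx - radius)
        return best_shift, best_score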

MOP and MiHo are purely geometric-based, requiring only keypoint coordinates, while NCC relies on the image intensity in the local area of the patches. The early idea of MOP+MiHo+NCC can be found in [49]. Furthermore, the MOP overlapping plane exploitation concept was introduced in [50], while the MiHo definition was drafted in [51]. The proposed approach is handcrafted and hence more general, as it does not require re-training to adapt to the particular kind of scene, and its behavior is more interpretable as it is not hidden by deep architectures. As reported and discussed later in this paper, RANSAC takes noticeable benefits from MOP+MiHo as a pre-filter in most configurations and scene kinds, and in any case it does not deteriorate the matches. Concerning MOP+MiHo+NCC instead, keypoint localization accuracy seems to strongly depend on the kind of extracted keypoint. In particular, for corner-like keypoints, which are predominant in detectors such as the Keypoint Network (Key.Net) [52] or SuperPoint [53], NCC greatly improves the keypoint position, while for blob-like keypoints, predominant for instance in SIFT, the NCC application can degrade the keypoint localization, probably due to the different nature of their surrounding areas. Nevertheless, the proposed approach indicates that there are still margins for improvement in current image matching solutions, even deep-based ones, by exploiting the analyzed methodology.

1.3 Additional contribution

As a further contribution, an extensive comparative evaluation is provided against eleven recent and less recent SOTA deep and handcrafted image matching filtering methods, on both non-planar indoor and outdoor standard benchmark datasets [54, 55, 11], as well as planar scenes [56, 57]. Each filtering approach, including several combinations of the three proposed modules, has been tested to post-process matches obtained with seven different image matching pipelines and end-to-end networks, also considering distinct RANSAC configurations. Moreover, unlike the mainstream evaluation taken as the current standard [14], the proposed benchmark assumes that no camera intrinsics are available, reflecting a more general, realistic, and practical scenario. On this premise, the pose error accuracy is defined, analyzed, and employed with different protocols intended to highlight the specific complexity of the pose estimation working with fundamental and essential matrices [45], respectively. This analysis reveals the many-sided nature of the image matching problem and of its solutions.

1.4 Paper organization

The rest of the paper is organized as follows. The related work is presented in Sec. 2, the design of MOP, MiHo, and NCC is discussed in Sec. 3, and the evaluation setup and the results are analyzed in Sec. 4. Finally, conclusions and future work are outlined in Sec. 5.

1.5 Additional resources

Code and data are freely available online at https://github.com/fb82/MiHo. Data also include high-resolution versions of the images, plots, and tables presented in this paper, available at https://github.com/fb82/MiHo/tree/main/data/paper.

2 Related work

2.1 Disclaimer and further references

The following discussion is far from being exhaustive due to more than three decades of active research and progress in the field, which is still ongoing. The reader can refer to [58, 59, 60, 61] for a more comprehensive analysis than the one reported below, which covers only the methods discussed in the rest of the paper.

2.2 Non-deep image matching

Traditionally, image matching is divided into three main steps: keypoint detection, description, and proper matching.

2.2.1 Keypoint detectors

Keypoint detection is aimed at finding salient local image regions whose feature descriptor vectors are at the same time invariant to image transformations, i.e. repeatable, and distinguishable from other non-corresponding descriptors, i.e. discriminant. Clearly, discriminability decreases as invariance increases, and vice versa. Keypoints are usually recognized as corner-like and blob-like ones, even if there is no clear separation between them outside ad-hoc ideal images. Corners are predominantly extracted by the Harris detector [62], while blobs by SIFT [19]. Several other handcrafted detectors exist besides the Harris and SIFT ones [63, 64, 65], but, excluding deep image matching, nowadays the former and their extensions are the most commonly adopted. The Harris and SIFT keypoint extractors rely respectively on the covariance matrix of the first-order derivatives and the Hessian matrix of the second-order derivatives. Blobs and corners can be considered analogues of each other for different derivative orders since, in the end, a keypoint in both cases corresponds to a local maximum response to a filter which is a function of the determinant of the respective kind of matrix. Blob keypoint centers tend to be pushed away from the edges which characterize the regions, more than corners, due to the usage of higher-order derivatives. More recently, the joint extraction and refinement of robust keypoints has been investigated by the Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring (GMM-IKRS) [66]. The idea is to extract image keypoints after several homography warps and cluster them after back-projection into the input images, so as to select the best clusters in terms of robustness and deviation, as another perspective on the Affine SIFT (ASIFT) [67] strategy.

2.2.2 Keypoint descriptors

Feature descriptors aim at embedding the keypoint characterization into numerical vectors. There has been long-standing research interest in this area, both in terms of handcrafted methods and non-deep machine learning ones, which contributed actively to the current development and the progress made by current SOTA deep image matching. Among non-deep keypoint descriptors, nowadays only the SIFT descriptor is practically still used and widespread, due to its balance among computational efficiency, scaling, and matching robustness. The SIFT descriptor is defined as the gradient orientation histogram of the local neighborhood of the keypoint, also referred to as patch. Many high-level computer vision applications, like COLMAP [4] in the case of SfM, are based on SIFT. Besides the SIFT keypoint detector & descriptor, the Oriented FAST and Rotated BRIEF (ORB) [68] is a pipeline still surviving today in the deep era and worth mentioning. ORB is the core of the popular ORB-SLAM [6] due to its computational efficiency, even if it is less robust than SIFT. Nevertheless, it seems bound to be replaced by SIFT with the increased computational capability of current hardware [69].

2.2.3 Patch normalization

Patch normalization actually takes place between the detection and description of the keypoints, and it is critical for the effective computation of the descriptor to be associated with the keypoint. Patch normalization is implicitly hidden inside both keypoint extraction and description, and its task roughly consists of aligning and registering patches before descriptor computation [19]. The key idea is that complex image deformations can be locally approximated by simpler transformations as long as the considered neighborhood is small. Under these assumptions, SIFT normalizes the patch by a similarity transformation, where the scale is inferred by the detector and the dominant rotation is computed from the gradient orientation histogram. Affine covariant patch normalization [70, 67, 71] was indeed the natural evolution, which can improve the normalization in case of high degrees of perspective distortion. Notice that, in the case of only small scene rotations, up-right patches computed neglecting orientation alignment can lead to better matches, since a strong geometric constraint limiting the solution space is imposed [72]. Concerning instead patch photometric consistency, affine normalization, i.e. by mean and standard deviation, is usually the preferred choice, being sufficiently robust in the general case [19].
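As a toy illustration of the similarity-based normalization described above (not the code of any specific detector), the sketch below warps the neighborhood of a keypoint into a canonical up-right patch given its position, scale (support radius in pixels), and dominant orientation, using OpenCV's warpAffine; photometric affine normalization by mean and standard deviation is applied at the end.

    import cv2
    import numpy as np

    def normalize_patch(img, kp_xy, scale, angle_deg, patch_size=32, eps=1e-8):
        # Similarity normalization: map a neighborhood of radius `scale` pixels,
        # rotated by the dominant orientation, to a canonical square patch.
        c = patch_size / 2.0
        s = patch_size / (2.0 * scale)
        a = np.deg2rad(angle_deg)
        # 2x3 affine matrix mapping image coordinates to patch coordinates:
        # p = s * R(-angle) * (x - kp) + c
        R = s * np.array([[np.cos(a), np.sin(a)], [-np.sin(a), np.cos(a)]])
        t = c - R @ np.asarray(kp_xy, dtype=float)
        M = np.hstack([R, t[:, None]])
        patch = cv2.warpAffine(img, M, (patch_size, patch_size), flags=cv2.INTER_LINEAR)
        return (patch - patch.mean()) / (patch.std() + eps)  # photometric normalization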

2.2.4 Descriptor similarity and global geometric constraints

The last step is the proper matching, which mainly pairs keypoints by considering the distance similarity of the associated descriptors. Nearest Neighbor (NN) and Nearest Neighbor Ratio (NNR) [19] are the common and simple approaches, even if additional extensions have been developed [73, 32]. For many tasks descriptor similarity alone does not provide satisfactory results, which can be achieved instead by including a further filtering post-process constraining matches geometrically. RANSAC provides an effective method to discard outliers under the more general epipolar geometry constraints of the scene or in the case of planar homographies. The standard RANSAC continues to be extended in several ways [21, 22]. Worth mentioning are the Degenerate SAmpling Consensus (DegenSAC) [24] to avoid degenerate configurations, and the MArGinalized SAmpling Consensus (MAGSAC) [23] for defining a robust inlier threshold, which remains the most critical parameter.
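As a reference for the NN and NNR strategies just mentioned, a brute-force NumPy sketch is given below: each descriptor of the first image is paired with its nearest neighbor in the second, and the match is kept only when the ratio between the first and second nearest distances is below a threshold (0.8 is used here purely as an illustrative value).

    import numpy as np

    def nnr_match(desc1, desc2, ratio=0.8):
        # Brute-force nearest-neighbor matching with the ratio test.
        # desc1: (N1, D) array, desc2: (N2, D) array with N2 >= 2.
        d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
        nn = np.argsort(d, axis=1)[:, :2]                   # two nearest neighbors
        rows = np.arange(len(desc1))
        best, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
        keep = best < ratio * second                        # ratio test
        return np.stack([rows[keep], nn[keep, 0]], axis=1)  # (index in 1, index in 2)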

2.2.5 Local geometric constraints

RANSAC exploits global geometric constraints. A more general and relaxed geometric assumption than RANSAC is to consider only local neighborhood consistency across the images, as done in Grid-based Motion Statistics (GMS) [26] using square neighborhoods; circular [25] and affine-based [27, 28] neighborhoods or Delaunay triangulation [32, 28] have also been employed. In most cases, these approaches require an initial set of robust seed matches for initialization, whose selection may become critical [74], while other approaches explicitly exploit descriptor similarities to rank matches [32]. In this respect, the vanilla RANSAC and GMS are pure geometric image filters. Notice that descriptor similarities have also been exploited successfully within RANSAC [75]. Closely connected to local filters, another class of methods is designed instead to estimate and interpolate the motion field upon which to check match consistency [29, 31, 30]. Local neighborhood consistency can also be framed into RANSAC, as done by Adaptive Locally-Affine Matching (AdaLAM) [33], which runs multiple local affine RANSACs inside the circular neighborhoods of seed matches. The AdaLAM design can be related to GroupSAC [76], which draws hypothesis model sample correspondences from different clusters, and to the more general multi-model fitting problem, for which several solutions have been investigated [77, 78]. Among these approaches, the recent Consensus Clustering (CC) [79] has been developed for the specific case of multiple planar homographies.

2.3 Deep image matching

2.3.1 Standalone pipeline modules

The deep learning evolution has provided a boost to image matching as well as to other computer vision research fields, thanks to the evolution of deep architectures, hardware and software computational capability and design, and the huge amount of datasets now available for training. Hybrid pipelines arose first, replacing handcrafted descriptors with deep ones, as the feature descriptor module is the most natural and immediate to be replaced in the pipeline evolution. SOTA in this sense is represented by the Hard Network (HardNet) [80] and similar networks [81, 82] trained with contrastive learning and hard mining. Concerning keypoint extraction, better results with respect to the handcrafted counterparts were achieved by driving the network to mimic handcrafted detector design, as for Key.Net [52]. Deep learning has been successfully applied to design standalone patch normalization modules to estimate the reference patch orientation [83, 84] or more complete affine transformations, as in the Affine Network [84]. The mentioned deep modules have been employed successfully in many hybrid matching pipelines, especially in conjunction with SIFT [11] or Harris corners [85].

2.4 Joint detector & descriptor networks

A relevant characteristic of deep image matching, which made it able to exceed handcrafted image matching pipelines, was gained by allowing the detector & descriptor to be optimized concurrently. SuperPoint [53] integrates both of them in a single network and does not require a separate yet consecutive training of the two as in [86]. A further design feature that enabled SuperPoint to become a reference in the field is the use of homographic adaptation, which consists of using planar homographies to generate corresponding images during training, allowing self-supervised training. Further solutions were proposed by architectures similar to SuperPoint, as in [87]. Moreover, the Accurate and Lightweight Keypoint Detection and Descriptor Extraction (ALIKE) [88] implements a differentiable Non-Maximum Suppression (NMS) for the keypoint selection on the score map, and the DIScrete Keypoints (DISK) [89] employs reinforcement learning. The ALIKE design is further improved by A LIghter Keypoint and descriptor Extraction network with Deformable transformation (ALIKED) [13], using deformable local features for each sparse keypoint instead of dense descriptor maps, resulting in a saving of computational budget. Alternative rearrangements of the pipeline structure, using the same sub-network for both detector & descriptor (detect-and-describe) [90], or extracting descriptors and then keypoints (describe-then-detect) [91] instead of the common approach (detect-then-describe), have also been investigated. More recently, the idea of decoupling the two blocks by formulating the matching problem in terms of 3D tracks in large-scale SfM has been applied in Detect, Don't Describe–Describe, Don't Detect (DeDoDe) [92] and DeDoDe v2 [93] with interesting results.

2.4.1 Full end-to-end architectures

Full end-to-end deep image matching architectures arose with SuperGlue [94], which incorporates a proper matching network exploiting self and cross attention through transformers to jointly encode keypoint positions and visual data into a graph structure, and a differentiable form of the Sinkhorn algorithm to provide the final assignments. SuperGlue represented a breakthrough in the area and has been extended by the recent LightGlue [12], improving its efficiency, accuracy, and training process. Detector-free end-to-end image matching architectures have also been proposed, which avoid explicitly computing a sparse keypoint map, considering instead a semi-dense strategy. The Local Feature TRansformer (LoFTR) [14] is a competitive realization of these matching architectures, which deploys a coarse-to-fine strategy using self and cross attention layers on dense feature maps and has been employed as a base for further design extensions [15, 16]. Notice that end-to-end image matching networks are not fully rotation invariant and can handle only relatively small scene rotations, except for specific architectures designed with this purpose [95, 96, 97]. The coarse-to-fine strategy following the LoFTR design has also been applied to hybrid canonical pipelines with only deep patch and descriptor modules, leading to comparable matching results in challenging scenarios at an increased computational cost [50]. More recently, dense image matching networks that directly estimate the scene as dense point maps [17] or employ Gaussian processes [18] achieve SOTA results in many benchmarks, but are more computationally expensive than sparse or semi-dense methods.

2.4.2 Outlier rejection with geometric constraints

The first effective deep matching filter module was achieved with the introduction of context normalization [34], i.e. the normalization by mean and standard deviation, which guarantees the preservation of permutation equivariance. The Attentive Context Network (ACNe) [35] further includes local and global attention to exclude outliers from context normalization, while the Order-Aware Network (OANet) [36] adds additional layers to learn how to cluster unordered sets of correspondences to incorporate the data context and the spatial correlation. The Consensus Learning Network (CLNet) [37] prunes matches by filtering data according to local-to-global dynamic neighborhood consensus graphs in consecutive rounds. The Neighbor Consistency Mining Network (NCMNet) [38] improves the CLNet architecture by considering different neighborhoods in the feature space, the keypoint coordinate space, and a further global space that considers both previous ones altogether with cross-attention. More recently, the Multiple Sparse Semantics Dynamic Graph Network (MS2DG-Net) [39] designs the filtering in terms of a neighborhood graph through transformers [98] to capture semantically similar structures in the local topology among correspondences. Unlike previous deep approaches, ConvMatch [40] builds a smooth motion field by making use of Convolutional Neural Network (CNN) layers to verify the consistency of the matches. DeMatch [41] refines the ConvMatch motion field estimation for a better accommodation of scene disparities by decomposing the rough motion field into several sub-fields. Finally, notice that deep differentiable RANSAC modules have been investigated as well [99, 100].

2.4.3 Keypoint refinement

The coarse-to-fine strategy trend has pushed the research of solutions focusing on the refinement of keypoint localization as a standalone module after the coarse assignment of the matches. Worth mentioning is Pixel-Perfect SfM [101], which learns to refine multi-view tracks based on their photometric appearance and the SfM structure, while Patch2Pix [42] refines and filters patches exploiting dense image features extracted by the network backbone. Similar to Patch2Pix, Keypt2Subpx [43] is a recent lightweight module to refine matches only, requiring as input the descriptors of the local heat map of the related correspondences besides the keypoint positions. Finally, the Filtering and Calibrating Graph Neural Network (FC-GNN) [44] is an attention-based Graph Neural Network (GNN) jointly leveraging contextual and local information to both filter and refine matches. Unlike Patch2Pix and Keypt2Subpx, FC-GNN does not require feature maps or descriptor confidences as input, as also holds for the proposed MOP+MiHo+NCC. Similarly to FC-GNN, the proposed approach uses both global context and local information to refine matches, to warp and to check photometric consistency respectively, but differently from FC-GNN it currently employs only global contextual data for filtering, since it does not implement any feedback system to discard matches based on patch similarity after the alignment refinement.

3 Proposed approach

3.1 Multiple Overlapping Planes (MOP)

3.1.1 Preliminaries

MOP assumes that the motion flow within the images can be piecewise approximated by local planar homographies.

Representing, with an abuse of notation, the reprojection of a point $\mathbf{x}\in\mathbb{R}^{2}$ through a planar homography $\mathrm{H}\in\mathbb{R}^{3\times 3}$, with $\mathrm{H}$ non-singular, as $\mathrm{H}\mathbf{x}$, a match $m=(\mathbf{x}_{1},\mathbf{x}_{2})$ between two keypoints $\mathbf{x}_{1}\in I_{1}$ and $\mathbf{x}_{2}\in I_{2}$ is considered an inlier based on the maximum reprojection error

$$\varepsilon_{\mathrm{H}}(m)=\max\left(\left\|\mathbf{x}_{2}-\mathrm{H}\mathbf{x}_{1}\right\|,\;\left\|\mathbf{x}_{1}-\mathrm{H}^{-1}\mathbf{x}_{2}\right\|\right) \quad (1)$$

For a set of matches $\mathcal{M}=\{m\}$, the inlier subset is provided according to a threshold $t$ as

$$\mathcal{J}^{\mathcal{M}}_{\mathrm{H},t}=\left\{m\in\mathcal{M}:\varepsilon_{\mathrm{H}}(m)\leq t\right\} \quad (2)$$

A proper scene planar transformation must be quasi-affine, to preserve the convex hull [45] with respect to the minimum sampled model generating $\mathrm{H}$ in RANSAC, so the inlier set effectively employed is

$$\mathcal{I}^{\mathcal{M}}_{\mathrm{H},t}=\mathcal{J}^{\mathcal{M}}_{\mathrm{H},t}\cap\mathcal{Q}^{\mathcal{M}}_{\mathrm{H}} \quad (3)$$

where $\mathcal{Q}^{\mathcal{M}}_{\mathrm{H}}$ is the set of matches in $\mathcal{M}$ satisfying the quasi-affinity property with respect to the RANSAC minimum model of $\mathrm{H}$, described later in Eq. 12.
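A minimal NumPy sketch of Eqs. 1–2 is given below for reference (helper names are illustrative and do not correspond to the released code); the quasi-affinity term $\mathcal{Q}^{\mathcal{M}}_{\mathrm{H}}$ of Eq. 3 is sketched later, after the checks of Sec. 3.1.3.

    import numpy as np

    def reproj(H, pts):
        # Apply the homography H (3x3) to an (N, 2) array of points.
        p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return p[:, :2] / p[:, 2:3]

    def max_reproj_error(H, x1, x2):
        # Maximum symmetric reprojection error of Eq. 1 for (N, 2) arrays x1, x2.
        e12 = np.linalg.norm(x2 - reproj(H, x1), axis=1)
        e21 = np.linalg.norm(x1 - reproj(np.linalg.inv(H), x2), axis=1)
        return np.maximum(e12, e21)

    def inlier_mask(H, x1, x2, t):
        # Boolean mask of the inlier set J of Eq. 2 for the threshold t (pixels).
        return max_reproj_error(H, x1, x2) <= t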

3.1.2 Main loop

MOP iterates through a loop where a new homography is discovered and the match set is reduced for the next iteration. Iterations halt when the counter $c_{f}$ of the sequential failures at the end of a cycle reaches the maximum allowable value $c_{f}^{\star}=3$. If not incremented at the end of a cycle, $c_{f}$ is reset to 0. As output, MOP returns the set $\mathcal{H}^{\star}$ of found homographies $\mathrm{H}$, which covers the image visual flow.

Starting with $c_{f}=0$, $\mathcal{H}^{0}=\emptyset$, and the initial set of input matches $\mathcal{M}^{0}$ provided by the image matching pipeline, MOP at each iteration $k$ looks for the best planar homography $\mathrm{H}^{k}$ compatible with the current match set $\mathcal{M}^{k}$ according to a relaxed inlier threshold $t_{l}=15$ px. When the number of inliers provided by $t_{l}$ is less than the minimum required amount, set to $n=12$, i.e.

$$\left|\mathcal{I}^{\mathcal{M}^{k}}_{\mathrm{H}^{k},t_{l}}\right|<n \quad (4)$$

the current homography $\mathrm{H}^{k}$ is discarded and the failure counter $c_{f}$ is incremented without updating the iteration counter $k$, the next match set $\mathcal{M}^{k+1}$, and the output list $\mathcal{H}^{k+1}$. Otherwise, MOP removes the inlier matches of $\mathrm{H}^{k}$ for a threshold $t^{k}$ for the next iteration $k+1$

$$\mathcal{M}^{k+1}=\mathcal{M}^{k}\setminus\mathcal{I}^{k} \quad (5)$$

where $\mathcal{I}^{k}=\mathcal{I}^{\mathcal{M}^{k}}_{\mathrm{H}^{k},t^{k}}$, and updates the output list

$$\mathcal{H}^{k+1}=\mathcal{H}^{k}\cup\{\mathrm{H}^{k}\} \quad (6)$$

The threshold $t^{k}$ adapts to the evolution of the homography search and is equal to a strict threshold $t_{h}=t_{l}/2$ in case the cardinality of the inlier set using $t_{h}$ is greater than half of $n$, i.e. in the worst case no more than half of the added matches should belong to previously found overlapping planes. If this condition does not hold, $t^{k}$ turns into the relaxed threshold $t_{l}$ and the failure counter $c_{f}$ is incremented

$$t^{k}=\begin{cases}t_{h}&\text{if}\quad\left|\mathcal{I}^{\mathcal{M}^{k}}_{\mathrm{H}^{k},t_{h}}\right|>\frac{n}{2}\\ t_{l}&\text{otherwise}\end{cases} \quad (7)$$

Using the relaxed threshold $t_{l}$ to select the best homography $\mathrm{H}^{k}$ is required since the real motion flow is only locally approximated by planes, and removing only strong inliers by $t_{h}$ for the next iteration $k+1$ guarantees smooth changes within overlapping planes and more robustness to noise. Nevertheless, in case of slow convergence, or when the homography search gets stuck around a wide overlapping or noisy planar configuration, limiting the search to matches excluded by previous planes can provide a way out. The final set of homographies returned is

$$\mathcal{H}^{\star}=\mathcal{H}^{k^{\star}} \quad (8)$$

found by the last iteration $k^{\star}$, when the failure counter reaches $c_{f}^{\star}$.
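The main loop above can be summarized by the simplified sketch below (Eqs. 4–8): `ransac_homography` stands for the modified RANSAC of Sec. 3.1.3 and is only assumed here, `matches` is an (N, 4) array of corresponding coordinates, `inlier_mask` is the helper sketched after Eq. 3, and both the quasi-affinity intersection and the sub-optimal homography buffer are omitted for brevity.

    def mop(matches, t_l=15.0, n=12, max_failures=3):
        # Greedy extraction of multiple overlapping planes (Eqs. 4-8), simplified.
        t_h = t_l / 2.0
        H_list, failures = [], 0
        while failures < max_failures:
            H = ransac_homography(matches, t_l)        # best plane on current set
            x1, x2 = matches[:, :2], matches[:, 2:]
            in_l = inlier_mask(H, x1, x2, t_l)
            if in_l.sum() < n:                         # Eq. 4: too few inliers
                failures += 1
                continue
            in_h = inlier_mask(H, x1, x2, t_h)
            if in_h.sum() > n / 2:                     # Eq. 7: strict threshold case
                matches, failures = matches[~in_h], 0  # Eq. 5 with t_h
            else:                                      # relaxed threshold, count a failure
                matches, failures = matches[~in_l], failures + 1
            H_list.append(H)                           # Eq. 6
        return H_list                                  # Eq. 8: final homography set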

3.1.3 Inside RANSAC

Within each MOP iteration $k$, a homography $\mathrm{H}^{k}$ is extracted by RANSAC on the match set $\mathcal{M}^{k}$. The vanilla RANSAC is modified according to the following heuristics to improve robustness and avoid degenerate cases.

A minimum number $c_{\min}=50$ of RANSAC loop iterations is required, besides the maximum number of iterations $c_{\max}=2000$. In addition, a minimum hypothesis sampled model

$$\mathcal{S}=\left\{\left(\mathbf{s}_{1i},\mathbf{s}_{2i}\right)\right\}\subseteq\mathcal{M}^{k},\quad 1\leq i\leq 4 \quad (9)$$

is discarded if the distance between keypoints in one of the images is less than the relaxed threshold $t_{l}$, i.e. it must hold that

$$\min_{\substack{1\leq i,j\leq 4\\ i\neq j\\ w=1,2}}\left\|\mathbf{s}_{wi}-\mathbf{s}_{wj}\right\|\geq t_{l} \quad (10)$$

Furthermore, with $\mathrm{H}$ being the corresponding homography derived by the normalized 8-point algorithm [45], the smallest singular value found by solving through SVD the $8\times 9$ homogeneous system must be relatively greater than zero for a stable solution not close to degeneracy. According to these observations, the minimum singular value is constrained to be greater than 0.05. Note that an absolute threshold is feasible since data are normalized before the SVD.

Next, matches in $\mathcal{S}$ must satisfy quasi-affinity with respect to the derived homography $\mathrm{H}$. This can be verified by checking that the sign of the last coordinate of the reprojection $\mathrm{H}\mathbf{s}_{1i}$ in non-normalized homogeneous coordinates is the same for $1\leq i\leq 4$, i.e. for all four keypoints in the first image. This can be expressed as

$$[\mathrm{H}\mathbf{s}_{1i}]_{3}\cdot[\mathrm{H}\mathbf{s}_{1j}]_{3}=1,\quad\forall\, 1\leq i,j\leq 4 \quad (11)$$

where $[\mathbf{x}]_{3}$ denotes the third and last vector element of the point $\mathbf{x}$ in non-normalized homogeneous coordinates. Following the above discussion, the quasi-affine set of matches $\mathcal{Q}^{\mathcal{M}}_{\mathrm{H}}$ in Eq. 3 is formally defined as

$$\mathcal{Q}^{\mathcal{M}}_{\mathrm{H}}=\left\{m\in\mathcal{M}:[\mathrm{H}\mathbf{x}_{1}]_{3}\cdot[\mathrm{H}\mathbf{s}_{11}]_{3}=1\right\} \quad (12)$$

For better numerical stability, the analogous checks are also executed in the reverse direction through $\mathrm{H}^{-1}_{\mathcal{S}}$.
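The per-sample validity checks just described (Eq. 10 and the quasi-affinity sign test of Eq. 11) may be sketched as follows; the minimum-singular-value test on the normalized DLT system is only hinted at in a comment, since it depends on how the homography is actually estimated.

    import numpy as np

    def sample_spread_ok(s1, s2, t_l=15.0):
        # Eq. 10: all pairwise keypoint distances in each image must be >= t_l.
        # s1, s2: (4, 2) arrays with the sampled keypoints of the two images.
        for s in (s1, s2):
            d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
            if d[np.triu_indices(4, k=1)].min() < t_l:
                return False
        return True

    def quasi_affine_ok(H, s1):
        # Eq. 11 as a sign test: the last homogeneous coordinate of the
        # reprojected sample keypoints must share the same sign.
        # (An analogous check is run with H^-1 on the second image keypoints,
        # and the smallest singular value of the DLT system should exceed 0.05.)
        w = (np.hstack([s1, np.ones((len(s1), 1))]) @ H.T)[:, 2]
        return bool(np.all(w > 0) or np.all(w < 0))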

Lastly, to speed up the RANSAC search, a buffer $\mathcal{B}=\{\tilde{\mathrm{H}}_{i}\}$, with $1\leq i\leq z$, is globally maintained in order to contain the top $z=5$ discarded sub-optimal homographies encountered within a RANSAC run. For the current match set $\mathcal{M}^{k}$, if $\tilde{\mathrm{H}}_{0}$ is the current best RANSAC model maximizing the number of inliers, the $\tilde{\mathrm{H}}_{i}\in\mathcal{B}$ are ordered by the number $v_{i}$ of inliers excluding those compatible with the previous homographies in the buffer, i.e. it must hold that

$$v_{i}\geq v_{i+1} \quad (13)$$

where

$$v_{i}=\left|\mathcal{I}^{\mathcal{M}^{k}}_{\tilde{\mathrm{H}}_{i}}\setminus\left(\bigcup_{0\leq j<i}\mathcal{I}^{\mathcal{M}^{k}}_{\tilde{\mathrm{H}}_{j}}\right)\right| \quad (14)$$

The buffer \mathcal{B}caligraphic_B can be updated when the number of inliers of the homography associated with the current sampled hypothesis is greater than the minimum viv_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, by inserting in the case the new homography, removing the low-scoring one and resorting to the buffer. Moreover, at the beginning of a RANSAC run inside a MOP iteration, the minimum hypothesis model sampling sets 𝒮\mathcal{S}caligraphic_S corresponding to homographies in \mathcal{B}caligraphic_B are evaluated before proceeding with true random samples to provide a bootstrap for the solution search. This also guarantees a global sequential inspection of the overall solution search space within MOP.
缓冲区 \mathcal{B}caligraphic_B 可以在与当前采样假设相关联的单应性的内点数量大于最小值 visubscriptv_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT 时进行更新,具体操作是插入新的单应性,移除得分较低的单应性,并重新调整缓冲区。此外,在 MOP 迭代内部的 RANSAC 运行开始时,会先评估与 \mathcal{B}caligraphic_B 中单应性对应的假设模型采样集 𝒮\mathcal{S}caligraphic_S ,然后再进行真正的随机采样,以此为解搜索提供引导。这同时也确保了在 MOP 内对整个解搜索空间进行全局顺序检查。
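A minimal sketch of this buffer bookkeeping, assuming each entry stores a homography together with the set of indices of its inliers on $\mathcal{M}^{k}$, might look as follows; the greedy re-ordering mirrors the marginal inlier counts $v_{i}$ of Eq. 14, but names and data layout are illustrative only.

```python
def update_buffer(buffer, H_new, inliers_new, z=5):
    """buffer: list of (H, inlier_index_set) pairs; inliers_new: set of
    match indices supporting the newly sampled homography H_new."""
    entries = buffer + [(H_new, inliers_new)]
    ordered, covered = [], set()
    # re-sort entries by the inliers they add w.r.t. the previous ones (Eq. 14)
    while entries:
        best = max(entries, key=lambda e: len(e[1] - covered))
        entries.remove(best)
        ordered.append(best)
        covered |= best[1]
    return ordered[:z]          # keep only the top-z sub-optimal homographies
```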

3.1.4 Homography assignment

After the final set of homographies $\mathcal{H}^{\star}$ is computed, the filtered match set $\mathcal{M}^{\star}$ is obtained by removing all matches $m\in\mathcal{M}$ for which there is no homography $\mathrm{H}\in\mathcal{H}^{\star}$ having them as inlier, i.e.

$$\mathcal{M}^{\star}=\left\{m\in\mathcal{M}:\exists\,\mathrm{H}\in\mathcal{H}^{\star}\;m\in\mathcal{I}^{\mathcal{M}}_{\mathrm{H},t_{l}}\right\}\qquad(15)$$

Next, a homography must be associated with each surviving match $m\in\mathcal{M}^{\star}$, which will be used for patch normalization.

A possible choice is to assign the homography $\mathrm{H}_{m}^{\varepsilon}$ which gives the minimum reprojection error, i.e.

$$\mathrm{H}_{m}^{\varepsilon}=\operatorname*{argmin}_{\mathrm{H}\in\mathcal{H}^{\star}}\varepsilon_{\mathrm{H}}(m)\qquad(16)$$

However, this choice is likely to select homographies corresponding to narrow planes of the image with small consensus sets, which yields a low reprojection error but a more unstable and noise-prone assignment.

Another choice is to assign to the match $m$ the homography $\mathrm{H}_{m}^{\mathcal{I}}$ compatible with $m$ that has the widest consensus set, i.e.

$$\mathrm{H}_{m}^{\mathcal{I}}=\operatorname*{argmax}_{\mathrm{H}\in\mathcal{H}^{\star},\,m\in\mathcal{I}^{\mathcal{M}^{\star}}_{\mathrm{H},t_{l}}}\left|\mathcal{I}^{\mathcal{M}^{\star}}_{\mathrm{H},t_{l}}\right|\qquad(17)$$

which favors more stable and robust homographies but can lead to incorrect patches when the corresponding reprojection error is not among the lowest.

The homography $\mathrm{H}_{m}$ actually chosen for the match association provides a compromise between the two above criteria. Specifically, for a match $m$, defining $q_{m}$ as the median inlier count of its top 5 compatible homographies ranked by their number of inliers for the threshold $t_{l}$, $\mathrm{H}_{m}$ is the compatible homography with the minimum reprojection error among those with an inlier number at least equal to $q_{m}$

$$\mathrm{H}_{m}=\operatorname*{argmin}_{\mathrm{H}\in\mathcal{H}^{\star},\,\left|\mathcal{I}^{\mathcal{M}^{\star}}_{\mathrm{H},t_{l}}\right|\geq q_{m}}\varepsilon_{\mathrm{H}}(m)\qquad(18)$$
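The filtering of Eq. 15 and the assignment rule of Eq. 18 can be sketched together as follows, assuming that for every match the list of compatible final homographies, each with its inlier count and per-match reprojection error, has already been computed; the function name and tuple layout are illustrative.

```python
import numpy as np

def assign_homography(compatible, top=5):
    """compatible: list of (H, n_inliers, reproj_err) for the homographies of
    the final set H* having the match m as inlier; an empty list means the
    match is discarded by the filtering of Eq. 15."""
    if not compatible:
        return None                                   # match removed from M*
    # median inlier count q_m of the top-5 compatible homographies
    counts = sorted((n for _, n, _ in compatible), reverse=True)[:top]
    q_m = np.median(counts)
    # minimum reprojection error among those with at least q_m inliers (Eq. 18)
    stable = [(H, n, e) for H, n, e in compatible if n >= q_m]
    return min(stable, key=lambda t: t[2])[0]
```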

Figure 1 shows an example of the achieved solution directly in combination with the MiHo homography estimation described hereafter in Sec. 3.2, since MOP and MOP+MiHo do not present appreciable differences in the visual cluster representation. Keypoints belonging to discarded matches are marked with black diamonds, while the clusters highlighted by the other combinations of markers and colors indicate the resulting filtered matches $m\in\mathcal{M}^{\star}$ with the selected planar homography $\mathrm{H}_{m}$. Notice that clusters are not designed to highlight scene planes but points that move according to the same homography transformation within the image pair.

Figure 1: MOP+MiHo clustering and filtered matches for some image pair examples. Each combination of markers and colors is associated to a unique virtual planar homography as described in Secs. 3.1-3.2, while discarded matches are indicated by black diamonds. The matching pipeline employed is the one relying on Key.Net described in Sec. 4.1.1. The images of the middle and bottom rows belong to MegaDepth and ScanNet respectively, described later in Sec. 4.1.3. Best viewed in color and zoomed in.

3.2 Middle Homography (MiHo)

3.2.1 Rationale

The $\mathrm{H}_{m}$ homography estimated by Eq. 18 for the match $m=(\mathbf{x}_{1},\mathbf{x}_{2})$ allows reprojecting the local patch neighborhood of $\mathbf{x}_{1}\in I_{1}$ onto the corresponding one centered in $\mathbf{x}_{2}\in I_{2}$. Nevertheless, $\mathrm{H}_{m}$ only approximates the true transformation between the images, and this approximation becomes less accurate as the distance from the keypoint center increases. This implies that patch alignment could become invalid for wider, and theoretically more discriminant, patches.

The idea behind MiHo is to break the homography $\mathrm{H}_{m}$ in the middle into two homographies $\mathrm{H}_{m_{1}}$ and $\mathrm{H}_{m_{2}}$ such that $\mathrm{H}_{m}=\mathrm{H}_{m_{2}}\mathrm{H}_{m_{1}}$, so that the patch on $\mathbf{x}_{1}$ gets deformed by $\mathrm{H}_{m_{1}}$ less than by $\mathrm{H}_{m}$, and likewise the patch on $\mathbf{x}_{2}$ gets less deformed by $\mathrm{H}_{m_{2}}^{-1}$ than by $\mathrm{H}_{m}^{-1}$. This means that, visually, a unit-area square on the reprojected patch should remain almost similar to the original one. As this must hold for both images, the deformation error must be distributed almost equally between the two homographies $\mathrm{H}_{m_{1}}$ and $\mathrm{H}_{m_{2}}$. Moreover, since interpolation degrades with up-sampling, MiHo aims to balance the down-sampling of the patch at the finer resolution and the up-sampling of the patch at the coarser resolution.

3.2.2 Implementation

While a strict analytical formulation of the above problem leading to a practical satisfying solution is not easy to derive, the heuristic found in [51] can be exploited to modify RANSAC within MOP to account for the required constraints. Specifically, each match $m$ is replaced by two corresponding matches $m_{1}=(\mathbf{x}_{1},\mathbf{m})$ and $m_{2}=(\mathbf{m},\mathbf{x}_{2})$, where $\mathbf{m}$ is the midpoint between the two keypoints

$$\mathbf{m}=\frac{\mathbf{x}_{1}+\mathbf{x}_{2}}{2}\qquad(19)$$

Hence the RANSAC input match set $\mathcal{M}^{k}=\{m\}$ for the $k$-th MOP iteration is split into the two match sets $\mathcal{M}^{k}_{1}=\{m_{1}\}$ and $\mathcal{M}^{k}_{2}=\{m_{2}\}$. Within a RANSAC iteration, a sample $\mathcal{S}$ defined by Eq. 9 is likewise split into

$$\begin{aligned}\mathcal{S}_{1}&=\left\{\left(\mathbf{s}_{1i},\tfrac{\mathbf{s}_{1i}+\mathbf{s}_{2i}}{2}\right)\right\}\subseteq\mathcal{M}^{k}_{1}\\ \mathcal{S}_{2}&=\left\{\left(\tfrac{\mathbf{s}_{1i}+\mathbf{s}_{2i}}{2},\mathbf{s}_{2i}\right)\right\}\subseteq\mathcal{M}^{k}_{2},\quad 1\leq i\leq 4\end{aligned}\qquad(20)$$

leading to two concurrent homographies $\mathrm{H}_{1}$ and $\mathrm{H}_{2}$ to be verified simultaneously, with the inlier set defined in an analogous way to Eq. 3

$$\mathcal{I}^{\mathcal{M}^{k}}_{(\mathrm{H}_{1},\mathrm{H}_{2}),t}=\left(\mathcal{J}^{\mathcal{M}^{k}_{1}}_{\mathrm{H}_{1},t}\upharpoonleft\downharpoonright\mathcal{J}^{\mathcal{M}^{k}_{2}}_{\mathrm{H}_{2},t}\right)\cap\left(\mathcal{Q}^{\mathcal{M}^{k}_{1}}_{\mathrm{H}_{1}}\upharpoonleft\downharpoonright\mathcal{Q}^{\mathcal{M}^{k}_{2}}_{\mathrm{H}_{2}}\right)\qquad(21)$$

where, for two generic match sets $\mathcal{M}_{1}$, $\mathcal{M}_{2}$, the $\upharpoonleft\downharpoonright$ operator rejoins split matches according to

$$\mathcal{M}_{1}\upharpoonleft\downharpoonright\mathcal{M}_{2}=\left\{(\mathbf{x}_{1},\mathbf{x}_{2}):(\mathbf{x}_{1},\mathbf{m})\in\mathcal{M}_{1}\wedge(\mathbf{x}_{2},\mathbf{m})\in\mathcal{M}_{2}\right\}\qquad(22)$$

All the other MOP steps defined in Sec. 3.1, including the RANSAC degeneracy checks, the threshold adaptation, and the final homography assignment, follow straightforwardly in an analogous way. Overall, the principal difference is that a pair of homographies $(\mathrm{H}_{m_{1}},\mathrm{H}_{m_{2}})$ is obtained instead of the single homography $\mathrm{H}_{m}$, operating simultaneously on the split match sets $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ instead of the original one $\mathcal{M}$.
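The following sketch illustrates the midpoint splitting of Eq. 19 and the rejoin test in the spirit of Eqs. 21-22, using a plain one-way transfer error as inlier criterion; all names are illustrative and the error measure is an assumption, not necessarily the exact one used by MOP.

```python
import numpy as np

def transfer_error(H, p, q):
    """One-way reprojection error of the points p onto q under H."""
    p_h = np.hstack([p, np.ones((len(p), 1))])
    proj = (H @ p_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - q, axis=1)

def split_matches(x1, x2):
    """Replace each match (x1, x2) by its two half-matches through the
    midpoint m = (x1 + x2) / 2 of Eq. 19."""
    mid = (x1 + x2) / 2.0
    return (x1, mid), (mid, x2)          # correspondences for H_1 and H_2

def miho_inliers(x1, x2, H1, H2, t):
    """A match survives only if both of its half-matches are inliers of the
    respective homography, mimicking the rejoin operator of Eq. 22."""
    mid = (x1 + x2) / 2.0
    return (transfer_error(H1, x1, mid) < t) & (transfer_error(H2, mid, x2) < t)
```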

Figure 2 shows the differences when applying MiHo to align planar scenes with respect to directly reprojecting onto either of the two images taken as reference. It can be noted that the distortion differences are overall reduced, as highlighted by the corresponding grid deformation. Moreover, the MiHo homography strongly resembles an affine transformation or even a similarity, i.e. a rotation and scale change only, imposing in some sense a tighter constraint on the transformation than the base homography. To balance this additional restraint with respect to the original MOP, in MOP+MiHo the minimum amount of inliers required to increase the failure counter defined by Eq. 4 has been relaxed, after experimental validation, from $n=12$ to $n=8$.

Figure 2: Visual comparison between the standard planar homography and the MiHo tied homographies as defined in Sec. 3.2.1. (2(a)) Two corresponding images, framed in blue, are used in turn as reference in the top and middle rows, while the other one is reprojected by a planar homography, shown in the same row. In the case of MiHo, both images are warped as shown in the bottom row, yet the overall distortion is reduced. The local transformation is better highlighted in (2(b)) with the related warping of the unit-square representation of the original images. (2(c),2(d)) The analogous visualization for a more critical configuration with a stronger relative viewpoint change. Best viewed in color and zoomed in.

3.2.3 Translation invariance

MiHo is invariant to translations as long as the algorithm used to extract the homography provides the same invariance. This can be easily verified since, when translation vectors $\mathbf{t}_{a}$, $\mathbf{t}_{b}$ are respectively added to all keypoints $\mathbf{x}_{1}\in I_{1}$ and $\mathbf{x}_{2}\in I_{2}$ to get $\mathbf{x}^{\prime}_{1}=\mathbf{x}_{1}+\mathbf{t}_{a}$ and $\mathbf{x}^{\prime}_{2}=\mathbf{x}_{2}+\mathbf{t}_{b}$, the corresponding midpoints are of the form

$$\mathbf{m}^{\prime}=\frac{\mathbf{x}^{\prime}_{1}+\mathbf{x}^{\prime}_{2}}{2}=\frac{\mathbf{x}_{1}+\mathbf{x}_{2}}{2}+\frac{\mathbf{t}_{a}+\mathbf{t}_{b}}{2}=\mathbf{m}+\mathbf{t}\qquad(23)$$

where $\mathbf{t}=\frac{\mathbf{t}_{a}+\mathbf{t}_{b}}{2}$ is a fixed translation vector added to all the corresponding original midpoints $\mathbf{m}$. Overall, translating the keypoints in the respective images has the effect of adding a fixed translation to the corresponding midpoints. Hence MiHo is invariant to translations, since the normalization employed by the normalized 8-point algorithm is invariant to similarity transformations [45], including translations.

3.2.4 Fixing rotations

MiHo is not rotation invariant. Figure 3 illustrates how midpoints change with rotations. In the ideal MiHo configuration, where the images are upright, the area of the image formed on the midpoint plane is within the areas of the original ones, making a single cone when considering the images as stacked. The area of the midpoint plane is lower than the minimum area of the two images when there is a rotation of about $180^{\circ}$, and could degenerate to the apex of the corresponding double cone. A simple strategy to get an almost-optimal configuration was experimentally verified to work well in practice, under the observation that images acquired by cameras tend to have a relative rotation $\alpha$ close to a multiple of $90^{\circ}$.

In detail, let us consider a given input match $m_{i}=(\mathbf{x}_{1i},\mathbf{x}^{\alpha}_{2i})\in\mathcal{M}^{\alpha}$ and the corresponding midpoint $\mathbf{m}^{\alpha}_{i}=\frac{\mathbf{x}_{1i}+\mathbf{x}^{\alpha}_{2i}}{2}$, where $\mathbf{x}_{1i}\in I_{1}$ and $\mathbf{x}^{\alpha}_{2i}\in I^{\alpha}_{2}$, with $I^{\alpha}_{2}$ being the image $I_{2}$ rotated by $\alpha$. One has to choose $\alpha^{\star}\in\mathcal{A}=\{0,\tfrac{\pi}{2},\pi,\tfrac{3\pi}{2}\}$ so as to maximize the number of keypoint pairs for which the midpoint inter-distance lies between the corresponding keypoint inter-distances

$$\alpha^{\star}=\operatorname*{argmax}_{\substack{m_{i},m_{j}\in\mathcal{M}^{\alpha}\\ \alpha\in\mathcal{A}}}\left\llbracket\,\lVert\mathbf{m}^{\alpha}_{i}-\mathbf{m}^{\alpha}_{j}\rVert\in\left[d_{\downarrow}^{ij},d^{\uparrow}_{ij}\right]\right\rrbracket\qquad(24)$$

where

$$\begin{aligned}d_{\downarrow}^{ij}&=\min\left(\lVert\mathbf{x}_{1i}-\mathbf{x}_{1j}\rVert,\lVert\mathbf{x}^{\alpha}_{2i}-\mathbf{x}^{\alpha}_{2j}\rVert\right)\\ d^{\uparrow}_{ij}&=\max\left(\lVert\mathbf{x}_{1i}-\mathbf{x}_{1j}\rVert,\lVert\mathbf{x}^{\alpha}_{2i}-\mathbf{x}^{\alpha}_{2j}\rVert\right)\end{aligned}\qquad(25)$$

and $\llbracket\cdot\rrbracket$ is the Iverson bracket. The global orientation estimation $\alpha^{\star}$ can be computed efficiently on the initial input matches $\mathcal{M}$ before running MOP, adjusting the keypoints accordingly. The final homographies can be adjusted in turn by removing the rotation given by $\alpha^{\star}$. Figure 4 shows the MOP+MiHo results obtained on a set of matches without and with the orientation pre-processing, highlighting the better matches obtained in the latter case.
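A compact sketch of the orientation selection of Eqs. 24-25 is reported below; rotating the keypoints of $I_{2}$ about the image center is an assumption made here for illustration, and the brute-force pairwise distances are adequate only for moderate numbers of matches.

```python
import numpy as np

def fix_rotation(x1, x2, center2):
    """Choose the rotation of I2, among multiples of 90 degrees, maximizing the
    number of keypoint pairs whose midpoint distance lies between the
    corresponding distances in the two images (Eqs. 24-25)."""
    best_alpha, best_score = 0.0, -1
    for alpha in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2):
        c, s = np.cos(alpha), np.sin(alpha)
        R = np.array([[c, -s], [s, c]])
        x2a = (x2 - center2) @ R.T + center2                       # rotated I2 keypoints
        mid = (x1 + x2a) / 2.0
        d1 = np.linalg.norm(x1[:, None] - x1[None], axis=-1)       # pairwise distances in I1
        d2 = np.linalg.norm(x2a[:, None] - x2a[None], axis=-1)     # ... in rotated I2
        dm = np.linalg.norm(mid[:, None] - mid[None], axis=-1)     # ... between midpoints
        score = np.sum((dm >= np.minimum(d1, d2)) & (dm <= np.maximum(d1, d2)))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```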

(a) $0^{\circ}$  (b) $90^{\circ}$  (c) $180^{\circ}$  (d) $270^{\circ}$
Figure 3: Midpoints and rotations with MiHo. The blue and red quadrilaterals are linked by a homography defined by the four corner correspondences, indicated by dashed gray lines. The midpoints of the corner correspondences identify the reference green quadrilateral and the two derived MiHo planar homographies with the original quadrilaterals. In the optimal MiHo configuration (3(a)) the area of the quadrilateral defined by the midpoints is maximized. Incremental relative rotations of the two original quadrilaterals by $90^{\circ}$ decrease the above area through (3(b)) to the minimum (3(c)). From the minimum, further rotations increase the area through (3(d)) to the maximum again. As a heuristic, in the best configuration the distance of any two midpoint corners should be within the distances of the corresponding original corners of both images, as detailed in Sec. 3.2.4. Note that for the specific example, in the worst case there is also a violation of the planar orientation constraints [45]. Best viewed in color and zoomed in.
(a) MOP+MiHo without rotation fixing
(b) MOP+MiHo with rotation fixing
Figure 4: MOP+MiHo visual clustering (4(a)) without and (4(b)) with rotation handling in the worst case of a $180^{\circ}$ relative rotation, as described in Sec. 3.2.4. The same notation and image pairs of Fig. 1 are used. Although the results with rotation handling are clearly better, they are not the same as in Fig. 1, since the input matches provided by Key.Net are different, as OriNet [84] has been further included in the pipeline to compute patch orientations. Notice that the base MOP obtains results similar to (4(b)) without any special need to handle image rotations. Best viewed in color and zoomed in.

3.3 Normalized Cross Correlation (NCC)

3.3.1 Local image warping

NCC is employed to effectively refine the corresponding keypoint localization. The approach assumes that corresponding keypoints have been roughly aligned locally by the transformations provided as a homography pair. In the warped aligning space, template matching [48] is employed to refine the translation offsets so as to maximize the NCC peak response when the patch of one keypoint is employed as a filter on the image of the other corresponding keypoint.

For the match $m=(\mathbf{x}_{1},\mathbf{x}_{2})$ within the images $I_{1}$ and $I_{2}$ there can be multiple potential choices of the warping homography pair, denoted by the set of associated homography pairs $\mathcal{H}_{m}=\{(\mathrm{H}_{1},\mathrm{H}_{2})\}$. In detail, according to the previous steps of the matching pipeline, one candidate is the base homography pair $(\mathrm{H}^{b}_{1},\mathrm{H}^{b}_{2})$ defined as

  • $(\mathrm{I},\mathrm{I})$, where $\mathrm{I}\in\mathbb{R}^{3\times 3}$ is the identity matrix, when no local neighborhood data are provided, e.g. for SuperPoint;
  • $(\mathrm{A}_{1},\mathrm{A}_{2})$, where $\mathrm{A}_{1},\mathrm{A}_{2}\in\mathbb{R}^{3\times 3}$ are affine or similarity transformations in case of an affine covariant detector & descriptor, as for AffNet or SIFT, respectively;

and the other is the extended pair $(\mathrm{H}^{e}_{1},\mathrm{H}^{e}_{2})$ defined as

  • $(\mathrm{I},\mathrm{H}_{m})$, where $\mathrm{H}_{m}$ is defined in Eq. 18, in case MOP is used without MiHo;
  • $(\mathrm{H}_{m_{1}},\mathrm{H}_{m_{2}})$ as described in Sec. 3.2.1, in case of MOP+MiHo;
  • $(\mathrm{H}^{b}_{1},\mathrm{H}^{b}_{2})$ as above otherwise.

Since the found transformations are approximated, the extended homography pair is perturbed in practice by small shear factors $f$ and rotation changes $\rho$ through the affine transformations

$$\mathrm{A}_{\rho,f}=\begin{bmatrix}f\cos(\rho)&-f\sin(\rho)&0\\ \sin(\rho)&\cos(\rho)&0\\ 0&0&1\end{bmatrix}\qquad(26)$$

with 

$$\rho\in\mathcal{R}=\left\{-\tfrac{\pi}{6},-\tfrac{\pi}{12},0,\tfrac{\pi}{12},\tfrac{\pi}{6}\right\}\qquad(27)$$

$$f\in\mathcal{F}=\left\{\tfrac{5}{7},\tfrac{5}{6},1,\tfrac{6}{5},\tfrac{7}{5}\right\}\qquad(28)$$

so that

$$\begin{aligned}\mathcal{H}_{m}={}&\left\{\left(\mathrm{H}^{b}_{1},\mathrm{H}^{b}_{2}\right)\right\}\\ &\cup\left\{\left(\mathrm{A}_{\rho,f}\mathrm{H}^{e}_{1},\mathrm{H}^{e}_{2}\right):(\rho,f)\in\mathcal{R}\times\mathcal{F}\right\}\\ &\cup\left\{\left(\mathrm{H}^{e}_{1},\mathrm{A}_{\rho,f}\mathrm{H}^{e}_{2}\right):(\rho,f)\in\mathcal{R}\times\mathcal{F}\right\}\end{aligned}\qquad(29)$$

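The candidate set of Eq. 29 can be enumerated as in the sketch below, where the perturbation matrix follows Eq. 26 and the homographies $\mathrm{H}^{b}_{1},\mathrm{H}^{b}_{2},\mathrm{H}^{e}_{1},\mathrm{H}^{e}_{2}$ are assumed to be given as 3x3 arrays; names are illustrative.

```python
import numpy as np

RHO = [-np.pi / 6, -np.pi / 12, 0.0, np.pi / 12, np.pi / 6]    # rotation perturbations (Eq. 27)
F = [5 / 7, 5 / 6, 1.0, 6 / 5, 7 / 5]                          # shear factors (Eq. 28)

def perturbation(rho, f):
    """Affine perturbation A_{rho,f} of Eq. 26."""
    return np.array([[f * np.cos(rho), -f * np.sin(rho), 0.0],
                     [np.sin(rho), np.cos(rho), 0.0],
                     [0.0, 0.0, 1.0]])

def candidate_pairs(Hb1, Hb2, He1, He2):
    """Warping homography pairs H_m of Eq. 29."""
    pairs = [(Hb1, Hb2)]
    for rho in RHO:
        for f in F:
            A = perturbation(rho, f)
            pairs.append((A @ He1, He2))     # perturb the first-image warp
            pairs.append((He1, A @ He2))     # perturb the second-image warp
    return pairs
```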
3.3.2 Computation

Let us denote a generic image by $I$, the intensity value of $I$ at a generic pixel $\mathbf{x}\in\mathbb{R}^{2}$ as $I(\mathbf{x})$, and the window radius extension as

$$\mathcal{W}_{r}=\left\{x\in[-r,r]\cap\mathbb{Z}\right\}\qquad(30)$$

such that the squared window set of offsets is

$$\mathcal{W}^{2}_{r}=\mathcal{W}_{r}\times\mathcal{W}_{r}\qquad(31)$$

and has cardinality equal to

$$\left|\mathcal{W}^{2}_{r}\right|=(2r+1)^{2}\qquad(32)$$

The squared patch of $I$ centered in $\mathbf{x}$ with radius $r$ is the pixel set

$$\mathcal{P}^{I}_{\mathbf{x},r}=\left\{I(\mathbf{x}+\mathbf{w}):\mathbf{w}\in\mathcal{W}^{2}_{r}\right\}\qquad(33)$$

while the mean intensity value of the patch $\mathcal{P}^{I}_{\mathbf{x},r}$ is

$$\mu^{I}_{\mathbf{x},r}=\frac{1}{\left|\mathcal{W}^{2}_{r}\right|}\sum_{\mathbf{w}\in\mathcal{W}^{2}_{r}}I(\mathbf{x}+\mathbf{w})\qquad(34)$$

and the variance is

$$\sigma^{I}_{\mathbf{x},r}=\frac{1}{\left|\mathcal{W}^{2}_{r}\right|}\sum_{\mathbf{w}\in\mathcal{W}^{2}_{r}}\left(I(\mathbf{x}+\mathbf{w})-\mu^{I}_{\mathbf{x},r}\right)^{2}\qquad(35)$$

The intensity value of a pixel $\mathbf{y}$ in $I$, normalized by the mean and standard deviation of the patch $\mathcal{P}^{I}_{\mathbf{x},r}$, is

$$\overline{I}_{\mathbf{x},r}(\mathbf{y})=\frac{I(\mathbf{y})-\mu^{I}_{\mathbf{x},r}}{\sqrt{\sigma^{I}_{\mathbf{x},r}}}\qquad(36)$$

which is ideally robust to affine illumination changes [48]. Lastly, the similarity between two patches 𝒫𝐚,rA\mathcal{P}^{A}_{\mathbf{a},r}caligraphic_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_a , italic_r end_POSTSUBSCRIPT and 𝒫𝐛,rB\mathcal{P}^{B}_{\mathbf{b},r}caligraphic_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_b , italic_r end_POSTSUBSCRIPT is
这理想地对仿射光照变化具有鲁棒性[48]。最后,两个块 𝒫𝐚,rAsubscriptsuperscript\mathcal{P}^{A}_{\mathbf{a},r}caligraphic_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_a , italic_r end_POSTSUBSCRIPT𝒫𝐛,rBsubscriptsuperscript\mathcal{P}^{B}_{\mathbf{b},r}caligraphic_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_b , italic_r end_POSTSUBSCRIPT 之间的相似性是

S^{A,B}_{\mathbf{a},\mathbf{b},r}=\sum_{\mathbf{w}\in\mathcal{W}^{2}_{r}}\overline{A}_{\mathbf{a},r}(\mathbf{a}+\mathbf{w})\,\overline{B}_{\mathbf{b},r}(\mathbf{b}+\mathbf{w})    (37)

so that, for the match $m=(\mathbf{x}_{1},\mathbf{x}_{2})$ and a set of associated homography pairs $\mathcal{H}_{m}$ derived in Eq. 3.3.1, the refined keypoint offsets $(\mathbf{t}_{1}^{\star},\mathbf{t}_{2}^{\star})$ and the best aligning homography pair $(\mathrm{H}_{1}^{\star},\mathrm{H}_{2}^{\star})$ are given by

T^{I_{1},I_{2}}_{m}=((\mathbf{t}_{1}^{\star},\mathbf{t}_{2}^{\star}),(\mathrm{H}_{1}^{\star},\mathrm{H}_{2}^{\star}))=\operatorname*{argmax}_{\substack{(\mathbf{t}_{1},\mathbf{t}_{2})\in\overleftrightarrow{\mathcal{W}}^{2}_{r}\\(\mathrm{H}_{1},\mathrm{H}_{2})\in\mathcal{H}_{m}}}S^{\mathrm{H}_{1}\circ I_{1},\,\mathrm{H}_{2}\circ I_{2}}_{\mathrm{H}_{1}\mathbf{x}_{1}+\mathbf{t}_{1},\,\mathrm{H}_{2}\mathbf{x}_{2}+\mathbf{t}_{2},\,r}    (38)

where $\mathrm{H}\circ I$ denotes the warp of the image $I$ by $\mathrm{H}$, $\mathrm{H}\mathbf{x}$ is the transformed pixel coordinate, and

\overleftrightarrow{\mathcal{W}}^{2}_{r}=\left(\mathbf{0}\times\mathcal{W}^{2}_{r}\right)\cup\left(\mathcal{W}^{2}_{r}\times\mathbf{0}\right)    (39)

is the set of offset translations to check, considering in turn one of the images as the template, since this is not a symmetric process. Notice that, according to $\overleftrightarrow{\mathcal{W}}^{2}_{r}$, $\mathbf{t}^{\star}_{1}$ and $\mathbf{t}^{\star}_{2}$ cannot both differ from $\mathbf{0}$. The final refined match

m^{\star}=\left(\mathbf{x}_{1}^{\star},\mathbf{x}_{2}^{\star}\right)    (40)

is obtained by reprojecting back to the original images, i.e.

\mathbf{x}_{1}^{\star}={\mathrm{H}^{\star}_{1}}^{-1}\left(\mathrm{H}^{\star}_{1}\mathbf{x}_{1}+\mathbf{t}_{1}^{\star}\right),\qquad\mathbf{x}_{2}^{\star}={\mathrm{H}^{\star}_{2}}^{-1}\left(\mathrm{H}^{\star}_{2}\mathbf{x}_{2}+\mathbf{t}_{2}^{\star}\right)    (41)

The normalized cross-correlation can be computed efficiently by convolution, where bilinear interpolation is employed to warp the images. The patch radius $r$, which also serves as the offset search radius, has been experimentally set to $r=10\approx\sqrt{2}\,t_{h}$ px. According to preliminary experiments, a wider search radius can break the planar neighborhood assumption and the patch alignment.
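As an illustration of the above search, the following NumPy sketch mimics Eqs. 37-39 under simplifying assumptions: the two images are assumed to be already warped by a candidate homography pair, keypoints are given as integer pixel coordinates away from the image borders, and only the direction with the first image as template (i.e. $\mathbf{t}_{1}=\mathbf{0}$) is shown. All function and variable names are illustrative and not taken from the actual MOP+MiHo+NCC implementation.

```python
import numpy as np

def normalize_patch(patch, eps=1e-8):
    # Zero-mean, unit-variance normalization of a square patch (Eqs. 34-36);
    # eps guards against flat patches, an assumption added for this sketch.
    return (patch - patch.mean()) / np.sqrt(patch.var() + eps)

def ncc_offset_search(warped_a, warped_b, xa, xb, r=10):
    # Brute-force NCC template matching (Eqs. 37-39): the patch of radius r around
    # keypoint xa in the warped image A is kept fixed (t1 = 0) and compared against
    # patches centered at xb + t2 in the warped image B, for all offsets |t2| <= r.
    ya, xa_ = xa[1], xa[0]
    tpl = normalize_patch(warped_a[ya - r:ya + r + 1, xa_ - r:xa_ + r + 1])
    best_score, best_t2 = -np.inf, (0, 0)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yb, xb_ = xb[1] + dy, xb[0] + dx
            win = normalize_patch(warped_b[yb - r:yb + r + 1, xb_ - r:xb_ + r + 1])
            score = float((tpl * win).sum())          # Eq. 37
            if score > best_score:
                best_score, best_t2 = score, (dx, dy)
    return best_score, best_t2
```

In the full method the same search is repeated with the roles of the two images swapped and over all homography pairs in $\mathcal{H}_{m}$, and the exhaustive loop above is replaced by a convolution-based NCC for efficiency, as stated in the text.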

3.3.3 Sub-pixel precision

The refinement offset can be further enhanced with sub-pixel precision by parabolic interpolation [102]. Notice that other forms of interpolation have been investigated, but according to previous experiments [49] they did not provide any additional sub-pixel accuracy improvement. The NCC response map at $\mathbf{u}\in\mathbb{R}^{2}$, in the best warped space with origin at the peak value, can be written as

S^{\star}_{m}(\mathbf{u})=S^{\mathrm{H}^{\star}_{1}\circ I_{1},\,\mathrm{H}^{\star}_{2}\circ I_{2}}_{\mathrm{H}^{\star}_{1}\mathbf{x}_{1}+\mathbf{t}^{\star}_{1}+\mathbf{u},\,\mathrm{H}^{\star}_{2}\mathbf{x}_{2}+\mathbf{t}^{\star}_{2}+\mathbf{u},\,r}    (42)

from $T_{m}^{I_{1},I_{2}}$ computed by Eq. 3.3.2. The sub-pixel refinement offset in the horizontal direction is computed as the vertex $-\tfrac{b}{2a}$ of a parabola $ax^{2}+bx+c=y$ fitted on the horizontal neighborhood centered at the peak of the maximized NCC response map, i.e.

a=\frac{S^{\star}_{m}\!\left(\left[\begin{smallmatrix}1\\0\end{smallmatrix}\right]\right)-2S^{\star}_{m}\!\left(\left[\begin{smallmatrix}0\\0\end{smallmatrix}\right]\right)+S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\text{-}1\\\;0\end{smallmatrix}\right]\right)}{2},\qquad b=\frac{S^{\star}_{m}\!\left(\left[\begin{smallmatrix}1\\0\end{smallmatrix}\right]\right)-S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\text{-}1\\\;0\end{smallmatrix}\right]\right)}{2}    (43)

and analogously for the vertical offset. Explicitly, the sub-pixel refinement offset is $\mathbf{p}=\left[\begin{smallmatrix}p_{x}\\p_{y}\end{smallmatrix}\right]$ where

p_{x}=\frac{S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\text{-}1\\\;0\end{smallmatrix}\right]\right)-S^{\star}_{m}\!\left(\left[\begin{smallmatrix}1\\0\end{smallmatrix}\right]\right)}{2\left(S^{\star}_{m}\!\left(\left[\begin{smallmatrix}1\\0\end{smallmatrix}\right]\right)-2S^{\star}_{m}\!\left(\left[\begin{smallmatrix}0\\0\end{smallmatrix}\right]\right)+S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\text{-}1\\\;0\end{smallmatrix}\right]\right)\right)},\qquad p_{y}=\frac{S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\;0\\\text{-}1\end{smallmatrix}\right]\right)-S^{\star}_{m}\!\left(\left[\begin{smallmatrix}0\\1\end{smallmatrix}\right]\right)}{2\left(S^{\star}_{m}\!\left(\left[\begin{smallmatrix}0\\1\end{smallmatrix}\right]\right)-2S^{\star}_{m}\!\left(\left[\begin{smallmatrix}0\\0\end{smallmatrix}\right]\right)+S^{\star}_{m}\!\left(\left[\begin{smallmatrix}\;0\\\text{-}1\end{smallmatrix}\right]\right)\right)}    (44)

and from Eq. 3.3.2 the final refined match

m^{\star\star}=\left(\mathbf{x}_{1}^{\star\star},\mathbf{x}_{2}^{\star\star}\right)    (45)

becomes

\mathbf{x}_{1}^{\star\star}={\mathrm{H}^{\star}_{1}}^{-1}\left(\mathrm{H}^{\star}_{1}\mathbf{x}_{1}+\mathbf{t}_{1}^{\star}+\left\llbracket\,\mathbf{t}_{1}^{\star}\neq\mathbf{0}\,\right\rrbracket\,\mathbf{p}\right),\qquad\mathbf{x}_{2}^{\star\star}={\mathrm{H}^{\star}_{2}}^{-1}\left(\mathrm{H}^{\star}_{2}\mathbf{x}_{2}+\mathbf{t}_{2}^{\star}+\left\llbracket\,\mathbf{t}_{2}^{\star}\neq\mathbf{0}\,\right\rrbracket\,\mathbf{p}\right)    (46)

where the Iverson bracket zeroes the sub-pixel offset increment $\mathbf{p}$ in the reference image according to Eq. 39.
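For reference, a minimal NumPy sketch of the parabolic sub-pixel refinement of Eq. 44 is reported below. It assumes that the $3\times3$ neighborhood of the integer NCC peak is available as a small response map, and the $\pm 0.5$ px clamp is an extra safeguard not stated in the text.

```python
import numpy as np

def parabolic_subpixel(resp, eps=1e-12):
    # resp: 3x3 NCC response map centered on the integer peak, indexed as
    # resp[row, col] with rows increasing downwards (y) and columns rightwards (x).
    c = resp[1, 1]
    left, right = resp[1, 0], resp[1, 2]   # S([-1,0]), S([1,0])
    up, down = resp[0, 1], resp[2, 1]      # S([0,-1]), S([0,1])
    px = (left - right) / (2.0 * (right - 2.0 * c + left) + eps)   # Eq. 44
    py = (up - down) / (2.0 * (down - 2.0 * c + up) + eps)
    return np.clip(np.array([px, py]), -0.5, 0.5)  # clamp added for robustness
```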

4 Evaluation

4.1 Setup

To ease the reading, each pipeline module or error measure that will be included in the results is highlighted in bold.

4.1.1 Base pipelines

Seven base pipelines have been tested to provide the input matches to be filtered. For each of them, the maximum number of keypoints was set to 8000 to retain as many matches as possible. This leads to including more outliers and hence to better assessing the filtering potential. Moreover, current benchmarks and deep matching architecture designs assume upright image pairs, where the relative orientation between the images is roughly no more than $\frac{\pi}{8}$. This additional constraint allows better initial matches to be retrieved in common user acquisition scenarios. Current general standard benchmark datasets, including those employed in this evaluation, are built so that the image pairs do not violate this constraint. Notice that many SOTA joint and end-to-end deep architectures do not tolerate strong rotations within images by design.

SIFT+NNR [19] is included as the standard and reference handcrafted pipeline. The OpenCV [103] implementation was used for SIFT, exploiting RootSIFT [104] for descriptor normalization, while the NNR implementation is provided by Kornia [105]. To stress the match filtering robustness of the successive steps of the matching pipeline, which is the goal of this evaluation, the NNR ratio threshold was set rather high, i.e. to 0.95, while commonly adopted values range in $[0.7,0.9]$ depending on the scene complexity, with higher values yielding less discriminative matches and thus possible outliers. Moreover, upright patches are employed by zeroing the dominant orientation, for a fair comparison with other recent deep approaches.
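For clarity, the NNR stage boils down to Lowe's ratio test on descriptor distances. The plain NumPy sketch below is only illustrative (the evaluation actually relies on the Kornia implementation); the deliberately loose 0.95 threshold lets more tentative matches, and hence more outliers, through.

```python
import numpy as np

def nn_ratio_matches(desc1, desc2, ratio=0.95):
    # Nearest-neighbor ratio test: keep a match only if the distance to the best
    # candidate is below `ratio` times the distance to the second-best one.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :2]
    first = d[np.arange(len(desc1)), nn[:, 0]]
    second = d[np.arange(len(desc1)), nn[:, 1]] + 1e-12
    keep = first / second < ratio
    return np.stack([np.flatnonzero(keep), nn[keep, 0]], axis=1)  # keypoint index pairs
```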

Key.Net+AffNet+HardNet+NNR [11], a modular pipeline that achieves good results in common benchmarks, is also taken into account. Excluding the NNR stage, it can be considered a modular deep pipeline. The Kornia implementation is used for the evaluation. As for SIFT, the NNR threshold is set very high, i.e. to 0.99, while more common values range in $[0.8,0.95]$. The deep orientation estimation module OriNet [84], which usually sides with AffNet, is not included so as to provide upright matches.

The other base pipelines considered are SOTA full end-to-end matching networks. In detail, these are SuperPoint+LightGlue, DISK+LightGlue and ALIKED+LightGlue as implemented in [12]. The input images are rescaled so that the smaller size is 1024, following the LightGlue default. For ALIKED, the aliked-n16rot model weights are employed, which according to the authors [13] can handle slight image rotations better and are more stable under viewpoint changes.

The last base pipelines added to the evaluation are DeDoDe v2 [93], which provides an alternative end-to-end deep architecture different from the above, and LoFTR, a semi-dense detector-free end-to-end network. These pipelines also achieve SOTA results in popular benchmarks. The authors' implementation is employed for DeDoDe v2, setting the matching threshold to 0.01, while the LoFTR implementation available through Kornia is used for the latter.

4.1.2 Match filtering and post-processing

Eleven match filters, besides MOP+MiHo+NCC, are applied after the base pipelines and included in the evaluation. For the proposed filtering sub-pipeline, all five available combinations have been tested, i.e. MOP, MOP+NCC, NCC, MOP+MiHo and MOP+MiHo+NCC, to analyze the behavior of each module. In particular, applying NCC after SIFT+NNR or Key.Net+AffNet+HardNet+NNR can highlight the improvement introduced by MOP, which uses homography-based patch normalization, against similarity or affine patches, respectively. Likewise, applying NCC to the remaining end-to-end architectures should underline the importance of patch normalization. The setup parameters indicated in Sec. 3, which worked well in preliminary experiments, are employed. Notice that the MiHo orientation adjustment described in Sec. 3.2.4 is applied even though it is not required in the case of upright images. This allows an indirect assessment of the correctness of the base assumptions and of the overall method robustness, since in the case of a wrong orientation estimation the consequent failure can only decrease the match quality.

To provide a fair and general comparison, the analysis was restricted to filters that require as input only the keypoint positions of the matches and the image pair. This excludes approaches additionally requiring descriptor similarities [25, 28, 27], related patch transformations [71], intermediate network representations [42, 43] or an SfM framework [101]. The compared methods are GMS [26], AdaLAM [33] and CC [79] as handcrafted filters, and ACNe [35], CLNet [37], ConvMatch [40], DeMatch [41], FC-GNN [44], MS2DG-Net [39], NCMNet [38] and OANet [36] as deep modules. The implementations from the respective authors have been employed for all filters, with the default setup parameters if not stated otherwise. Except for FC-GNN and OANet, the deep filters require as input the intrinsic camera matrices of the two images, which are not commonly available in the real, practical scenarios the proposed evaluation was designed for. To bypass this restriction, the approach presented in ACNe was employed, for which the intrinsic camera matrix

\mathrm{K}=\begin{bmatrix}f&0&c_{x}\\0&f&c_{y}\\0&0&1\end{bmatrix}    (47)

for an image $I$ with a resolution of $w\times h$ pixels is estimated by setting the focal length to

f=\max(h,w)    (48)

and the camera centre as

\mathbf{c}=\begin{bmatrix}c_{x}\\c_{y}\end{bmatrix}=\frac{1}{2}\begin{bmatrix}w\\h\end{bmatrix}    (49)

The above focal length estimation $f$ is quite reasonable according to the statistics reported in Fig. 5, which have been extracted from the evaluation data and are discussed in Sec. 4.1.3. Notice also that most of the deep filters have been trained on SIFT matches; yet, to be robust and generalizable, they should mainly depend only on the scene and not on the kind of feature extracted by the pipeline.
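In code, the heuristic of Eqs. 47-49 amounts to a few lines; the NumPy sketch below is purely illustrative and the function name is not part of any released implementation.

```python
import numpy as np

def approximate_intrinsics(w, h):
    # Approximate intrinsic matrix when calibration is unavailable (Eqs. 47-49):
    # focal length set to the longest image side, principal point at the image center.
    f = float(max(w, h))
    return np.array([[f, 0.0, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])
```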

For AdaLAM, no initial seeds or local similarity or affine clues were used as additional input, as discussed above. Nevertheless, this kind of data could have been used only for SIFT+NNR or Key.Net+AffNet+HardNet+NNR. For CC, the inlier threshold was experimentally set to $t_{h}=7$ px as in Sec. 3.1.2, instead of the 1.5 px value suggested by the authors. Otherwise, CC would not be able to work in general scenes with no planes. Note that AdaLAM and CC are the closest approaches to the proposed MOP+MiHo filtering pipeline. Moreover, among the compared methods, only FC-GNN is able to refine matches as MOP+MiHo+NCC does.

RANSAC is also optionally evaluated as the final post-processing step to filter matches according to epipolar or planar constraints. Three different RANSAC implementations were initially considered, i.e. the RANSAC implementation in PoseLib (https://github.com/PoseLib/PoseLib), DegenSAC (https://github.com/ducha-aiki/pydegensac) [24] and MAGSAC (https://github.com/danini/magsac) [23]. The maximum number of iterations for all RANSACs is set to $10^{5}$, which is uncommonly high, to provide a more robust pose estimation and to make the whole pipeline depend only on the previous filtering and refinement stages. Five different inlier threshold values $t_{SAC}$, i.e.

t_{SAC}\in\{0.5\,\text{px},\,0.75\,\text{px},\,1\,\text{px},\,3\,\text{px},\,5\,\text{px}\}    (50)

are considered. After a preliminary RANSAC ablation study (https://github.com/fb82/MiHo/tree/main/data/results) on the proposed benchmark, the best general choice found is MAGSAC with either 0.75 px or 1 px as the inlier threshold $t_{SAC}$. On one hand, the former, stricter threshold is generally slightly better in terms of accuracy; on the other hand, the latter provides more inliers, so both results will be presented. Matching outputs without RANSAC are also reported for a complete analysis.
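As an example of this post-processing step, MAGSAC++ can also be run through the USAC interface of recent OpenCV releases. The snippet below is only a sketch with illustrative parameter values, assuming an OpenCV build (roughly 4.5 or later) that exposes cv2.USAC_MAGSAC; it is not the standalone MAGSAC implementation referenced above.

```python
import cv2
import numpy as np

def magsac_filter(pts1, pts2, t_sac=0.75, max_iters=100000):
    # Epipolar filtering of tentative matches with MAGSAC++ via OpenCV's USAC
    # interface; returns the fundamental matrix and the boolean inlier mask.
    F, mask = cv2.findFundamentalMat(
        np.asarray(pts1, dtype=np.float64), np.asarray(pts2, dtype=np.float64),
        cv2.USAC_MAGSAC, ransacReprojThreshold=t_sac,
        confidence=0.9999, maxIters=max_iters)
    inliers = mask.ravel().astype(bool) if mask is not None else np.zeros(len(pts1), bool)
    return F, inliers
```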

Figure 5: (5(a)) Probability distribution of the ratio between the GT focal length $f$ and the maximum image dimension $\max(h,w)$ in the non-planar outdoor MegaDepth and IMC PhotoTourism datasets introduced in Sec. 4.1.3. (5(b)) The difference between the outdoor data and the indoor ScanNet dataset, also presented in Sec. 4.1.3, is shown on a logarithmic scale for a more effective visual comparison. For the indoor scenes, the ratio is fixed to 0.9 as images have been acquired by the same iPad Air2 device [55]. For the outdoor data, images have been acquired by different cameras and the ratio values cluster roughly around 1. As a heuristic derived by inspecting the images, the ratio is roughly equal to the zooming factor employed for acquiring the image. (5(c)) An analogous representation as a scatter plot in terms of the GT camera center $[c_{x}\,c_{y}]^{T}$ corroborates the previous histograms and the intrinsic camera approximation heuristic used in the experiments described in Sec. 4.1.2. (5(d)) The 2D distribution of the ratios for the image pairs employed in the evaluation, as an overlay in RGB color channels. Notice that the distribution overlap of the outdoor datasets with the corresponding indoor dataset is almost null, as can be checked in (5(b)) and by inspecting the cross shape in the bottom left of (5(d)). Best viewed in color and zoomed in.

4.1.3 Datasets

Four different datasets have been employed in the evaluation. These include the MegaDepth [54] and ScanNet [55] datasets, respectively outdoor and indoor. The same 1500 image pairs for each dataset and the Ground-Truth (GT) data indicated in the protocol employed in [14] are used. These datasets are the de facto standard in current image matching benchmarking; sample image pairs for each dataset are shown in Fig. 1. MegaDepth test image pairs belong to only 2 different scenes and are resized proportionally so that the maximum size is equal to 1200 px, while ScanNet image pairs belong to 100 different scenes and are resized to $640\times480$ px. Although according to a previous evaluation [12] LightGlue provides better results when the images are rescaled so that the maximum dimension is 1600 px, in this evaluation the original 1200 px resize was maintained, since it yields more outliers and lower keypoint precision, providing a configuration in which the filtering and refinement of the matches can be better revealed. For deep methods, the best weights that fit the kind of scene, i.e. indoor or outdoor, are used when available.

To provide further insight into the case of outdoor scenes, the Image Matching Challenge (IMC) PhotoTourism dataset (https://www.kaggle.com/competitions/image-matching-challenge-2022/data) is also employed. The IMC PhotoTourism (IMC-PT) dataset is a curated collection of sixteen scenes derived from the SfM models of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [106]. For each scene, 800 image pairs have been randomly chosen, resulting in a total of 12800 image pairs. These scenes also provide the 3D model scale as GT data, so that metric evaluations are possible. Note that the MegaDepth data are roughly a subset of IMC-PT. Furthermore, YFCC100M has often been exploited to train deep image matching methods, so it can be assumed that some of the compared deep filters are advantaged on this dataset, even if this information cannot be retrieved or elaborated. Nevertheless, the proposed modules are handcrafted and not positively biased, so the comparison remains valid for the main aim of the evaluation.

Lastly, the Planar dataset contains 135 image pairs from 35 scenes collected from HPatches [56], the Extreme View dataset (EVD) [57] and further datasets aggregated from [70, 32, 50]. Each scene usually includes five image pairs, except those belonging to EVD, which consist of a single image pair. The Planar dataset includes scenes with challenging viewpoint changes, possibly paired with strong illumination variations. All image pairs are adjusted to be roughly non-upright. Outdoor model weights are preferred for the deep modules in the case of planar scenes. A thumbnail gallery of the scenes in the Planar dataset is shown in Fig. 6.

Figure 6: Example image pairs for each scene in the Planar dataset introduced in Sec. 4.1.3. Frame colors refer in order to the original dataset of the scene: HPatches [56], EVD [57], Oxford [70], [32] and [50]. Best viewed in color and zoomed in.

4.1.4 Error metrics

For non-planar datasets, unlike most common benchmarks