AAGS: Appearance-Aware 3D Gaussian Splatting with Unconstrained Photo Collections
Wencong Zhang^1, Zhiyang Guo^2, Wengang Zhou^1, Houqiang Li^1
^1 CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China, Fuxing Street, Hefei, 230052, Anhui, China.
Abstract
Reconstructing 3D scenes from unconstrained collections of in-the-wild photographs has consistently been a challenging problem. The main difficulty lies in the varying appearance conditions and transient occluders of uncontrolled image samples. With the advancement of Neural Radiance Fields (NeRF), previous works have developed effective strategies to tackle this issue. However, limited by deep networks and volumetric rendering techniques, these methods generally require substantial time costs. Recently, the advent of 3D Gaussian Splatting (3DGS) has significantly accelerated the training and rendering of 3D reconstruction tasks. Nevertheless, vanilla 3DGS struggles to distinguish the varying appearances of in-the-wild photo collections. To address the aforementioned problems, we propose Appearance-Aware 3D Gaussian Splatting (AAGS), a novel extension of 3DGS to unconstrained photo collections. Specifically, we employ an appearance extractor to capture global features for image samples, enabling the distinction of visual conditions, e.g., illumination and weather, across different observations. Furthermore, to mitigate the impact of transient occluders, we design a transient-removal module that adaptively learns a 2D visibility map to decompose the static target from complex real-world scenes. Extensive experiments are conducted to validate the effectiveness and superiority of our AAGS. Compared with previous works, our method not only achieves better reconstruction and rendering quality, but also significantly reduces both training and rendering overhead. Code will be released at https://github.com/Zhang-WenCong/AAGS.
Fig. 1 Using in-the-wild photos as input (a), our method is able to capture different appearances from diverse observations, render in-the-wild scenes from novel viewpoints, and effectively remove the impact of transient occluders (b).
1 Introduction
The tasks of novel view synthesis and 3D reconstruction have consistently been at the forefront of research in computer graphics and computer vision [1, 2]. Since the first introduction of Neural Radiance Fields (NeRF) [3], this implicit 3D representation has exhibited promising results in various tasks by integrating traditional volumetric rendering techniques with advanced deep learning methodologies. Utilizing dozens of photographs from different viewpoints along with their corresponding camera parameters, NeRF trains a neural network to represent the 3D scene, achieving groundbreaking performance in photo-realistic novel view synthesis. Recently, 3D Gaussian Splatting (3DGS) [4] has garnered widespread attention due to its efficient reconstruction capabilities compared with NeRF. Employing explicit representations alongside a rasterization-based rendering process, 3DGS achieves real-time rendering speeds while maintaining rendering quality comparable to state-of-the-art NeRF methods [5-8].
Typically, both NeRF [3] and 3DGS [4] primarily focus on the reconstruction of static scenes. In other words, the input photo collections are assumed to be free of appearance variation and transient objects. However, for in-the-wild scenes such as scenic spots, the available photos are usually captured within a large time span, from days up to even years. Therefore, it is difficult to control the capturing conditions, resulting in diverse seasons, weather, illumination, and other appearance variances across photos. Even photos taken at the same location and time can have diverse appearances due to differences in device settings, e.g., exposure time, filters, and tone-mapping. Worse still, different transient objects, e.g., pedestrians, tourists, and vehicles, become intractable obstacles within the reconstruction process. If we directly apply vanilla NeRF or 3DGS to the aforementioned photo collections, the reconstructed scene will be filled with artifacts and floaters. This is exactly what the unconstrained reconstruction task tries to wrestle with: how to reconstruct a scene with different appearance conditions from unconstrained photo collections. Addressing this problem typically confronts two major challenges: different appearances across images and potential interference from transient occluders.
Several pioneers have attempted to tackle these challenges using various strategies. NeRF-W [9] assigns an appearance embedding to each image and adds transient confidence to the radiance field, thereby separately modeling static and transient objects with appearance awareness. Subsequent efforts, such as Ha-NeRF [10] and CR-NeRF [11], leverage image features more effectively to obtain better appearance embeddings and employ 2D visibility maps to exclude transient objects. However, these methods are limited by NeRF's inherent rendering quality and efficiency, requiring dozens of hours to train and tens of seconds to render a single image. A recent concurrent 3DGS-based method, SWAG [12], significantly accelerates training and rendering compared to previous NeRF-based methods. However, it still uses the per-image embedding approach of NeRF-W [9] and thus encounters a similar issue: it requires half of the test set to train the appearance embeddings for those images.
To address the aforementioned issues, in this work we propose Appearance-Aware 3D Gaussian Splatting (AAGS), a novel method based on 3DGS for reconstructing scenes from unconstrained photo collections. Specifically, we observe that visual appearance is determined by both the physical properties of the object itself and the external lighting conditions. Therefore, we assign a specific feature vector (instead of color as in vanilla 3DGS) as a new intrinsic property to each Gaussian, and meanwhile employ an appearance extractor to obtain global features from different images. These appearance features, together with the intrinsic property of the Gaussians, are processed by a color decoder to obtain the final per-Gaussian emitted color. Furthermore, since our objective is to render the static scene, the additional reconstruction of transient objects is an unnecessary resource consumption. Therefore, we utilize a transient UNet to adaptively generate a 2D visibility map that helps exclude transient occluders during optimization. In contrast to previous NeRF-based approaches that individually query pixel visibility, our method directly performs prediction on the entire image, enabling more effective utilization of spatial contextual information and enhancing the accuracy of segmenting transient and static objects.
In summary, our contributions are as follows:
We propose a novel method, AAGS, which effectively extracts global appearance features from diverse unconstrained images and integrates them with the per-Gaussian intrinsic property to render in-the-wild scenes under varying appearances.
We propose a transient-removal module that fully leverages spatial contextual information to generate a 2D visibility map that excludes transient occluders.
Extensive experimental results demonstrate that our method outperforms the baselines both qualitatively and quantitatively, and significantly reduces both training and rendering overhead.
2 Related Work
In this section, we briefly review recent work in several related fields, including NeRF, neural rendering in the wild, and 3D Gaussian Splatting.
2.1 Neural Radiance Fields
Neural Radiance Fields (NeRF) [3] have consistently attracted considerable attention from both academia and industry, owing to their unparalleled capability of producing photorealistic renderings of novel views. NeRF represents a scene as an implicit multilayer perceptron (MLP), which maps spatial points to their corresponding radiance and volume density. By applying volumetric rendering to the volume density and radiance, the final rendered output is obtained. Since the volumetric rendering process is differentiable, it allows for the optimization of the implicit MLP that characterizes the scene.
Many subsequent works have made significant improvements to the original NeRF in various aspects. Mip-NeRF [5] and Mip-NeRF 360 [6] enhance NeRF's positional encoding by transitioning from Positional Encoding (PE) to Integrated Positional Encoding (IPE), effectively mitigating NeRF's aliasing problems. Plenoxels [13] presents the first scene representation that is entirely free of neural networks, providing a fully explicit formulation. Concurrently, Instant NGP [8] employs hash encoding and hybrid representations, which raise the training and rendering speeds to more than a thousand times those of the original NeRF. Tri-MipRF [7] integrates feature planes into NeRF, replacing the hash feature encoding used in Instant NGP [8]. This approach allows for the query of average features within a specified range, similar to the method employed by Mip-NeRF [5], thereby enhancing speed while ensuring high-quality rendering outcomes. Further innovations [14] introduce normal vectors in the spirit of traditional rendering, augmenting NeRF's proficiency in rendering highlights and reflective light. Furthermore, the integration of the time dimension in [15-17] extends NeRF's applicability to dynamic scene reconstruction. [18, 19] eliminate the requirement for camera poses in the input and jointly optimize both the radiance fields and the camera poses. [20-22] investigate high-quality neural radiance field reconstruction with sparse input views.
2.2 Neural Rendering in the Wild
Numerous efforts have extended NeRF to unconstrained photo collections. There are two main challenges: (1) how to adapt to the varying appearance conditions of different images, and (2) how to remove transient objects from the images, retaining only the static scene. The foundational work, NeRF-W [9], attributes an appearance embedding to each wild image, effectively capturing the unique appearance features of disparate images, while simultaneously reconstructing a static and a transient radiance field to separate static and transient scenes. Subsequent work, Ha-NeRF [10], employs a feature extractor to derive appearance embeddings for each image, achieving a certain level of generalization that removes the need to train appearance embeddings with half of the test set images. It also moves away from separately reconstructing transient and static scenes, instead assigning a transient embedding to each wild image to obtain a corresponding 2D image-dependent visibility map that eliminates transient objects. CR-NeRF [11] further refines the process of extracting image appearance embeddings, rendering multiple rays concurrently to capitalize on global information more efficiently. UP-NeRF [23] explores the reconstruction of in-the-wild images under a camera-pose-free condition. It utilizes image feature maps as proxies and introduces a depth map loss to enhance the training of camera poses. However, these NeRF-based methods all grapple with long training and rendering times and insufficient detail reconstruction, making them challenging to apply in real-world applications.
2.3 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) [4] introduces an innovative approach for novel view synthesis and scene reconstruction, offering a distinct alternative to Neural Radiance Fields (NeRF) [3]. This method represents scenes using a collection of anisotropic Gaussians and leverages rasterization-based rendering to project these Gaussians onto the image plane, achieving real-time rendering speed. By directly optimizing the properties of the 3D Gaussians and abandoning neural networks, it significantly accelerates training. Due to its fully explicit representation, it surpasses NeRF in terms of usability, scalability, and applicability to large-scale scenes.
Various subsequent works have extended 3DGS in many aspects. Investigations such as [24-26] explore dynamic scene representations within 3DGS by integrating a temporal dimension. Commencing with a static Gaussian ensemble, time is utilized as an input to compute Gaussian displacements, which are subsequently applied to the original Gaussians to determine their final positions for rendering. Inspired by Mip-NeRF [5], Mip-Splatting [27] addresses the aliasing issues in 3DGS by controlling the sizes of Gaussians and employing 2D Mip filters. Additionally, [28-30] implement more efficient Gaussian representations and compact scene representations, aiming to diminish the storage demands of 3DGS. [31, 32] use a Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Further innovations [33, 34] respectively utilize anisotropic spherical Gaussians (ASG) and incorporate mirror attributes into 3DGS to enhance its proficiency in rendering highlights and reflective lights. [35] no longer requires COLMAP preprocessing to perform multi-view stereo reconstruction, achieving high-quality reconstruction without camera poses.
Our work aims to extend 3DGS to unconstrained photo collections. Unlike previous methods [9-11, 23] based on NeRF, our method is built upon 3DGS, achieving reconstruction speeds at the minute level and real-time rendering capabilities, thereby facilitating its application in real-world scenarios. Compared with the concurrent work [12], our method does not require additional training of Gaussians to represent transient objects, making it more concise and efficient.
Fig. 2 An illustration of the overall pipeline of our approach. We extract appearance features $\boldsymbol{V}_{i}^{a}$ from a given image $I_{i}$ using an appearance extractor. $\boldsymbol{V}_{i}^{a}$ is then integrated with the Gaussians' inherent color features $\boldsymbol{V}^{g}$ and directional vectors $\gamma(\boldsymbol{d})$ and fed into a lightweight color decoder to obtain the final colors for the Gaussians. Subsequently, these Gaussians are splatted onto the image plane for rendering. Meanwhile, a transient UNet [36] generates the 2D visibility map $M_{i}$ from $I_{i}$, which assists the model in mitigating the interference caused by transient noise during the training process.
3 Method
In this section, we will first give the definition of unconstrained reconstruction task in Section 3.1. Next, we will provide an overview of the 3DGS [4] method and its limitations in Section 3.2. Then we will introduce the overall pipeline and two modules of our method. First, we employ an appearance extractor to obtain a latent appearance feature from the original in-the-wild image, thereby addressing the continuously changing appearance (Section 3.3). Furthermore, we design a transient object handling module to remove transient objects present in the photo collection (Section 3.4). The overall method pipeline is illustrated in Fig. 2.
3.1 Task Definition
Given a collection of in-the-wild photos with different appearance conditions and transient objects, along with their corresponding camera intrinsic and extrinsic parameters, the unconstrained reconstruction task aims to reconstruct a scene that can adapt to diverse appearance conditions of different samples and remove transient objects. Once the scene has been reconstructed, we are able to render images with diverse appearances from any new viewpoint.
3.2 Preliminaries
Within the context of 3D Gaussian Splatting, objects are represented by a collection of anisotropic three-dimensional Gaussians, each characterized by a set of learnable parameters as follows:
3D center point: $\boldsymbol{\mu} \in \mathbb{R}^{3}$;
3D rotation represented by a quaternion: $\boldsymbol{q} \in \mathbb{R}^{4}$;
3D size (scaling factor): $\boldsymbol{s} \in \mathbb{R}^{3}$;
spherical harmonics coefficients (degrees of freedom: $k$) for view-dependent RGB color: $\mathbf{sh} \in \mathbb{R}^{3(k+1)^{2}} \rightarrow \boldsymbol{c} \in \mathbb{R}^{3}$;
The covariance matrix of each Gaussian is factorized as $\boldsymbol{\Sigma}=\boldsymbol{R} \boldsymbol{S} \boldsymbol{S}^{T} \boldsymbol{R}^{T}$, where the scaling matrix $\boldsymbol{S}$ is derived from $\boldsymbol{s} \in \mathbb{R}^{3}$ and the rotation matrix $\boldsymbol{R}$ is obtained from $\boldsymbol{q} \in \mathbb{R}^{4}$.
Then, we can define each Gaussian with the center point $\boldsymbol{\mu} \in \mathbb{R}^{3}$ and covariance matrix $\boldsymbol{\Sigma}$ as $G(\boldsymbol{x})=\exp \left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)$.
Subsequently, the 3D Gaussians need to be projected onto the image plane. According to [37], given a camera view matrix $\boldsymbol{W}$, the projected covariance matrix $\boldsymbol{\Sigma}^{\prime}$ is given as follows:
$\boldsymbol{\Sigma}^{\prime}=\boldsymbol{J} \boldsymbol{W} \boldsymbol{\Sigma} \boldsymbol{W}^{T} \boldsymbol{J}^{T},$
where $\boldsymbol{J}$ is the Jacobian of the affine approximation of the camera projective transformation.
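To make the preliminaries concrete, the following is a minimal PyTorch sketch of how the 3D covariance $\boldsymbol{\Sigma}=\boldsymbol{R}\boldsymbol{S}\boldsymbol{S}^{T}\boldsymbol{R}^{T}$ can be assembled from the learned quaternion and scale and then projected with the local affine approximation above. It is illustrative only; the tensor layouts and function names are our own rather than those of the official 3DGS code.

```python
import torch


def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternions (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    q = torch.nn.functional.normalize(q, dim=-1)
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)


def covariance_3d(q: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T, with S the diagonal scaling matrix built from s (N, 3)."""
    R = quat_to_rotmat(q)
    S = torch.diag_embed(s)
    M = R @ S
    return M @ M.transpose(-1, -2)


def project_covariance(cov3d: torch.Tensor, J: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Sigma' = J W Sigma W^T J^T for per-Gaussian Jacobians J (N, 2, 3) and a shared view matrix W (3, 3)."""
    T = J @ W                                  # (N, 2, 3)
    return T @ cov3d @ T.transpose(-1, -2)     # (N, 2, 2)
```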
Each Gaussian additionally possesses two extra attributes: opacity $\alpha$ and color $\boldsymbol{c}$ (represented using spherical harmonics coefficients for view dependency). Projecting the 3D Gaussians onto the image plane yields the corresponding 2D Gaussians. For each pixel, which may be covered by multiple Gaussians, the final pixel color is determined through a fast differentiable $\alpha$-blending process after sorting the Gaussians by depth:
$C=\sum_{i \in N} c_{i} \alpha_{i}^{\prime} \prod_{j=1}^{i-1}\left(1-\alpha_{j}^{\prime}\right),$
where the final opacity $\alpha_{i}^{\prime}$ is the product of the learned opacity $\alpha_{i}$ of the Gaussian and its evaluated 2D density, $\alpha_{i}^{\prime}=\alpha_{i} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}^{\prime}-\boldsymbol{\mu}_{i}^{\prime}\right)^{T} \boldsymbol{\Sigma}^{\prime-1}\left(\boldsymbol{x}^{\prime}-\boldsymbol{\mu}_{i}^{\prime}\right)\right)$, where $\boldsymbol{x}^{\prime}$ and $\boldsymbol{\mu}_{i}^{\prime}$ are coordinates in the projected space.
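The compositing above is performed inside the CUDA tile rasterizer of 3DGS; the following illustrative sketch evaluates the same front-to-back $\alpha$-blending for a single pixel given its depth-sorted Gaussians (function and variable names are ours).

```python
import torch


def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha blending of N depth-sorted Gaussians covering one pixel.

    colors: (N, 3) emitted colors c_i; alphas: (N,) effective opacities alpha'_i.
    Implements C = sum_i c_i * alpha'_i * prod_{j<i} (1 - alpha'_j).
    """
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0
    )  # prod_{j<i} (1 - alpha'_j)
    weights = alphas * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)
```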
Although 3D Gaussian Splatting can be optimized from a set of randomly initialized three-dimensional Gaussians, this approach often yields suboptimal results. Utilizing Structure from Motion (SfM) techniques, a sparse point cloud of the scene to be reconstructed can be obtained at the outset; employing this point cloud to initialize the 3D Gaussians with prior knowledge leads to better results. Besides, 3DGS also controls the Gaussians through splitting, duplicating, and deleting based on the magnitude of gradients received during backpropagation. For a more detailed exposition of the process, one may refer to the survey [38].
3.3 Appearance Modeling
To enhance the adaptability of 3DGS to the varying lighting conditions and post-processing effects present in diverse wild images, we introduce several modifications. These modifications are inspired by the insight that rendering outcomes are influenced both by external appearance conditions, such as lighting, and by the intrinsic attributes of the objects themselves. Consequently, we replace the spherical harmonics coefficients, originally utilized for color representation in 3DGS, with a learnable implicit feature vector, denoted as $\boldsymbol{V}^{g}$. This vector, with length $n_{g}$, is designed to adaptively capture and represent the properties of each Gaussian. Concurrently, we utilize a CNN-based appearance extractor $\mathrm{E}_{\Phi}$ to obtain a latent appearance vector $\boldsymbol{V}_{i}^{a}$ of length $n_{a}$ for each image, serving as the current appearance condition.
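As a reference, a possible layout of the appearance extractor $\mathrm{E}_{\Phi}$ is sketched below, following the Ha-NeRF-style design described in Section 4.1.4 (five convolutional layers, one average pooling layer, and a fully connected layer). The channel widths are our own assumption and may differ from the actual implementation.

```python
import torch
import torch.nn as nn


class AppearanceExtractor(nn.Module):
    """CNN that maps an in-the-wild image I_i to a global appearance vector V_i^a of length n_a."""

    def __init__(self, n_a: int = 48, width: int = 32):
        super().__init__()
        chans = [3, width, width * 2, width * 4, width * 4, width * 8]
        self.convs = nn.Sequential(*[
            layer
            for c_in, c_out in zip(chans[:-1], chans[1:])
            for layer in (nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(chans[-1], n_a)   # final appearance embedding

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> appearance vector (B, n_a)
        feat = self.pool(self.convs(image)).flatten(1)
        return self.fc(feat)
```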
Furthermore, to replace the view-dependent effects represented by spherical harmonics coefficients in the original 3DGS, we calculate a direction vector, denoted as $\boldsymbol{d}$, based on the position of each Gaussian and the camera's position. We apply positional encoding to this direction vector $\boldsymbol{d}$ using spherical harmonics, denoted by $\gamma(\boldsymbol{d})$. The rationale for this choice lies in the fact that frequency-space encoding is better suited for directional vectors than component-wise encodings. Subsequently, we concatenate the implicit color vector $\boldsymbol{V}^{g}$ of the Gaussians, the image appearance latent vector $\boldsymbol{V}_{i}^{a}$, and the directional encoding $\gamma(\boldsymbol{d})$, and feed them into an MLP-based color decoder $\mathrm{F}_{\theta}$ to predict the color of the Gaussians: $\boldsymbol{c}=\mathrm{F}_{\theta}\left(\boldsymbol{V}^{g}, \boldsymbol{V}_{i}^{a}, \gamma(\boldsymbol{d})\right)$.
We do not require a large-scale MLP: because our Gaussians inherently store color features, the decoding burden on $\mathrm{F}_{\theta}$ is not substantial. Therefore, a lightweight MLP is sufficient for decoding; in our implementation, we use 4 layers of 64 hidden units. The predicted colors $\mathbf{c}$ are used as the final emitted colors of the Gaussians. After obtaining all Gaussians' colors and the camera parameters, we can use the differentiable tile rasterizer mentioned in Eq. (4) to render the scene with the appearance of the input image.
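A minimal sketch of such a lightweight color decoder $\mathrm{F}_{\theta}$ is shown below, assuming a degree-4 spherical harmonics encoding of the direction (25 basis values) and the feature lengths $n_{g}=24$ and $n_{a}=48$ reported in Section 4.1.4; the exact layer arrangement is our assumption.

```python
import torch
import torch.nn as nn


class ColorDecoder(nn.Module):
    """MLP F_theta: per-Gaussian feature V^g, image appearance V_i^a, and encoded direction gamma(d) -> RGB."""

    def __init__(self, n_g: int = 24, n_a: int = 48, dir_dim: int = 25, hidden: int = 64):
        super().__init__()
        in_dim = n_g + n_a + dir_dim  # dir_dim = (deg + 1)^2 = 25 for a degree-4 SH encoding
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # colors in [0, 1]
        )

    def forward(self, v_g: torch.Tensor, v_a: torch.Tensor, dir_enc: torch.Tensor) -> torch.Tensor:
        # v_g: (N, n_g) per-Gaussian features, v_a: (n_a,) appearance vector shared by the image,
        # dir_enc: (N, dir_dim) SH encoding of the Gaussian-to-camera directions.
        v_a = v_a.expand(v_g.shape[0], -1)
        return self.mlp(torch.cat([v_g, v_a, dir_enc], dim=-1))
```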
3.4 Transient Objects Handling
Since our objective is to reconstruct static scenes, we argue that approaches like NeRF-W [9] and SWAG [12], which also model transient scenes, do redundant work that increases computational overhead. Therefore, we propose to create a 2D visibility map to exclude transient objects. Technically, we employ a transient UNet denoted by $\mathrm{U}_{\Psi}$ to obtain the final 2D visibility map $M_{i}$ for distinguishing transient and static objects in image $I_{i}$. This strategy is facilitated by the high-speed rendering of 3DGS, which allows us to render the entire image in a single forward pass, whereas NeRF-based methods operate on a pixel-wise basis. Leveraging a UNet enables us to make full use of the intrinsic information present in the images, resulting in a more accurate distinction of transient parts. The whole process can be formulated as $M_{i}=\mathrm{U}_{\Psi}\left(I_{i}\right)$.
To accelerate inference as much as possible, our UNet adopts one downsampling and one upsampling layer, and finally outputs a binary classification map of the same size as the input image $I_{i}$. This map assigns values from 0 to 1, where values closer to 1 indicate higher confidence that the current pixel corresponds to a static object. Note that we employ a self-supervised learning approach, wherein the model autonomously identifies pixels that are challenging to train consistently across different images as transient objects unique to each image, without the need for additional segmentation priors for supervision. Additionally, to prevent the trivial solution in which all pixels are marked as transient, we also employ a regularization loss on the 2D visibility map.
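The following is a minimal sketch of such a transient network $\mathrm{U}_{\Psi}$ with a single downsampling and a single upsampling stage and a sigmoid head producing the visibility map in $[0,1]$; the channel counts and the skip connection are our own assumptions.

```python
import torch
import torch.nn as nn


class TransientUNet(nn.Module):
    """Tiny UNet U_Psi: image I_i (B, 3, H, W) -> 2D visibility map M_i (B, 1, H, W), 1 = static."""

    def __init__(self, width: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(nn.ConvTranspose2d(width * 2, width, kernel_size=2, stride=2), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(width * 2, 1, 3, padding=1)  # takes the skip-concatenated features

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # assumes H and W are even so the upsampling path restores the original resolution
        skip = self.enc(image)                 # (B, w, H, W)
        x = self.up(self.down(skip))           # (B, w, H, W)
        x = torch.cat([x, skip], dim=1)        # skip connection
        return torch.sigmoid(self.head(x))     # visibility map in [0, 1]
```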
The obtained $M_{i}$ serves as a mask to eliminate the involvement of pixels representing transient objects during optimization. Specifically, when calculating the reconstruction loss, the image is softly masked by $M_{i}$ to suppress the influence of transient objects.
3.5 Optimization
Similar to 3DGS [4], we use an L1 loss and an SSIM loss to regulate the similarity between predicted and ground-truth images. Both losses are modulated by our 2D visibility map $M_{i}$ to prevent modeling transient objects. The full loss combines the masked L1 and SSIM terms, balanced by the weight $\lambda_{1}$, with the visibility-map regularization weighted by $\lambda_{m}$.
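A sketch of how these terms might be combined is given below. We assume the masked L1 and SSIM terms are balanced by $\lambda_{1}$ and the regularization by $\lambda_{m}$; the specific regularizer shown, $(1-M)^{2}$, is a placeholder for the paper's regularization term, and the exact form of Eq. (10) may differ.

```python
import torch


def masked_reconstruction_loss(pred: torch.Tensor, gt: torch.Tensor, M: torch.Tensor,
                               ssim_fn, lambda_1: float = 0.8, lambda_m: float = 0.5) -> torch.Tensor:
    """Sketch of the AAGS training objective (the exact weighting of Eq. (10) may differ).

    pred, gt: rendered and ground-truth images (3, H, W); M: visibility map (1, H, W), 1 = static.
    ssim_fn: any differentiable SSIM implementation taking two images and returning a scalar.
    """
    l1 = (M * (pred - gt).abs()).mean()          # transient pixels are softly masked out
    d_ssim = 1.0 - ssim_fn(M * pred, M * gt)     # SSIM term on the masked images
    reg = ((1.0 - M) ** 2).mean()                # discourages marking everything as transient
    return lambda_1 * l1 + (1.0 - lambda_1) * d_ssim + lambda_m * reg
```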
4 Experiments
In this section, we discuss the various details of our experiments, including the datasets, baselines, evaluation metrics, and implementation details in Section 4.1. Comparisons with previous in-the-wild reconstruction work are presented in Section 4.2, while ablation studies are detailed in Section 4.3. Additionally, we offer further insights into appearance control in Section 4.4.
Fig. 3 Qualitative comparison results of NeRF-based methods (NeRF-W [9], Ha-NeRF [10], CR-NeRF [11]) and our method on the Phototourism dataset. Our method can better capture lighting information and building details (e.g., highlights of the bronze horse and signpost in Brandenburg Gate, sunshine on stone in Trevi Fountain, and sunlight on the castle and the long spear in the hand of a knight sculpture in Sacré-Cœur).
4.1 Experimental Settings
4.1.1 Dataset
Similar to previous work, we use the Phototourism dataset [39], an internet photo collection of cultural landmarks, as our primary dataset for this study. We use COLMAP [40] to obtain camera parameters and initial point clouds for three specific scenes within the dataset ("Brandenburg Gate", "Sacré-Cœur", and "Trevi Fountain"). Meanwhile, to demonstrate the robustness of our method, we also evaluate our method on the NeRF-OSR dataset [41]. We choose 4 sites for experimentation (europa, lwp, st, and stjohann), and 1/8 of the images in each site are selected as the test set. During the training phase, all images are downsampled by a factor of 2.
4.1.2 Baselines
We primarily compare the rendering quality, training time, and rendering speed of our method with NeRF-based methods (NeRF-W [9], Ha-NeRF [10], CR-NeRF [11]) and concurrent 3DGS-based methods (the original 3DGS [4] and SWAG [12]) to demonstrate the superiority of our method.
4.1.3 Metrics
Rendering quality is assessed using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [42], and Learned Perceptual Image Patch Similarity (LPIPS) [43]. It should be noted that the LPIPS scores reported by the NeRF-based and 3DGS-based methods use different backbones. For a fair comparison, we use AlexNet as the backbone when comparing with the NeRF-based methods, while VGG is used when comparing with the 3DGS-based methods.
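For reference, the sketch below shows how PSNR and the two LPIPS variants could be evaluated, assuming the standard lpips Python package; it is not part of the paper's released code.

```python
import torch
import lpips  # pip install lpips


def psnr(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """PSNR for images in [0, 1]: 10 * log10(1 / MSE)."""
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)


# Different backbones depending on which family of baselines is compared against.
lpips_alex = lpips.LPIPS(net='alex')  # for comparisons with NeRF-based methods
lpips_vgg = lpips.LPIPS(net='vgg')    # for comparisons with 3DGS-based methods

# lpips expects inputs scaled to [-1, 1] with shape (B, 3, H, W), e.g.:
# score = lpips_vgg(pred.unsqueeze(0) * 2 - 1, gt.unsqueeze(0) * 2 - 1)
```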
4.1.4 Implementation Details
Fig. 5 Qualitative comparison results of 3DGS-based methods (3DGS [4], SWAG [12]) and our method on the Phototourism dataset. Since SWAG is not open-sourced, we have selected a subset of results from its publication for comparison. Our rendering results are softer, aligning more closely with human vision.

Our implementation is based on the official implementation of 3DGS [4], which uses PyTorch [44] and the PyTorch CUDA extension. Both the training and testing phases are conducted on a single RTX 3090 GPU with 24 GB of memory. The parameters we train include $\mathrm{E}_{\Phi}$, $\mathrm{F}_{\theta}$, $\mathrm{U}_{\Psi}$, and the properties of the Gaussians themselves. Our full loss is defined in Eq. (10). The additional hyperparameters are $\lambda_{1}$ and the image feature lengths $n_{a}$ and $n_{g}$, for which we use 0.8, 48, and 24, respectively. For the parameter $\lambda_{m}$, we use exponentially annealed weights, starting at 0.5 at the beginning of training and gradually decreasing to 0.15 as iterations proceed. We utilize spherical harmonics encoding of degree 4 for encoding directional vectors. For the appearance extractor $\mathrm{E}_{\Phi}$, our implementation is consistent with Ha-NeRF [10], consisting of five convolutional layers, one average pooling layer, and a fully connected layer. For $\mathrm{U}_{\Psi}$, we employ one downsampling and one upsampling layer. $\mathrm{F}_{\theta}$ consists of 4 fully connected layers, each with a width of 64. The training process closely follows the 3DGS approach, with a total of 30,000 iterations. Notably, after 15,000 iterations, we no longer employ the duplication, splitting, and deletion of Gaussians, thus maintaining a constant number of Gaussians.
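One way to realize the exponential annealing of $\lambda_{m}$ from 0.5 to 0.15 over the 30,000 iterations is sketched below; the paper does not specify the schedule beyond its endpoints, so the exact interpolation is our assumption.

```python
import math


def lambda_m_schedule(step: int, total_steps: int = 30_000,
                      start: float = 0.5, end: float = 0.15) -> float:
    """Exponentially anneal the visibility-map regularization weight from `start` to `end`."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start * math.exp(t * math.log(end / start))  # equals start at t=0 and end at t=1


# Example: lambda_m_schedule(0) -> 0.5, lambda_m_schedule(30_000) -> 0.15
```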
4.2 Comparison with Existing Methods
4.2.1 Rendering Results
Table 1 and Fig. 3, together with Table 2 and Fig. 4, show the quantitative and qualitative results on the Phototourism dataset and the NeRF-OSR dataset, respectively, and demonstrate the comprehensive superiority of our method over NeRF-based methods. Although NeRF-W, Ha-NeRF, and CR-NeRF use additional embeddings to represent the appearance of different images and take measures to remove transient objects, they are unable to accurately capture the details of buildings and certain highlights (e.g., highlights of the bronze horse and signpost in Brandenburg Gate, sunshine on stone in Trevi Fountain, and sunlight on the castle and on the long spear in the hand of a knight sculpture in Sacré-Cœur), thus leading to lower fidelity in reproducing fine details compared to our method. Thanks to our Gaussian-intrinsic features and the appearance features extracted from diverse images, our method outperforms all NeRF-based methods across all datasets and all metrics, additionally yielding superior visual effects.
Table 1 Quantitative results on Phototourism dataset.