AAGS: Appearance-Aware 3D Gaussian Splatting with Unconstrained Photo Collections
Wencong Zhang$^{1}$, Zhiyang Guo$^{2}$, Wengang Zhou$^{1}$, Houqiang Li$^{1}$
$^{1*}$CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China, Fuxing Street, Hefei, 230052, Anhui, China.
Reconstructing 3D scenes from unconstrained collections of in-the-wild photographs has consistently been a challenging problem. The main difficulty lies in the different appearance conditions and transient occluders of uncontrolled image samples. With the advancement of Neural Radiance Fields (NeRF), previous works have developed some effective strategies to tackle this issue. However, limited by deep networks and volumetric rendering techniques, these methods generally require substantial time costs. Recently, the advent of 3D Gaussian Splatting (3DGS) has significantly accelerated the training and rendering speed of 3D reconstruction tasks. Nevertheless, vanilla 3DGS struggles to distinguish the varying appearances of in-the-wild photo collections. To address the aforementioned problems, we propose Appearance-Aware 3D Gaussian Splatting (AAGS), a novel extension of 3DGS to unconstrained photo collections. Specifically, we employ an appearance extractor to capture global features for image samples, enabling the distinction of visual conditions, e.g., illumination and weather, across different observations. Furthermore, to mitigate the impact of transient occluders, we design a transient-removal module that adaptively learns a 2D visibility map to decompose the static target from complex real-world scenes. Extensive experiments are conducted to validate the effectiveness and superiority of our AAGS. Compared with previous works, our method not only achieves better reconstruction and rendering quality, but also significantly reduces both training and rendering overhead. Code will be released at https://github.com/Zhang-WenCong/AAGS.
Fig. 1 Using wild photos as input (a), our method is able to capture different appearances from diverse observations, render in-the-wild scenes from novel viewpoints, and effectively remove the impact of transient occluders (b).
1 Introduction
The tasks of novel view synthesis and 3D reconstruction have consistently been at the forefront of research in computer graphics and computer vision [1, 2]. Since the first introduction of Neural Radiance Fields [3] (NeRF), this implicit 3D representation has exhibited promising results in various tasks by integrating traditional volumetric rendering techniques with advanced deep learning methodologies. Utilizing dozens of photographs from different viewpoints along with their corresponding camera parameters, NeRF trains a neural network to represent the 3D scene, achieving groundbreaking performance in photo-realistic novel view synthesis tasks. Recently, 3D Gaussian Splatting [4] (3DGS) has garnered widespread attention due to its efficient reconstruction capabilities compared with NeRF. Employing explicit representations alongside a rasterization-based rendering process, 3DGS achieves real-time rendering speeds while maintaining rendering quality comparable to state-of-the-art NeRF methods [5-8].
Typically, both NeRF [3] and 3DGS [4] primarily focus on the reconstruction of static scenes. In other words, the input photo collections are assumed to be free of appearance variation and transient objects. However, for in-the-wild scenes such as scenic spots, the available photos are usually captured over a large time span, from days up to even years. Therefore, it is difficult to control the capturing conditions, resulting in diverse seasons, weather, illumination, and other appearance variances across photos. Even photos taken at the same location and time can have diverse appearances due to differences in device settings, e.g., exposure time, filters, and tone-mapping. Worse still, transient objects, e.g., pedestrians, tourists, and vehicles, become intractable obstacles within the reconstruction process. If we directly apply vanilla
NeRF or 3DGS to the aforementioned photo collections, the reconstructed scene will be filled with artifacts and floaters. This is exactly what the unconstrained reconstruction task tries to wrestle with: how to reconstruct a scene with different appearance conditions from unconstrained photo collections. Addressing this problem typically confronts two major challenges: different appearances across images and potential interference from transient occluders.
Several pioneers have attempted to tackle these challenges using various strategies. NeRF-W [9] assigns an appearance embedding to each image and adds transient confidence to the radiance field, thereby separately modeling static and transient objects with appearance awareness. Subsequent efforts, such as Ha-NeRF [10] and CR-NeRF [11], leverage image features more effectively to obtain better appearance embeddings and employ 2D visibility maps to exclude transient objects. However, these methods are limited by NeRF's inherent rendering quality and efficiency, requiring dozens of hours to train and tens of seconds to render a single image. A recent concurrent 3DGS-based method, SWAG [12], significantly accelerates training and rendering compared to previous NeRF-based methods. However, it still uses the per-image embedding approach of NeRF-W [9] and thus encounters a similar issue: it requires half of the test set to train the appearance embeddings for those images.
To address the aforementioned issues, in this work, we propose Appearance-Aware 3D Gaussian Splatting (AAGS), a novel method based on 3DGS for reconstructing scenes from unconstrained photo collections. Specifically, we notice that the visual observation is determined by the physical properties of the object itself and the external lighting conditions. Therefore, we assign a specific feature vector (instead of a color as in vanilla 3DGS) as a new intrinsic property to each Gaussian and meanwhile employ an appearance extractor to obtain global features from different images. These appearance features, together with the intrinsic property of the Gaussians, are processed by a color decoder to obtain the final per-Gaussian emitted color. Furthermore, since our objective is to render the static scene, the additional reconstruction of transient objects is an unnecessary consumption of resources. Therefore, we utilize a transient UNet to adaptively generate a 2D visibility map that helps exclude transient occluders during optimization. In contrast to previous NeRF-based approaches that individually query pixel visibility, our method directly performs prediction on the entire image, enabling more effective utilization of spatial contextual information and enhancing the accuracy of segmenting transient and static objects.
In summary, our contributions are as follows:
We propose a novel method AAGS, which effectively extracts global appearance features from diverse unconstrained images and integrates them with the per-Gaussian intrinsic property to render in-the-wild scenes under varying appearances.
We propose a transient-removal module that fully leverages spatial contextual information to generate a 2D visibility map that excludes transient occluders.
Extensive experimental results demonstrate that our method outperforms the baselines both qualitatively and quantitatively, and significantly reduces both training and rendering overhead.
2 Related Work
In this section, we briefly review recent work in several related fields, including NeRF, neural rendering in the wild, and 3D Gaussian Splatting.
2.1 Neural Radiance Fields
Neural Radiance Fields [3] (NeRF) consistently attract considerable attention from both academia and industry, owing to their unparalleled capability in producing photorealistic renderings of novel views. NeRF represents a scene as an implicit multilayer perceptron (MLP), which maps spatial points to their corresponding radiance and volume density. By applying volumetric rendering to the volume density and radiance, the final rendered output is obtained. Since the volumetric rendering process is differentiable, it allows for the optimization of the implicit MLP that characterizes the scene.
Many subsequent works have made significant improvements to the original NeRF in various aspects. Mip-NeRF [5] and Mip-NeRF 360 [6] enhance NeRF's positional encoding by transitioning from Positional Encoding (PE) to Integrated Positional Encoding (IPE), effectively mitigating NeRF's aliasing challenges. Plenoxels [13] presents the first scene representation that is entirely free of neural networks, providing a fully explicit formulation. Concurrently, Instant NGP [8] employs hash encoding and hybrid representations, which accelerate training and rendering to more than a thousand times the speed of the original NeRF. TriMip-NeRF [7] integrates feature planes into NeRF, replacing the hash feature encoding used in Instant NGP [8]. This approach allows for the query of average features within a specified range, similar to the method employed by Mip-NeRF [5], thereby enhancing speed while ensuring high-quality rendering outcomes. Further innovations [14] integrate traditional rendering of highlights by establishing normal vectors, augmenting NeRF's proficiency in rendering highlights and reflective light. Furthermore, the integration of the time dimension in [15-17] extends NeRF's applicability to dynamic scene reconstruction. [18, 19] eliminate the requirement for camera poses in the input and jointly optimize both the radiance fields and camera poses. [20-22] investigate high-quality neural radiance field reconstruction with sparse input views.
2.2 Neural Rendering in the Wild
Numerous efforts have extended NeRF to unconstrained photo collections. There are two main challenges here: 1. how to adapt to the varying appearance conditions of different images; 2. how to remove transient objects from the images, retaining only the static scene. The foundational work, NeRF-W [9], innovates by attributing an appearance embedding to each wild image, effectively capturing the unique appearance features of disparate images, while simultaneously reconstructing a static and a transient radiance field to separate static and transient scenes. Subsequent work, Ha-NeRF [10], employs a feature extractor to derive appearance embeddings for each image, achieving a certain level of generalization that removes the need to train appearance embeddings with half of the test set images. It also moves away from separately reconstructing transient and static scenes, instead assigning a transient embedding
to each wild image to obtain the corresponding 2D image-dependent visibility map that eliminates transient objects. CR-NeRF [11] further refines the process of extracting image appearance embeddings, facilitating the rendering of multiple rays concurrently to capitalize on global information more efficiently. UP-NeRF [23] explores the reconstruction of in-the-wild images without known camera poses. It utilizes image feature maps as proxies and introduces a depth map loss to enhance the training of camera poses. However, these NeRF-based methods all grapple with long training and rendering times and insufficient detail reconstruction, making them challenging to apply in real-world applications.
2.3 3D Gaussian Splatting
3D Gaussian Splatting [4] (3DGS) introduces an innovative approach for novel view synthesis and scene reconstruction, offering a distinct alternative to Neural Radiance Fields [3] (NeRF). This method represents scenes using a collection of anisotropic Gaussians and leverages rasterization rendering techniques to project these Gaussians onto the image plane for rendering, achieving real-time rendering speed. By directly optimizing the properties of 3D Gaussians and abandoning neural networks, the training speed is significantly accelerated. Due to the adoption of a fully explicit representation, it surpasses NeRF in terms of usability, scalability, and applicability to large-scale scenes.
Various subsequent works have extended 3DGS in many aspects. Investigations such as [24-26] explore dynamic scene representations within 3DGS by integrating a temporal dimension: commencing with a static Gaussian ensemble, time is utilized as an input to compute Gaussian displacements, which are subsequently applied to the original Gaussians to determine their final positions for rendering. Inspired by Mip-NeRF [5], Mip-Splatting [27] addresses the aliasing issues in 3DGS by controlling the sizes of Gaussians and employing 2D Mip filters. Additionally, [28-30] implement more efficient Gaussian representations and compact scene representations, aiming to diminish the storage demands of 3DGS. [31, 32] use a Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Further innovations [33, 34] respectively utilize an anisotropic spherical Gaussian (ASG) and incorporate mirror attributes into 3DGS to enhance its proficiency in rendering highlights and reflective light. [35] no longer requires COLMAP preprocessing to perform multi-view stereo reconstruction, achieving high-quality reconstruction without camera poses.
Our work aims to extend 3DGS to unconstrained photo collections. Unlike previous methods [9-11, 23] based on NeRF, our method is built upon 3DGS, achieving reconstruction speeds at the minute level and real-time rendering capabilities, thereby facilitating its application in real-world scenarios. Compared with the concurrent work [12], our method does not require additional training of Gaussians to represent transient objects, making it more concise and efficient.
Fig. 2 An illustration of the overall pipeline of our approach. We extract appearance features $\boldsymbol{V}_{i}^{a}$ from a given image $I_{i}$ using an appearance extractor. $\boldsymbol{V}_{i}^{a}$ is then integrated with the Gaussians' inherent color features $\boldsymbol{V}^{g}$ and directional encodings $\gamma(\boldsymbol{d})$ and fed into a lightweight color decoder to obtain the final colors of the Gaussians. Subsequently, these Gaussians are splatted onto the image plane for rendering. Meanwhile, a transient UNet [36] generates the 2D visibility map $M_{i}$ from $I_{i}$, which assists the model in mitigating the interference caused by transient noise during the training process.
3 Method
In this section, we first give the definition of the unconstrained reconstruction task in Section 3.1. Next, we provide an overview of the 3DGS [4] method and its limitations in Section 3.2. Then we introduce the overall pipeline and the two modules of our method. First, we employ an appearance extractor to obtain a latent appearance feature from the original in-the-wild image, thereby addressing the continuously changing appearance (Section 3.3). Furthermore, we design a transient object handling module to remove transient objects present in the photo collection (Section 3.4). The overall method pipeline is illustrated in Fig. 2.
3.1 Task Definition
Given a collection of in-the-wild photos with different appearance conditions and transient objects, along with their corresponding camera intrinsic and extrinsic parameters, the unconstrained reconstruction task aims to reconstruct a scene that can adapt to diverse appearance conditions of different samples and remove transient objects. Once the scene has been reconstructed, we are able to render images with diverse appearances from any new viewpoint.
3.2 Preliminaries
Within the context of 3D Gaussian Splatting, objects are represented by a collection of anisotropic three-dimensional Gaussians, each characterized by a set of learnable parameters as follows:
3D center point: $\boldsymbol{\mu} \in \mathbb{R}^{3}$;
3D rotation represented by a quaternion: $\boldsymbol{q} \in \mathbb{R}^{4}$;
3D size (scaling factor): $\boldsymbol{s} \in \mathbb{R}^{3}$;
spherical harmonics coefficients (degrees of freedom: $k$) for view-dependent RGB color: $\mathbf{sh} \in \mathbb{R}^{3(k+1)^{2}} \rightarrow \boldsymbol{c} \in \mathbb{R}^{3}$;
The covariance matrix of each Gaussian is parameterized as
$$\boldsymbol{\Sigma}=\boldsymbol{R} \boldsymbol{S} \boldsymbol{S}^{T} \boldsymbol{R}^{T},$$
where the scaling matrix $\boldsymbol{S}$ is derived from $\boldsymbol{s} \in \mathbb{R}^{3}$ and the rotation matrix $\boldsymbol{R}$ is obtained from $\boldsymbol{q} \in \mathbb{R}^{4}$.
Then, we can define each Gaussian with the center point $\boldsymbol{\mu} \in \mathbb{R}^{3}$ and covariance matrix $\boldsymbol{\Sigma}$:
$$G(\boldsymbol{x})=e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}.$$
Subsequently, we are required to project the 3D Gaussians onto the image plane. According to [37], given a camera view matrix $\boldsymbol{W}$, the projected covariance matrix $\boldsymbol{\Sigma}^{\prime}$ is given as follows:
$$\boldsymbol{\Sigma}^{\prime}=\boldsymbol{J} \boldsymbol{W} \boldsymbol{\Sigma} \boldsymbol{W}^{T} \boldsymbol{J}^{T},$$
where $\boldsymbol{J}$ is the Jacobian of the affine approximation of the camera projective transformation.
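To make the covariance parameterization and its projection concrete, the following is a minimal PyTorch sketch of the two formulas above; it is an illustration rather than the official rasterizer, and the toy view-matrix rotation `W` and projection Jacobian `J` are placeholder assumptions.

```python
import torch

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = (q / q.norm()).tolist()
    return torch.tensor([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance_3d(q, s):
    """Sigma = R S S^T R^T with S = diag(s); positive semi-definite by construction."""
    m = quat_to_rotmat(q) @ torch.diag(s)
    return m @ m.T

def covariance_2d(sigma, view_rot, jacobian):
    """Sigma' = J W Sigma W^T J^T: the covariance splatted onto the image plane."""
    return jacobian @ view_rot @ sigma @ view_rot.T @ jacobian.T

# Toy usage: one anisotropic Gaussian seen through an identity view rotation.
q = torch.tensor([1.0, 0.0, 0.0, 0.0])            # no rotation
s = torch.tensor([0.10, 0.20, 0.05])              # anisotropic scales
W = torch.eye(3)                                  # rotation part of the camera view matrix
J = torch.tensor([[1.0, 0.0, -0.3],
                  [0.0, 1.0, -0.1]])              # 2x3 Jacobian of the affine projection
print(covariance_2d(covariance_3d(q, s), W, J))   # 2x2 image-space covariance
```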
Each Gaussian additionally possesses two extra attributes: opacity $\alpha$ and color $\boldsymbol{c}$ (represented using spherical harmonics coefficients for view-dependency). Projecting the 3D Gaussians onto the image plane yields the corresponding 2D Gaussians. For each pixel, which may be covered by multiple projected Gaussians, the final pixel color can be determined through a fast differentiable $\alpha$-blending process after sorting the Gaussians by their depth:
$$C=\sum_{i \in N} c_{i} \alpha_{i}^{\prime} \prod_{j=1}^{i-1}\left(1-\alpha_{j}^{\prime}\right),$$
where the final opacity $\alpha_{i}^{\prime}$ is the product of the learned opacity $\alpha_{i}$ of the Gaussian and the evaluated 2D Gaussian:
$$\alpha_{i}^{\prime}=\alpha_{i} \cdot e^{-\frac{1}{2}\left(\boldsymbol{x}^{\prime}-\boldsymbol{\mu}_{i}^{\prime}\right)^{T} \boldsymbol{\Sigma}^{\prime-1}\left(\boldsymbol{x}^{\prime}-\boldsymbol{\mu}_{i}^{\prime}\right)},$$
where $\boldsymbol{x}^{\prime}$ and $\boldsymbol{\mu}_{i}^{\prime}$ are the coordinates of the pixel and the Gaussian center in the projected space, respectively.
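As a small illustration of the depth-sorted $\alpha$-blending above, the following sketch composites the Gaussians covering a single pixel front-to-back; the tensors and values are toy assumptions, not part of the actual tile-based rasterizer.

```python
import torch

def blend_pixel(colors, opacities, depths):
    """colors: (N, 3); opacities: (N,) already evaluated as alpha'; depths: (N,) camera depths."""
    order = torch.argsort(depths)                  # sort Gaussians front-to-back
    colors, opacities = colors[order], opacities[order]
    pixel, transmittance = torch.zeros(3), 1.0
    for c, a in zip(colors, opacities):
        pixel = pixel + transmittance * a * c      # c_i * alpha_i' * prod_{j<i}(1 - alpha_j')
        transmittance = transmittance * (1.0 - a)
    return pixel

# Toy usage: three Gaussians projected onto the same pixel.
colors = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
alphas = torch.tensor([0.6, 0.5, 0.9])
depths = torch.tensor([2.0, 1.0, 3.0])
print(blend_pixel(colors, alphas, depths))
```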
Although 3D Gaussian Splatting can be optimized from a set of randomly initialized three-dimensional Gaussians, this approach often yields suboptimal results. Utilizing Structure from Motion (SfM) techniques, a sparse point cloud of the scene to be reconstructed can be obtained at the outset. Employing this point cloud to initialize the Gaussians with prior knowledge leads to better results. Besides, 3DGS also
controls Gaussians through splitting, duplicating, and deleting based on the magnitude of gradients received during backpropagation. For a more detailed exposition of the process, one may refer to the survey [38].
3.3 Appearance Modeling
To enhance the adaptability of 3DGS to the varying lighting conditions and post-processing effects present in diverse wild images, we have implemented several modifications. These modifications are inspired by the insight that rendering outcomes are influenced both by external appearance conditions such as lighting, and by the intrinsic attributes of the objects themselves. Consequently, we replace the spherical harmonics coefficients, originally utilized for color representation in 3DGS, with a learnable implicit feature vector, denoted as $\boldsymbol{V}^{g}$. This vector, with a length of $n_{g}$, is designed to adaptively capture and represent the properties of each Gaussian. Concurrently, we utilize a CNN-based appearance extractor $\mathrm{E}_{\Phi}$ to obtain a latent appearance vector $\boldsymbol{V}_{i}^{a}$ of length $n_{a}$ for each image, serving as the current appearance condition.
Furthermore, to replace the view-dependent effects represented by spherical harmonics coefficients in the original 3DGS, we calculate a direction vector, denoted as $\boldsymbol{d}$, based on the position of each Gaussian and the camera's position coordinates. We apply positional encoding to this direction vector $\boldsymbol{d}$ using spherical harmonics, which is denoted by $\gamma(\boldsymbol{d})$. The rationale for this choice lies in the fact that frequency-space encoding is better suited for directional vectors than component-wise encodings. Subsequently, we concatenate the color implicit vector $\boldsymbol{V}^{g}$ of the Gaussians, the image appearance latent vector $\boldsymbol{V}_{i}^{a}$, and the directional encoding $\gamma(\boldsymbol{d})$, and feed them into an MLP-based color decoder $\mathrm{F}_{\theta}$ to predict the color of the Gaussians:
$$\boldsymbol{c}=\mathrm{F}_{\theta}\left(\boldsymbol{V}^{g}, \boldsymbol{V}_{i}^{a}, \gamma(\boldsymbol{d})\right).$$
We do not require a large-scale MLP: because our Gaussians inherently store color features, the decoding burden on $\mathrm{F}_{\theta}$ is not substantial. Therefore, a lightweight MLP is sufficient for decoding; in our implementation, we use 4 layers of 64 hidden units. These predicted colors $\boldsymbol{c}$ are used as the final emitted colors of the Gaussians. After obtaining all Gaussians' colors and the camera parameters, we can use the differentiable tile rasterizer based on the $\alpha$-blending formulation in Section 3.2 to render the scene with the appearance of the input image.
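The following is a minimal PyTorch sketch of the appearance-aware color prediction described in this section. The feature lengths ($n_g=24$, $n_a=48$) and the 4-layer, 64-unit decoder follow the text, while the exact convolutional layout of the extractor, the 25-dimensional degree-4 SH direction encoding, the sigmoid output, and all module names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AppearanceExtractor(nn.Module):
    """CNN that maps an in-the-wild image to a global appearance vector V_i^a."""
    def __init__(self, n_a=48):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, n_a)

    def forward(self, image):                        # image: (1, 3, H, W)
        return self.fc(self.conv(image).flatten(1))  # (1, n_a)

class ColorDecoder(nn.Module):
    """Lightweight MLP F_theta mapping (V^g, V_i^a, gamma(d)) to per-Gaussian RGB."""
    def __init__(self, n_g=24, n_a=48, dir_dim=25, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_g + n_a + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),      # RGB in [0, 1] (assumed output activation)
        )

    def forward(self, v_g, v_a, dir_enc):
        v_a = v_a.expand(v_g.shape[0], -1)           # broadcast the image feature to every Gaussian
        return self.mlp(torch.cat([v_g, v_a, dir_enc], dim=-1))

# Toy usage: 1000 Gaussians rendered under the appearance of one training image.
v_g = torch.randn(1000, 24)                          # learnable per-Gaussian color features
dir_enc = torch.randn(1000, 25)                      # stand-in for the SH-encoded directions gamma(d)
v_a = AppearanceExtractor()(torch.rand(1, 3, 128, 128))
colors = ColorDecoder()(v_g, v_a, dir_enc)           # (1000, 3) emitted colors
```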
3.4 Transient Objects Handling
Since our objective is to reconstruct static scenes, we argue that approaches like NeRF-W [9] and SWAG [12], which also model the transient scene, are doing redundant work that increases computational overhead. Therefore, we propose to create a 2D visibility map to exclude transient objects. Technically, we employ a transient UNet denoted by $\mathrm{U}_{\Psi}$ to obtain the final 2D visibility map $M_{i}$ for distinguishing transient and static objects in image $I_{i}$. This strategy is facilitated by the high-speed rendering of 3DGS, which
allows us to render the entire image in a single forward pass, whereas NeRF-based methods operate on a pixel-wise basis. Leveraging the UNet enables us to make full use of the intrinsic information present in the images, resulting in a more accurate distinction of transient parts. The whole process can be formulated as
$$M_{i}=\mathrm{U}_{\Psi}\left(I_{i}\right).$$
In order to accelerate the speed as much as possible, our UNet adopts one downsampling and one upsampling layer, and finally outputs a per-pixel classification map of the same size as the input image $I_{i}$. This map assigns values from 0 to 1, where values closer to 1 indicate higher confidence that the current pixel corresponds to a static object. Note that we employ a self-supervised learning approach, wherein the model autonomously identifies pixels that are challenging to train across different images as transient objects unique to each image, without the need for additional segmentation priors for supervision. Additionally, to prevent the trivial solution in which all pixels are assumed to be transient, we also employ a regularization loss on the 2D visibility map:
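One simple instantiation, assuming the regularizer penalizes the visibility map for drifting away from the all-static map (values near 1), is
$$\mathcal{L}_{reg}=\frac{1}{\left|I_{i}\right|} \sum_{p \in I_{i}}\left(1-M_{i}(p)\right)^{2},$$
which is maximized by the trivial all-transient map and therefore discourages it.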
The obtained $M_{i}$ serves as a mask to eliminate the involvement of pixels representing transient objects during the optimization. Specifically, when calculating the reconstruction loss, the image is softly masked by $M_{i}$ to exclude transient objects.
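A minimal PyTorch sketch of this transient-removal step is given below: a tiny UNet with one downsampling and one upsampling stage predicts the visibility map, which then softly masks the per-pixel reconstruction error. The channel widths, the skip connection, and the helper names are assumptions for illustration, not the authors' network.

```python
import torch
import torch.nn as nn

class TransientUNet(nn.Module):
    """Tiny UNet U_Psi: image -> per-pixel visibility map in [0, 1] (1 = static)."""
    def __init__(self, base=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, base, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.out = nn.Conv2d(base * 2, 1, 1)         # consumes the skip-connected features

    def forward(self, image):                        # image: (B, 3, H, W) with even H, W
        e = self.enc(image)
        u = self.up(self.down(e))
        return torch.sigmoid(self.out(torch.cat([e, u], dim=1)))

def masked_l1(render, target, visibility):
    """Down-weight pixels that the visibility map marks as transient (values near 0)."""
    return (visibility * (render - target).abs()).mean()

# Toy usage for one training image I_i and its current rendering.
image = torch.rand(1, 3, 64, 64)
render = torch.rand(1, 3, 64, 64)
M = TransientUNet()(image)                           # 2D visibility map M_i
loss = masked_l1(render, image, M)                   # masked reconstruction term (cf. Section 3.5)
```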
3.5 Optimization
Similar to 3DGS [4], we use an L1 loss and an SSIM loss to regulate the similarity between the predicted and ground-truth images. Both losses are modulated by our 2D visibility map $M_{i}$ to prevent modeling transient objects. The full loss function during the optimization is defined as follows:
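A plausible form of this objective, assuming that $\lambda_{1}$ balances the two reconstruction terms as in 3DGS, that both terms are computed on the visibility-masked images $M_{i} \odot \hat{I}_{i}$ and $M_{i} \odot I_{i}$ (with $\hat{I}_{i}$ the rendered image), and that $\lambda_{m}$ weights the visibility regularizer, is
$$\mathcal{L}=\lambda_{1} \mathcal{L}_{1}\left(M_{i} \odot \hat{I}_{i}, M_{i} \odot I_{i}\right)+\left(1-\lambda_{1}\right) \mathcal{L}_{\text{D-SSIM}}\left(M_{i} \odot \hat{I}_{i}, M_{i} \odot I_{i}\right)+\lambda_{m} \mathcal{L}_{reg}.$$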
4 Experiments
In this section, we discuss the various details of our experiments, including the datasets, baselines, evaluation metrics, and implementation details in Section 4.1. Comparisons with previous in-the-wild reconstruction work are presented in Section 4.2, while ablation studies are detailed in Section 4.3. Additionally, we offer further insights into appearance control in Section 4.4.
Fig. 3 Qualitative comparison results of NeRF-based methods (NeRF-W [9], Ha-NeRF [10], CR-NeRF [11]) and our method on the Phototourism dataset. Our method better captures lighting information and building details (e.g., highlights of the bronze horse and signpost in Brandenburg Gate, sunshine on stone in Trevi Fountain, and sunlight on the castle and the long spear in the hand of a knight sculpture in Sacré-Cœur).
4.1 Experimental Settings
4.1.1 Dataset
Similar to previous work, we use the Phototourism dataset [39], an internet photo collection of cultural landmarks, as our primary dataset for this study. We use COLMAP [40] to obtain camera parameters and initial point clouds for three specific scenes within the dataset (“Brandenburg Gate”, “Sacré-Cœur”, and “Trevi Fountain”). Meanwhile, to demonstrate the robustness of our method, we also evaluate our method on the NeRF-OSR dataset [41]. We choose 4 sites for experimentation (europa, lwp, st, and stjohann), and 1/8 of the images in each site are selected as the test set. During the training phase, all images are downsampled by a factor of 2.
4.1.2 Baselines
We primarily compare the rendering quality, training time, and rendering speed of our method with NeRF-based methods (NeRF-W [9], Ha-NeRF [10], CR-NeRF [11]) and concurrent 3DGS-based methods (the original 3DGS [4] and SWAG [12]) to demonstrate the superiority of our method.
4.1.3 Metrics
Rendering quality is assessed using performance metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure [42] (SSIM), and Learned Perceptual Image Patch Similarity [43] (LPIPS). It should be noted that the LPIPS values reported by the NeRF-based and 3DGS-based methods use different backbones. For a fair comparison, we use AlexNet as the backbone when comparing with the NeRF-based methods, while VGG is used when comparing with the 3DGS-based methods.
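For reference, a small evaluation sketch is shown below; it assumes the third-party lpips and scikit-image packages, whose AlexNet and VGG backbones correspond to the LPIPS$_a$ and LPIPS$_v$ columns, and the random tensors merely stand in for rendered and ground-truth images.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_alex = lpips.LPIPS(net='alex')   # backbone used when comparing with NeRF-based methods
lpips_vgg = lpips.LPIPS(net='vgg')     # backbone used when comparing with 3DGS-based methods

pred = torch.rand(1, 3, 128, 128)      # rendered image in [0, 1]
gt = torch.rand(1, 3, 128, 128)        # ground-truth image in [0, 1]

psnr = peak_signal_noise_ratio(gt.numpy(), pred.numpy(), data_range=1.0)
ssim = structural_similarity(gt[0].permute(1, 2, 0).numpy(),
                             pred[0].permute(1, 2, 0).numpy(),
                             channel_axis=-1, data_range=1.0)
lpips_a = lpips_alex(pred * 2 - 1, gt * 2 - 1).item()   # lpips expects inputs in [-1, 1]
lpips_v = lpips_vgg(pred * 2 - 1, gt * 2 - 1).item()
print(psnr, ssim, lpips_a, lpips_v)
```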
4.1.4 Implementation Details
Our implementation is based on the official implementation of 3DGS [4], which uses PyTorch [44] and the PyTorch CUDA extension. Both the training and testing phases are conducted on a single RTX 3090 GPU with 24G memory. The parameters we need to train include $\mathrm{E}_{\Phi}$, $\mathrm{F}_{\theta}$, $\mathrm{U}_{\Psi}$, and the properties of the Gaussians themselves. Our full loss is defined in Section 3.5. The additional hyperparameters $\lambda_{1}$ and the feature lengths $n_{a}$, $n_{g}$ are set to 0.8, 48, and 24, respectively. For the parameter $\lambda_{m}$, we use exponentially annealed weights, starting at 0.5 at the beginning of training and gradually decreasing to 0.15 as iterations proceed. We utilize spherical harmonics encoding with a degree of 4 for encoding directional vectors. For the $\mathrm{E}_{\Phi}$ responsible for extracting image features, our implementation is consistent with Ha-NeRF [10], which has five convolutional layers, one average pooling layer, and a fully connected layer. For the $\mathrm{U}_{\Psi}$, we employ one downsampling and one upsampling layer. The $\mathrm{F}_{\theta}$ consists of 4 fully connected layers, each with a width of 64. The training process closely follows the 3DGS approach, with a total of 30,000 iterations. Notably, after 15,000 iterations, we no longer employ the duplication, splitting, and deletion of Gaussians, thus maintaining a constant number of Gaussians.

Fig. 5 Qualitative comparison results of 3DGS-based methods (3DGS [4], SWAG [12]) and our method on the Phototourism dataset. Since SWAG is not open-sourced, we have selected a subset of results from its publication for comparison. Our rendering results are softer, aligning more closely with human vision.
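As a small illustration of the annealing schedule for $\lambda_{m}$, the sketch below decays it exponentially from 0.5 to 0.15 over the 30,000 training iterations; the exact decay law is an assumption, since only the endpoints are specified.

```python
def lambda_m(step, total_steps=30_000, start=0.5, end=0.15):
    """Exponential interpolation from `start` to `end` over `total_steps` iterations."""
    return start * (end / start) ** (step / total_steps)

print(lambda_m(0), lambda_m(15_000), lambda_m(30_000))   # 0.5 -> ~0.27 -> 0.15
```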
4.2 Comparison with Existing Methods
4.2.1 Rendering Results
Table 1 and Fig. 3 show the quantitative and qualitative results on the Phototourism dataset, while Table 2 and Fig. 4 show those on the NeRF-OSR dataset; together they demonstrate the comprehensive superiority of our method over NeRF-based methods. Although NeRF-W, Ha-NeRF, and CR-NeRF use additional embeddings to represent the appearance of different images and take measures to remove transient objects, they are unable to accurately capture the details of buildings and certain highlights (e.g., highlights of the bronze horse and signpost in Brandenburg Gate, sunshine on stone in Trevi Fountain, and sunlight on the castle and the long spear in the hand of a knight sculpture in Sacré-Cœur), thus leading to a lower fidelity in reproducing fine details compared to our method. Thanks to our Gaussian-intrinsic features and the appearance features extracted from diverse images, our method outperforms all NeRF-based methods across all datasets and all metrics, additionally yielding superior visual effects.
Table 1 Quantitative results on Phototourism dataset.
The bold and the underlined numbers indicate the best and second-best results, respectively. LPIPS$_{a}$ employs AlexNet as the backbone, whereas LPIPS$_{v}$ employs VGG as the backbone.
In addition to methods based on NeRF, we also conduct comparisons with concurrent methods based on 3DGS: the original 3DGS and SWAG. The quantitative results are shown in Table 1 and the qualitative results are presented in Fig. 5. The original 3DGS struggles to capture the appearance differences between different images, leading to low values on all metrics, while our method excels in reconstructing the appearance corresponding to each image. Furthermore, it should be noted that both NeRF-W and SWAG utilize half of the test set images for training their respective appearance embeddings, which gives them an unfair advantage over our method. Despite this, our method still achieves competitive performance without the need to use this additional half of the test images and manages to outperform them on most of the metrics. As illustrated in Fig. 5, within the Trevi Fountain scene, SWAG tends to render sharper images, leading to marginal advantages in PSNR and SSIM. Conversely, our rendering results are softer, aligning more closely with human vision, thereby resulting in a significant reduction in LPIPS compared with SWAG.
Table 2 Quantitative results on the NeRF-OSR dataset.
The bold and the underlined numbers indicate the best and second-best results, respectively. LPIPS$_{v}$ employs VGG as the backbone.
4.2.2 Efficiency
Table 3 demonstrates the efficiency of our method. On a single RTX 3090 GPU, both training and rendering speeds are accelerated by nearly 1000 times compared to NeRF-based methods. Compared with the concurrent work SWAG [12], both our training and rendering speeds are substantially higher. Compared to the original 3DGS, during training, our method merely incorporates an additional appearance extractor for feature extraction, a lightweight MLP for color decoding, and a transient UNet for predicting the 2D visibility map, with all networks being very small. This ensures that our approach is only slightly slower than the original 3DGS. During rendering, the elimination of transient objects and the consequent reduction in the number of Gaussians endow our method with more efficient real-time rendering capability.
4.3 Ablation Study
In this section, we conduct ablation experiments separately on two modules: the appearance feature extraction used for appearance modeling and the 2D visibility map used for transient object handling. We denote the removal of the appearance modeling module as AAGS-T, and the exclusion of the transient objects handling module as AAGS-A. We conduct ablation experiments across three scenes: Brandenburg Gate, Sacré-Cœur, and Trevi Fountain. The final quantitative results, presented in Table 4, and the qualitative results, shown in Fig. 6, demonstrate the effectiveness of each component.

Table 3 Comparison of Efficiency.
Method | Training Time (hours) ↓
All comparisons are conducted on a single RTX 3090, with the average training time and FPS across three scenes reported for comparison. The bold and the underlined numbers indicate the best and second-best results, respectively. Our method accelerates both training and rendering speeds by nearly 1000 times compared to NeRF-based methods.

Table 4 Ablation studies and number of Gaussians across three different real scenes from the Phototourism dataset [39].
The bold and the underlined numbers indicate the best and second-best results, respectively. LPIPS$_{v}$ here employs VGG as the backbone. #G means the number of Gaussians. Our method effectively reduces unnecessary Gaussians.
Fig. 6 Ablation study. Visual results of 3DGS, our AAGS, and two variants: AAGS-T and AAGS-A. With the appearance modeling and transient objects handling modules, our method is able to capture appearance features and effectively reduce transient objects.
4.3.1 Without appearance modeling
In this experiment, we remove the appearance modeling module, resulting in the model's inability to capture the appearance characteristics specific to each in-the-wild image, which significantly degrades visual performance. In contrast, our appearance modeling module extracts relevant image features, thereby facilitating the accurate reconstruction of the appearance characteristics across various images.
4.3.2 Without transient objects handling
In this experiment, we remove the transient UNet. Note that removing transient object handling may lead to slightly better metric scores because most test images lack transient objects. The qualitative results, as shown in Fig. 9, reveal that removing this module leads the model to also model transient objects, ultimately resulting in the emergence of artifacts. As shown in the visibility map, our transient objects handling module effectively distinguishes which objects are transient, preventing them from interfering with the outcome (note that, for visual effect, we have inverted the visibility map). Furthermore, we present the impact of this module on the number of Gaussians in Table 4. As we no longer model transient objects, the number of Gaussians decreases significantly across all scenes, resulting in a substantial increase in both training and rendering efficiency.
4.4 Controllable Appearance
Thanks to our appearance extractor, we are able to render images with different appearance features from arbitrary viewpoints, as shown in Fig. 7. Our approach is capable of rendering scenes with consistent visual characteristics from varying viewpoints. Furthermore, as shown in Fig. 8, we present four images generated from a fixed camera position, in which we interpolate between the appearance features associated with the images on the left and right. It can be observed that our method is capable of smoothly and naturally interpolating to obtain images under two different appearances without altering the 3D structure of the scene.
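A minimal sketch of this appearance interpolation is shown below: the appearance vectors of two reference images are linearly blended, and each blend replaces $\boldsymbol{V}_{i}^{a}$ when decoding Gaussian colors; the feature length and random vectors are illustrative.

```python
import torch

def interpolate_appearances(v_a, v_b, steps=6):
    """Linearly blend two appearance vectors; each row is fed to the color decoder
    in place of a single image's appearance feature."""
    ts = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    return (1 - ts) * v_a + ts * v_b                 # (steps, n_a)

# Toy usage with n_a = 48 features extracted from two reference images.
v_a, v_b = torch.randn(1, 48), torch.randn(1, 48)
blends = interpolate_appearances(v_a, v_b)           # six intermediate appearance conditions
```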
5 Conclusion and Discussion
In this paper, we propose Appearance-Aware 3D Gaussian Splatting, a method designed for reconstructing scenes from unconstrained photo collections. We assign a specific feature instead of a color to each Gaussian and employ a CNN to extract appearance features that model the appearance of different images. We also utilize a UNet to generate a 2D visibility map that excludes transient objects. Extensive experiments demonstrate that our method outperforms previous approaches qualitatively and quantitatively and significantly reduces both training and rendering times.

Fig. 7 Appearance transfer results of our approach. The first row represents the reference images we want to extract appearance features from, and the first column represents the new viewpoints we want to render. It can be observed that the rendered results from novel views present the same appearances as the reference images.
Despite our promising performance, there is still room for improvement in this field. Firstly, our method struggles to effectively reconstruct highlights and objects such as clouds and flowing water. Secondly, similar to 3DGS, our approach is sensitive to accurate camera poses and well-initialized point clouds. Thirdly, due to our use of self-supervised methods to train the UNet, incorrect dynamic objects are sometimes recognized as static objects, as shown in Fig. 9.

Fig. 8 Appearance interpolation results of our approach. We perform linear interpolation on the features of Appearance A and Appearance B, and render six images for each scene. It can be seen that the results naturally transition from one appearance to the other, indicating that our appearance extractor can capture global color information accurately.

Fig. 9 Ablation experiment on transient objects handling. Our 2D visibility map from the transient UNet is capable of distinguishing transient objects and reducing unnecessary artifacts. Note that for a clear illustration, we have inverted the visibility map (a value closer to 1 indicates a higher confidence that the current pixel corresponds to a transient object).