
Style-NeRF2NeRF: 3D Style Transfer from Style-Aligned Multi-View Images

Haruo Fujiwara
The University of Tokyo
fujiwara@mi.t.u-tokyo.ac.jp

Yusuke Mukuta
The University of Tokyo / RIKEN
mukuta@mi.t.u-tokyo.ac.jp

Tatsuya Harada
The University of Tokyo / RIKEN
harada@mi.t.u-tokyo.ac.jp

June 25, 2024

Abstract

We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/

Keywords: Neural Radiance Fields · Style Transfer · Sliced Wasserstein

1 Introduction

Thanks to recent advancements in 3D reconstruction techniques such as Neural Radiance Fields (NeRF) Mildenhall et al. [2020], it is nowadays possible for creators to develop a 3D asset or a scene from captured real-world data without intensive labor. While such 3D reconstruction methods work well, editing an entire 3D scene to match a desired style or concept is not straightforward.

For instance, editing conventional 3D scenes based on explicit representations like mesh often involves specialized tools and skills. Changing the appearance of the entire mesh-based scene would often require skilled labor, such as shape modeling, texture creation, and material parameter modifications.

At the advent of implicit 3D representation techniques such as NeRF, style editing methods for 3D are also emerging Nguyen-Phuoc et al. [2022], Wang et al. [2023], Liu et al. [2023], Kamata et al. [2023], Haque et al. [2023], Dong and Wang [2024] to enhance creators' content development process. Following the recent development of 2D image generation models, prominent works such as Instruct-NeRF2NeRF Haque et al. [2023], Vachha and Haque [2024], and Dong and Wang [2024] proposed to leverage the knowledge of large-scale pre-trained text-to-image (T2I) models to supervise the 3D NeRF editing process.

These methods employ a custom pipeline based on an instruction-based T2I model, "Instruct-Pix2Pix" Brooks et al. [2023], to stylize a 3D scene with text instructions. While Instruct-NeRF2NeRF is proven to work well for editing 3D scenes including large-scale 360° environments, their method involves an iterative process of editing and replacing the training data during NeRF optimization, occasionally resulting in unpredictable results. As editing by Instruct-Pix2Pix runs in tandem with NeRF training, we found adjusting or testing editing styles beforehand difficult.

To overcome this problem, we propose an artistic style-transfer method that trains a source 3D NeRF scene on stylized images prepared in advance by a text-guided style-aligned diffusion model. Training is guided by the Sliced Wasserstein Distance (SWD) loss Heitz et al. [2021], Li et al. [2022] to effectively perform 3D style transfer with NeRF. A summary of our contributions is as follows:
  • We propose a novel 3D style-transfer approach for NeRF, including large-scale outdoor scenes.
  • We show that a style-aligned diffusion model conditioned on depth maps of corresponding source views can generate perceptually view-consistent style images for fine-tuning the source NeRF. Users can test stylization ideas with the diffusion pipeline before proceeding to the NeRF fine-tuning phase.
  • We find that fine-tuning the source NeRF with SWD loss can perform 3D style transfer well.
  • Our experimental results illustrate the rich capability of stylizing scenes with various text prompts.

2 Related Work

2.1 Implicit 3D Representation

NeRF, introduced by the seminal paper Mildenhall et al. [2020], became one of the most popular implicit 3D representation techniques due to several benefits. NeRF can render photo-realistic novel views at arbitrary resolution thanks to its continuous representation with a compact model, compared to explicit representations such as polygon meshes or voxels. In our research, we use the "nerfacto" model implemented by Nerfstudio Tancik et al. [2023], which is a combination of modular features from multiple papers Wang et al. [2021], Barron et al. [2022], Müller et al. [2022], Martin-Brualla et al. [2021], Verbin et al. [2022], designed to achieve a balance between speed and quality.

2.2 Style Transfer

2.2.1 2D Style Transfer.

Style transfer is a technique for blending two images, a source image and a style image, to create another image that retains the first's content but exhibits the second's style. Since the introduction of the foundational style transfer algorithm proposed by Gatys et al. [2015], many follow-up works on 2D style transfer have explored further improvements such as faster optimization Johnson et al. [2016], Huang and Belongie [2017], zero-shot style transfer Li et al. [2017], and photo-realism Luan et al. [2017].

2.2.2 3D Style Transfer.

Several recent 3D style transfer works have applied style transfer techniques using deep feature statistics Huang and Belongie [2017] to NeRF Liu et al. [2023], Wang et al. [2023], Zhang et al. [2022]. Although these methods require a reference style image, text-based 3D editing techniques have also been proposed leveraging foundational 2D Text-to-Image (T2I) generative models. While Instruct 3D-to-3D Kamata et al. [2023] proposed using the Score Distillation Sampling (SDS) loss Poole et al. [2022] for text-guided NeRF stylization, Instruct-NeRF2NeRF Haque et al. [2023] and ViCA-NeRF Dong and Wang [2024] perform NeRF editing by optimizing the underlying scene with a process referred to as Iterative Dataset Update (Iterative DU), which gradually replaces the input images with edited images from Instruct-Pix2Pix Brooks et al. [2023], an image-conditioned instruction-based diffusion model, followed by an update of NeRF. Inspired by these methods, we also develop a 3D style transfer method for NeRF, supervised by images created by a diffusion pipeline but without Iterative DU.

2.3 Diffusion Models

Diffusion models Sohl-Dickstein et al. [2015], Song et al. [2020a], Dhariwal and Nichol [2021] are generative models that have gained significant attention for their ability to generate high-quality, diverse images. Inspired by classical non-equilibrium thermodynamics, they are trained to generate an image by reversing the diffusion process, progressively denoising noisy images towards meaningful ones. Diffusion models are commonly trained with classifier-free guidance Ho and Salimans [2022] to enable image generation conditioned on an input text.

2.3.1 Controlled Generations with Diffusion Models.

Leveraging the success of T2I diffusion models, recent research has expanded their application to controlled image generation and editing, notably in image-to-image (I2I) tasks Meng et al. [2021], Parmar et al. [2023], Kawar et al. [2023], Tumanyan et al. [2023], Mokady et al. [2023], Hertz et al. [2023a, 2022], Brooks et al. [2023]. For example, SDEdit Meng et al. [2021] achieves this by first adding noise to a source image and then guiding the diffusion process toward an output based on a given prompt. ControlNet Zhang et al. [2023] was proposed as an add-on architecture for training T2I diffusion models with extra conditioning inputs such as depth, pose, edge maps, and more. Several recent techniques Hertz et al. [2023b], Sohn et al. [2024], Cheng et al. [2023] focus on generating style-aligned images. In our work, we use a depth-conditioned I2I pipeline with an attention-sharing mechanism similar to "StyleAligned" Hertz et al. [2023b] to create a set of multi-view images sharing a consistent style.
Figure 1: Overall Pipeline: Our method consists of distinct procedures. We first prepare a NeRF model of the source view images. Given the depth maps of the corresponding views (by either estimation or rendering by NeRF), we generate stylized multi-view images using a style-aligned diffusion model. Lastly, we fine-tune the source NeRF on the stylized images using the SWD loss.

3 Method

3.1 Preliminaries

3.1.1 Neural Radiance Fields.

NeRF Mildenhall et al. [2020] models a volumetric 3D scene as a continuous function by mapping a 3D coordinate $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ to a color $\mathbf{c}$ and a density $\sigma$. This function is often parameterized by a neural network combined with voxel grid structures or other encoding techniques Fridovich-Keil et al. [2022], Müller et al. [2022], Sun et al. [2022a,b] to accelerate performance. Given a NeRF model trained on a set of 2D images taken from various viewpoints of a target scene, the accumulated color $\hat{C}(\mathbf{r})$ along an arbitrary camera ray $\mathbf{r}$ is calculated with the quadrature rule by volume rendering Max [1995]:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp\left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \tag{1}$$

where $\delta_i$ is the distance between sampled points on the ray and $T_i$ is the accumulated transmittance from the origin to the $i$-th sample.
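As a concrete illustration of the quadrature above, the following is a minimal PyTorch sketch of the volume rendering sum; tensor shapes and the function name are illustrative assumptions rather than the nerfacto implementation.

```python
import torch

def volume_render(sigmas: torch.Tensor, colors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Composite per-sample densities and colors along each ray (equation 1).

    sigmas: (num_rays, num_samples)     densities at the sampled points
    colors: (num_rays, num_samples, 3)  RGB values at the sampled points
    deltas: (num_rays, num_samples)     distances between adjacent samples
    """
    alpha = 1.0 - torch.exp(-sigmas * deltas)                  # opacity of each ray segment
    # Accumulated transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j), shifted by one sample.
    trans = torch.exp(-torch.cumsum(sigmas * deltas, dim=-1))
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = trans * alpha                                    # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)        # (num_rays, 3) accumulated color
```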

3.1.2 Conditional Diffusion Models.

Recent T2I diffusion models Rombach et al. [2022], Podell et al. [2023] are built with a U-Net architecture Ronneberger et al. [2015] integrating convolutional layers and attention blocks Vaswani et al. [2017]. Within the model, attention blocks play a crucial role in correlating text with relevant parts of the deep features during image generation. Our work uses an open-source latent diffusion model Podell et al. [2023], which includes a CLIP text encoder Radford et al. [2021] for text embedding. The cross-attention between the contextual text embedding and the deep features of the denoising network is calculated as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \tag{2}$$

where $W_Q, W_K, W_V$ are projection matrices: the queries $Q$ are projected from a deep feature map $X$, while the keys $K$ and values $V$ are projected from the conditional text embedding. We may interpret the attention operation in equation 2 as values $V$, originating from the conditional text, weighted by the correlation of the queries $Q$ and the keys $K$. There are often multiple attention heads in each layer along the channel dimension to allow the model to jointly attend to information from different subspaces of the feature space:

$$\text{MultiHead}(Q, K, V) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_h\right) W_O, \qquad \text{head}_i = \text{Attention}\left(Q W_Q^i, K W_K^i, V W_V^i\right).$$
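For reference, a minimal sketch of the scaled dot-product cross-attention in equation 2 is shown below; the projection setup (queries from image features, keys and values from the text embedding) follows the description above, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention(x: torch.Tensor, text_emb: torch.Tensor,
                    w_q: torch.nn.Linear, w_k: torch.nn.Linear, w_v: torch.nn.Linear) -> torch.Tensor:
    """Single-head cross-attention as in equation 2.

    x:        (batch, tokens, d_model)      deep feature map flattened over spatial positions
    text_emb: (batch, text_tokens, d_text)  contextual text embedding from the CLIP encoder
    """
    q = w_q(x)                          # queries come from the image features
    k = w_k(text_emb)                   # keys come from the conditional text
    v = w_v(text_emb)                   # values come from the conditional text
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```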

3.2 Style-NeRF2NeRF

Our method is a distinct two-step process. First, we prepare stylized images of corresponding source views using our style-aligned diffusion pipeline, and then refine the source NeRF model based on the generated views to acquire a style-transferred 3D scene.

3.2.1 Style-Aligned Image-to-Image Generation.

Given a set of source view images $\{I_i\}_{i=1}^{N}$, our first goal is to generate a corresponding set of stylized view images $\{\hat{I}_i\}_{i=1}^{N} = \Phi(\{I_i\}, y)$ under a text condition $y$, with as much perceptual view consistency among the images as possible, where $\Phi$ consists of a sampling process such as DDIM Song et al. [2020b].

Although T2I diffusion models can generate rich images with arbitrary text prompts, merely sharing the same prompt across different source views is insufficient to generate stylized images with a perceptually consistent style. To alleviate this problem, we apply a fully-shared-attention variant of the style-aligned image generation method proposed by Hertz et al. [2023b]. Let $Q_i, K_i, V_i$ be the queries, keys, and values from a deep feature $X_i$ for view $i$; then we generate the $N$ stylized views simultaneously using the following fully-shared attention:

$$\text{Attention}\left(Q_i, K_{1..N}, V_{1..N}\right) = \text{softmax}\left(\frac{Q_i \left[K_1, \ldots, K_N\right]^{\top}}{\sqrt{d}}\right) \left[V_1, \ldots, V_N\right].$$

Figure 4 illustrates an example of multi-view images generated with and without the fully-shared-attention mechanism.
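A simplified reading of this fully-shared attention can be sketched by letting each view's queries attend to the keys and values concatenated over all views in the batch; this is an illustrative interpretation, not the exact implementation of Hertz et al. [2023b].

```python
import torch
import torch.nn.functional as F

def fully_shared_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (num_views, heads, tokens, head_dim) self-attention tensors for the batch of views.

    Every view attends to the keys and values of all views, which ties their styles together."""
    n, h, t, d = k.shape
    # Concatenate keys/values over the view axis and broadcast them to every view.
    k_all = k.permute(1, 0, 2, 3).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    v_all = v.permute(1, 0, 2, 3).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    scores = q @ k_all.transpose(-2, -1) / (d ** 0.5)          # (n, h, t, n * t)
    return F.softmax(scores, dim=-1) @ v_all                   # (n, h, t, d)
```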

3.2.2 Conditioning on Source Views.

To further strengthen perceptual consistencies across multi-view frames, we attach a depth-conditioned ControlNet Zhang et al. [2023] and optionally enable SDEdit Meng et al. [2021] for conditioning on the source view. As for the depth inputs, we may either render the corresponding depth maps from the source NeRF or use an off-the-shelf depth estimator model such as MiDaS Ranftl et al. [2020].
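For the scenes where we rely on monocular depth estimation, a per-view depth map can be obtained, for example, with MiDaS through torch.hub; the snippet below is a minimal sketch of that preprocessing step, with the model variant and normalization chosen as assumptions.

```python
import cv2
import torch

# Load a MiDaS depth estimator and its matching input transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

@torch.no_grad()
def estimate_depth(image_path: str) -> torch.Tensor:
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    batch = midas_transforms.dpt_transform(img)     # (1, 3, H', W') normalized input
    depth = midas(batch)                            # (1, H', W') relative inverse depth
    # Normalize to [0, 1] so the map can be fed to a depth-conditioned ControlNet.
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return depth.squeeze(0)
```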

Given a set of translated multi-view images $\{\hat{I}_i\}$ based on the style text $y$ and the corresponding camera poses used for training the source NeRF model, we may proceed to the NeRF refining stage described below.

3.2.3 NeRF Fine-Tuning.

Based on the perceptually view-consistent images $\{\hat{I}_i\}$ created by the style-aligned image-to-image diffusion model, our next objective is to fine-tune the source NeRF scene to reflect the target style in a 3D-consistent manner.

Although the stylized multi-view images are a good starting point for fine-tuning the source NeRF, we found that using a common RGB pixel loss is prone to over-fitting due to ambiguities in 3D geometry and color. Therefore, an alternative loss function that reflects perceptual similarity is preferred for guiding the 3D style-transfer process. To meet this requirement, we employ the Sliced Wasserstein Distance (SWD) loss Heitz et al. [2021].

3.3 Sliced Wasserstein Distance Loss.

Feature statistics of pre-trained Convolutional Neural Networks (CNNs) such as VGG-19 Simonyan and Zisserman [2014] are known to be useful for representing a style of an image Gatys et al. [2015], Johnson et al. [2016], Huang and Belongie [2017], Li et al. [2017], Luan et al. [2017]. In our study we employ the SWD loss originally proposed for texture synthesis Heitz et al. [2021] as the loss term to guide the style-transfer process for NeRF.

Let $F_l^n$ denote the feature vector of the $l$-th convolutional layer at pixel $n$, where $N_l$ is the number of pixels and $C_l$ is the feature dimension size. Using the Dirac delta function $\delta(\cdot)$, we may express the discrete probability density function $p_l$ of the features for layer $l$ as below:

$$p_l(x) = \frac{1}{N_l} \sum_{n=1}^{N_l} \delta\left(x - F_l^n\right).$$

Using the feature distributions $p_l$ for the rendered image and their corresponding optimization targets $q_l$ from the stylized image, the style loss is defined as a sum of SWD over the layers:

$$\mathcal{L}_{\text{style}} = \sum_{l} \text{SWD}\left(p_l, q_l\right),$$

where $\text{SWD}(p_l, q_l)$ is the SWD term, defined as the expectation over 1-dimensional Wasserstein distances of the features projected onto random directions $V$ sampled from a unit hypersphere.

Using the projected scalar features $p_l^V = \left\{ \left\langle F_l^n, V \right\rangle \right\}_{n=1}^{N_l}$, where $V$ is a random unit direction, the SWD term is written as

$$\text{SWD}\left(p_l, q_l\right) = \mathbb{E}_{V}\left[ \mathcal{W}_2^2\left(p_l^V, q_l^V\right) \right] = \mathbb{E}_{V}\left[ \frac{1}{N_l} \left\| \operatorname{sort}\left(p_l^V\right) - \operatorname{sort}\left(q_l^V\right) \right\|_2^2 \right],$$

where the 1-dimensional 2-Wasserstein distance $\mathcal{W}_2$ is trivially calculated in closed form by taking the element-wise distances between the sorted scalars in $p_l^V$ and $q_l^V$. An illustration of a projected 1D Wasserstein distance is shown in figure 2. Expectation over random projections $V$ provides a good approximation in practice, and the optimized distribution is proven to converge to the target distribution, since SWD is known to capture the complete target distribution Pitie et al. [2005], as described below:

$$\text{SWD}\left(p_l, q_l\right) = 0 \iff p_l = q_l.$$
The calculation of SWD scales in $\mathcal{O}(N \log N)$ for an $N$-dimensional distribution, making it suitable for machine learning applications with gradient descent algorithms.
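A minimal sketch of the resulting sliced Wasserstein style loss over VGG-19 features is given below; the chosen layers, number of projections, and omission of ImageNet normalization are illustrative assumptions rather than the exact training code.

```python
import torch
import torchvision

# Pre-trained VGG-19 used to extract per-layer feature point clouds.
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
LAYERS = [1, 6, 11, 20, 29]  # assumed set of ReLU layers to compare

def vgg_features(img: torch.Tensor) -> list:
    """img: (1, 3, H, W) in [0, 1]. Returns a list of (N_l, C_l) feature point clouds."""
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            feats.append(x.flatten(2).squeeze(0).T)  # (H_l * W_l, C_l)
    return feats

def sliced_wasserstein(f: torch.Tensor, g: torch.Tensor, num_proj: int = 64) -> torch.Tensor:
    """Sorted 1D Wasserstein distances over random unit projections of two equally sized point clouds."""
    proj = torch.randn(f.shape[1], num_proj, device=f.device)
    proj = proj / proj.norm(dim=0, keepdim=True)               # directions on the unit hypersphere
    fp = (f @ proj).sort(dim=0).values
    gp = (g @ proj).sort(dim=0).values
    return ((fp - gp) ** 2).mean()

def swd_style_loss(render: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-layer sum of SWD terms defining the style loss."""
    return sum(sliced_wasserstein(f, g) for f, g in zip(vgg_features(render), vgg_features(target)))
```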

3.4 Style Blending.

Given two different stylized views $\hat{I}^{A}, \hat{I}^{B}$ and their corresponding feature distributions $q_l^{A}, q_l^{B}$, one may obtain a style-blended scene by refining the source NeRF model towards the Wasserstein barycenter, where $\alpha \in [0, 1]$ is the blending weight between the two styles:

$$\mathcal{L}_{\text{blend}} = \sum_{l} \left[ \alpha \, \text{SWD}\left(p_l, q_l^{A}\right) + (1 - \alpha) \, \text{SWD}\left(p_l, q_l^{B}\right) \right].$$

An example of style blending is shown in figure 3.
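Under the sliced view above, the 1D Wasserstein barycenter of two style targets is, per random direction, simply the weighted average of their sorted projections. The sketch below illustrates that interpretation; it assumes feature point clouds of equal size per layer (for example, from the hypothetical `vgg_features` helper above) and is not the authors' implementation.

```python
import torch

def blended_swd(render_feats: list, style_a_feats: list, style_b_feats: list,
                alpha: float = 0.5, num_proj: int = 64) -> torch.Tensor:
    """Per layer and per random direction, the 1D barycenter of the two style targets is the
    alpha-weighted average of their sorted projections; matching the rendered features against
    it pulls the scene towards a blend of the two styles."""
    loss = 0.0
    for f, ga, gb in zip(render_feats, style_a_feats, style_b_feats):
        proj = torch.randn(f.shape[1], num_proj, device=f.device)
        proj = proj / proj.norm(dim=0, keepdim=True)
        fp = (f @ proj).sort(dim=0).values
        target = alpha * (ga @ proj).sort(dim=0).values + (1.0 - alpha) * (gb @ proj).sort(dim=0).values
        loss = loss + ((fp - target) ** 2).mean()
    return loss
```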


Figure 2: Sliced Wasserstein Distance: $p_l$ and $q_l$ are projected onto a random unit direction $V$ (left). The 1-dimensional Wasserstein distance can be calculated by taking the difference between the sorted projections $\operatorname{sort}(p_l^V)$ and $\operatorname{sort}(q_l^V)$ (right). Expectation over random vectors $V$ is a practical approximation of the $N$-dimensional Wasserstein distance.
Figure 3: Style Interpolation: An example of style blending using the Wasserstein barycenter between two different style prompts "A person like Marilyn Monroe, pop art style" and "A person like Steve Jobs".

3.5 Implementation Details.

We employ Stable Diffusion XL Podell et al. [2023] as a backbone for the style-aligned image-to-image diffusion pipeline. For NeRF representation, we use the "nerfacto" model implemented in NeRFStudio Tancik et al. [2023]. Due to memory constraints, we generate up to 18 views simultaneously with a fixed seed across all source views. In our experiments, we generated the target images with 50 denoising steps using a range of classifier-free guidance weights (mostly between 5 and 30) depending on the scene or the style text. While we use depth maps rendered by NeRF for relatively compact and forward-facing scenes, we opt for depth estimations from the MiDaS model Ranftl et al. [2020] for large-scale outdoor scenes.

As the image editing is performed before NeRF training, our method allows users to test with different text prompts and parameters (e.g., text guidance scale, SDEdit strength) beforehand. Additionally, our straightforward NeRF training without Iterative DU Haque et al. [2023] or score distillation sampling (SDS) loss Poole et al. [2022] allows the training process to run with less GPU memory, as editing by a diffusion model is not necessary during NeRF updates. We also verify the importance of style-alignment in the ablation study. Please refer to the supplementary material for more implementation details. The overall pipeline of our method is shown in Figure 1.
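To clarify how the decoupled pieces fit together, here is an illustrative sketch of one fine-tuning step: a patch rendered from the pre-trained NeRF for a training camera is compared against the corresponding patch of its stylized image with a perceptual style loss. `nerf_render` is assumed to be a differentiable renderer callable and `style_loss_fn` a loss such as the SWD sketch in Section 3.3; neither name is an actual Nerfstudio API.

```python
import random

def finetune_step(nerf_render, style_loss_fn, optimizer, cameras, stylized_images,
                  patch_size: int = 128) -> float:
    """One illustrative fine-tuning step of the pre-trained source NeRF.

    nerf_render(camera) is assumed to return a differentiable (3, H, W) image,
    and stylized_images[i] is the (3, H, W) style-aligned target for cameras[i]."""
    i = random.randrange(len(cameras))
    image = nerf_render(cameras[i])                            # differentiable novel-view render
    _, height, width = image.shape
    y = random.randrange(height - patch_size)
    x = random.randrange(width - patch_size)
    render_patch = image[None, :, y:y + patch_size, x:x + patch_size]
    target_patch = stylized_images[i][None, :, y:y + patch_size, x:x + patch_size]
    loss = style_loss_fn(render_patch, target_patch)           # plus the nerfacto regularizers in practice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```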

Figure 4: Effect of Style-Alignment: An example of source view conversion applied to "Bear" scene using a text prompt "A bear, impressionism painting style" with and without shared-attention mechanism within the diffusion pipeline. We find that a fully-shared-attention variant of the style-aligned diffusion model Hertz et al. [2023b] greatly improves perceptual consistencies among generated views.

4 Results

We run our experiments on several real-world scenes, including the Instruct-NeRF2NeRF Haque et al. [2023] dataset captured by a smartphone and a mirrorless camera, with camera poses extracted by COLMAP Schönberger and Frahm [2016] and PolyCam. The dataset contains large-scale 360° scenes, objects, and forward-facing human portraits. We show qualitative results and comparisons against several variants to verify the effectiveness of our method design, using CLIP Text-Image Direction Similarity (CLIP-TIDS), a quantitative metric originally introduced in StyleGAN-NADA Gal et al. [2022], and CLIP Directional Consistency (CLIP-DC), a score proposed by Instruct-NeRF2NeRF Haque et al. [2023] that aims to measure the view-consistency between adjacent frames. In addition to the above, we also present comparison results against recent NeRF-based 3D editing methods, Instruct-NeRF2NeRF Haque et al. [2023] and ViCA-NeRF Dong and Wang [2024]. We encourage our readers to see the results in the supplementary video.

4.1 Qualitative Evaluation

Qualitative results are shown in Figure 6. Our method is capable of performing artistic style transfer under various style prompts without hallucinations. We recommend watching the supplementary video to confirm that the stylized scenes are sufficiently view-consistent.

4.2 Ablations

We verify the effectiveness of our method by comparing it against the following variants. An illustration of the comparison results is shown in Figure 5.
  • No Style-Alignment: To examine the importance of preparing perceptually view-consistent stylized images prior to the training process of a source NeRF model, we turn off the full-attention-sharing. Due to the view-inconsistencies in stylized images (See also middle row in figure 4), fine-tuning NeRF on such images will result in an unpredictable mixture of styles.
  • Style-Alignment Train-from-Scratch: In this naive variant, we train a NeRF from scratch using the images generated with our style-aligned diffusion pipeline. Without pre-training of the underlying scene, 3D style transfer produces floating artifacts and shape inconsistencies due to ambiguities in geometry and color of stylized training images.
  • Style-Alignment w/ RGB Loss: This variant trains the source NeRF with a pixel RGB loss instead of the SWD loss. As perceptual view-consistency or a similar style does not guarantee physically consistent geometry and color across different views, training with the RGB loss tends to diverge to a blurry scene. The RGB loss is prone to over-fitting, whereas SWD is a more valid choice for effectively learning the perceptual similarity from style-aligned training images.

(a) Original View · (b) No Style-Alignment · (c) Train from Scratch · (d) Reference Style · (e) Style-Alignment w/ RGB Loss · (f) Ours (NeRF Render)

Figure 5: Baseline Comparisons: We compare our method against several variants. The images show an example comparison of the "Bear" scene edited with the text description "A bear, impressionism painting style" and a text guidance scale of 7.5. Note that (b), (c), (e), and (f) are all novel view renders from NeRF. We encourage our readers to check the results in the video.


4.2.1 Quantitative Evaluation.

We quantitatively measure our method against the variants using CLIP-TIDS and CLIP-DC with a fixed text guidance scale of 15. The results are shown in table 1.
Table 1: Quantitative Evaluation We show both CLIP-TIDS and CLIP-DC results. The values are the average of the novel view render frames over two scenes using five prompts.

|           | No Style-Align | Train from Scratch | Style-Align w/ RGB Loss | Ours |
| CLIP-TIDS | 0.125          | 0.073              | 0.237                   |      |
| CLIP-DC   | 0.917          | 0.917              | 0.928                   |      |

4.3 Method Comparison

We compare our method against Instruct-NeRF2NeRF Haque et al. [2023] and ViCA-NeRF Dong and Wang [2024] on four scenes, including two large-scale outdoor scenes, a 360° object scene (Bear), and a forward-facing scene (a human portrait), using three different text prompts for each scene. Our method exhibits competitive style transfer results, whereas previous methods occasionally suffer from hallucination effects (e.g., the Janus problem) caused by the underlying diffusion model. As the generation of images and the NeRF refinement are separate processes in our method, it is possible to filter out and recreate any images that could have an undesired impact on the NeRF fine-tuning. Visual results are given in figure 7.

4.3.1 User Study 4.3.1 用户研究

For each scene, participants were shown a combination of stylized views rendered by different methods in random order and were asked to select the single view that most closely adheres to the provided stylization text prompt. In our user study, we collected feedback from 33 individuals, resulting in a total of 396 votes. The overall percentage for each preferred method is shown in table 2, indicating that our method can perform competitive artistic style transfer without hallucinations.

4.3.2 Quantitative Comparison

As style transfer is inherently a subjective task, we think that qualitative evaluation by the user is the most important. Nevertheless, we additionally provide quantitative comparison results using CLIP-TIDS and CLIP-DC. Results are included in table 2.
Table 2: Method Comparison Results The metrics are the average of novel view renders over four scenes, each using three prompts. Our method shows the best values for CLIP-TIDS, CLIP-DC, and user preference.

|                    | CLIP-TIDS | CLIP-DC | User Preference |
| Instruct-NeRF2NeRF | 0.081     | 0.871   |                 |
| ViCA-NeRF          | 0.061     | 0.914   |                 |
| Ours               |           |         |                 |

5 Limitations and Future Work

While our method may apply artistic style transfer to various scenes, including large-scale outdoor environments, there are several limitations that we'd like to consider. Thin structures such as plants and trees in the background or delicate texture patterns are challenging to reconstruct due to ambiguities in the stylized multi-view images. For the same reason, our method will struggle to learn fine details if there is too much variation in the training images (e.g. different people or objects in the background, random patterns of clouds in the sky). As the style-aligned diffusion pipeline is conditioned on depth maps, significant editing of geometry is also difficult.

We think our approach is applicable to other types of 3D representations such as 3D Gaussian Splatting Kerbl et al. [2023] and extendable to more features such as scene relighting and deformation, which are exciting directions for further exploration.

6 Conclusion

We propose a novel 3D style-transfer method for NeRF representation leveraging a style-aligned generative diffusion pipeline. By guiding the training process with Sliced Wasserstein Distance or SWD loss, the source 3D scene, pretrained as a NeRF model, is effectively translated into a stylized 3D scene. The method is a relatively straightforward two-step process, allowing the creators to visually search and refine their style concepts by testing various text prompts and guidance scales before fine-tuning the source NeRF model. Our proposed method shows competitive 3D style transfer results compared to previous methods and can blend styles by optimizing the source 3D scene towards the Wasserstein barycenter.

7 Acknowledgements

This work was partially supported by JST Moonshot R&D Grant Number JPMJPS2011, CREST Grant Number JPMJCR2015, and Basic Research Grant (Super AI) of Institute for AI and Beyond of the University of Tokyo. We want to thank Instruct-NeRF2NeRF Haque et al. [2023] authors for sharing their dataset, and Yuki Kato, Shinji Terakawa, and Yoshiaki Tahara for assisting with data capturing.

References

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

Thu Nguyen-Phuoc, Feng Liu, and Lei Xiao. Snerf: stylized neural implicit representations for 3d scenes. arXiv preprint arXiv:2207.02363, 2022.

Can Wang, Ruixiang Jiang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Nerf-art: Text-driven neural radiance fields stylization. IEEE Transactions on Visualization and Computer Graphics, 2023.

Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu, and Eric P Xing. Stylerf: Zero-shot 3d style transfer of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8338-8348, 2023.

Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, and Takuya Narihira. Instruct 3d-to-3d: Text instruction guided 3d-to-3d conversion. arXiv preprint arXiv:2303.15780, 2023.

Ayaan Haque, Matthew Tancik, Alexei Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. 2023.

Jiahua Dong and Yu-Xiong Wang. Vica-nerf: View-consistency-aware 3d editing of neural radiance fields. Advances in Neural Information Processing Systems, 36, 2024.

Cyrus Vachha and Ayaan Haque. Instruct-gs2gs: Editing 3d gaussian splats with instructions, 2024. URL https://instruct-gs2gs.github.io/.

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392-18402, 2023.

Eric Heitz, Kenneth Vanhoey, Thomas Chambon, and Laurent Belcour. A sliced wasserstein loss for neural texture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

Jie Li, Dan Xu, and Shaowen Yao. Sliced wasserstein distance for neural style transfer. Computers & Graphics, 102:89-98, 2022.

Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, et al. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1-12, 2023.

Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf-: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470-5479, 2022.

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1-15, 2022.

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210-7219, 2021.

Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Structured view-dependent appearance for neural radiance fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5481-5490. IEEE, 2022.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694-711. Springer, 2016.

Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501-1510, 2017.

Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. Advances in neural information processing systems, 30, 2017.

Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4990-4998, 2017.

Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. In European Conference on Computer Vision, pages 717-733. Springer, 2022.

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256-2265. PMLR, 2015.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020a.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780-8794, 2021.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1-11, 2023.

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007-6017, 2023.

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921-1930, 2023.

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038-6047, 2023.

Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328-2337, 2023a.

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836-3847, 2023.

Amir Hertz, Andrey Voynov, Shlomi Fruchter, and Daniel Cohen-Or. Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133, 2023b.

Kihyuk Sohn, Lu Jiang, Jarred Barber, Kimin Lee, Nataniel Ruiz, Dilip Krishnan, Huiwen Chang, Yuanzhen Li, Irfan Essa, Michael Rubinstein, et al. Styledrop: Text-to-image synthesis of any style. Advances in Neural Information Processing Systems, 36, 2024.

Bin Cheng, Zuhao Liu, Yunbo Peng, and Yue Lin. General image-to-image translation with one-shot image guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22736-22746, 2023.

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501-5510, 2022.

Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022a.

Cheng Sun, Min Sun, and Hwann-Tzong Chen. Improved direct voxel grid optimization for radiance fields reconstruction. arXiv preprint arXiv:2206.05085, 2022b.

Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1995.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684-10695, 2022.

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234-241. Springer, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020b.

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623-1637, 2020.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Francois Pitie, Anil C Kokaram, and Rozenn Dahyot. N-dimensional probability density function transfer and its application to color transfer. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1434-1439. IEEE, 2005.

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Polycam. URL https://poly.cam/.

Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1-13, 2022.

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), jul 2023. ISSN 0730-0301. doi: 10.1145/3592433. URL https://doi.org/10.1145/3592433.
Figure 6 panels: Original Campsite Scene · "A camp field with yellow tents, blue people, Van Gogh painting style" · "A campfield with orange tents, Hokusai style" · "A white camp field with red tents, heavy snow at dark night" · "A bear, impressionism painting style" · "A blue bear, Vermeer painting style"

Figure 6: Qualitative Results: We show novel view rendering examples of real-world scenes stylized or edited with text descriptions specifying certain artistic styles or environmental changes such as weather conditions.

Figure 7 panels: Campsite Scene, prompt: "A campsite with orange tents, Hokusai style" · Bear Scene, prompt: "A grizzly bear" · Face Scene, prompt: "A person like Marilyn Monroe, pop art style" · ViCA-NeRF · Ours

Figure 7: Method Comparison: A comparison of NeRF stylization methods. While we used a text guidance scale between 15 and 25 for our results, the stylization is controllable via text prompts according to subjective preferences. Note that all images are novel view renders from NeRF.

A Additional Implementation Details

We pre-train the "nerfacto" model implemented in Nerfstudio Tancik et al. [2023] for 60,000 iterations and then fine-tune for 15,000 iterations. We use the default "nerfacto" losses; RGB pixel loss , distortion loss Barron et al. [2022], interlevel loss Barron et al. [2022], orientation loss Verbin et al. [2022], and predicted normal Verbin et al. [2022] for pre-training (equation 12). During fine-tuning, we disable the RGB pixel loss , orientation loss , and predicted normal loss but add the Sliced Wasserstein Distance (SWD) loss Heitz et al. [2021] (equation 13). The total loss function for each phase is as follows:
我们对 Nerfstudio Tancik 等人[2023] 中实施的 "nerfacto "模型进行了 60,000 次迭代预训练,然后进行了 15,000 次迭代微调。我们使用默认的 "nerfacto "损失;RGB 像素损失 、失真损失 Barron 等人 [2022]、层间损失 Barron 等人 [2022]、方向损失 Verbin 等人 [2022] 和预测正常值 Verbin 等人 [2022] 进行预训练(公式 12)。在微调过程中,我们禁用了 RGB 像素损失 、方向损失 和预测正常值损失 ,但增加了切片瓦瑟斯坦距离(SWD)损失 Heitz 等人[2021](等式 13)。每个阶段的总损失函数如下:
where we use the default hyper-parameters for most cases. A greater value for the distortion loss may work better if floating artifacts remain in the scene. We list a brief description of the "nerfacto" losses.
在大多数情况下,我们使用默认的超参数 。如果场景中仍然存在浮动伪影,失真损失值越大可能效果越好。我们列出了 "事实 "损失的简要说明。
  • Distortion Loss: The loss encourages the density along a ray to become compact, aiming to prevent floaters and background collapse. It was proposed in Mip-NeRF 360 Barron et al. [2022].
  • Interlevel Loss: The loss allows the histograms of the point sampling proposal network and the NeRF network to become more consistent. It was also proposed in Mip-NeRF 360 Barron et al. [2022].
  • Orientation Loss: The loss aims to prevent "foggy" surfaces by penalizing visible samples with predicted normals facing the ray direction. It was introduced in Ref-NeRF Verbin et al. [2022].
  • Predicted Normal Loss: The loss enforces the predicted normals to be consistent with density gradient normals. It is often used in conjunction with the orientation loss.
For detailed definitions of the "nerfacto" losses, please see Mip-NeRF 360 Barron et al. [2022] and Ref-NeRF Verbin et al. [2022].
The SWD loss $\mathcal{L}_{\text{SWD}}$ is applied to patches sampled during NeRF fine-tuning. Although we found that the patch size we used empirically works sufficiently, one may change the patch size accordingly. We run all our experiments with Python 3.10 and CUDA 11.8 on a single NVIDIA H100. Although we optionally enable SDEdit Meng et al. [2021] in our style-aligned diffusion pipeline, we recognized that depth maps are often enough for conditioning on the original views. In such cases, we use 1.0 as the strength for SDEdit.

B CLIP Text-Image Direction Similarity

CLIP Text-Image Direction Similarity (CLIP-TIDS) Gal et al. [2022] is a metric for evaluating how well the change in the stylized image is aligned with the user-provided text prompt. Given the CLIP image encoder $E_I$ and text encoder $E_T$, CLIP-TIDS between the source image $x$ and the stylized image $\hat{x}$ is calculated as:

$$\Delta I = E_I(\hat{x}) - E_I(x), \quad \Delta T = E_T(\hat{t}) - E_T(t), \quad \text{CLIP-TIDS} = \frac{\Delta I \cdot \Delta T}{\left\| \Delta I \right\| \left\| \Delta T \right\|},$$

where $t$ and $\hat{t}$ are text descriptions describing the images (e.g., "A photo of a person", "A person like Vincent Van Gogh").
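A minimal sketch of computing CLIP-TIDS with the Hugging Face transformers CLIP model follows; the checkpoint name and preprocessing choices are assumptions, and the formula mirrors the definition above.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_tids(source_image, stylized_image, source_text: str, target_text: str) -> float:
    """Images are PIL images; texts describe the source and the target style."""
    image_inputs = processor(images=[source_image, stylized_image], return_tensors="pt")
    text_inputs = processor(text=[source_text, target_text], return_tensors="pt", padding=True)
    image_features = clip.get_image_features(**image_inputs)   # (2, d)
    text_features = clip.get_text_features(**text_inputs)      # (2, d)
    delta_i = image_features[1] - image_features[0]            # image-space edit direction
    delta_t = text_features[1] - text_features[0]              # text-space edit direction
    return torch.nn.functional.cosine_similarity(delta_i[None], delta_t[None]).item()
```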

C Effect of Text Guidance Scale

In figure 8, we show an example of NeRF renderings using the same style prompt "A person like Vincent Van Gogh" but different text guidance scales. We may see that the style strength is controllable while keeping the original content structure grounded on the original image. We also show CLIP-TIDS for each text guidance scale in table 3. A stronger text guidance scale will lead to higher CLIP-TIDS.
Table 3: CLIP-TIDS Comparison CLIP-TIDS values over test renders for each text guidance scale are shown. We can verify that a stronger text guidance scale results in higher CLIP-TIDS values.

| Text Guidance Scale |        |        |        |        |        |
| CLIP-TIDS           | 0.0693 | 0.1459 | 0.1520 | 0.1594 | 0.1718 |
Figure 8: Effect of Text Guidance Scale: We show some NeRF rendering results for the prompt "A person like Vincent Van Gogh" using various text guidance scales.

D Limitations

While our method can effectively perform overall style transfer to 3D scenes, it is still difficult to reconstruct the detailed structure of fluctuating objects within the stylized multi-view images. In figure 9, for example, (c) the clouds in the fine-tuned NeRF scene show a fractal-like pattern, which is different from (b) the clouds illustrated in the stylized image. This phenomenon is due to the ambiguities of cloud positions or shapes appearing in the stylized images generated by the style-aligned diffusion pipeline Hertz et al. [2023b]. We leave the development of a more robust content-structure-preserving style transfer technique as future work.

(a) Original View · (b) Stylized 2D Image · (c) Fine-tuned NeRF Rendering

Figure 9: Limitations: Due to remaining ambiguities in the stylized multi-view images, fluctuating objects such as clouds may lose their detailed shape in the fine-tuned NeRF renderings. We wish to improve on this in our future work.