License: CC BY 4.0
arXiv:2412.09492v1 [cs.MM] 12 Dec 2024

Meta FAIR. *Equal contribution.

Video Seal: Open and Efficient Video Watermarking

Pierre Fernandez, Hady Elsahar, I. Zeki Yalniz, Alexandre Mourachko
Abstract

The proliferation of AI-generated content and sophisticated video editing tools has made it both important and challenging to moderate digital platforms. Video watermarking addresses these challenges by embedding imperceptible signals into videos, allowing for identification. However, the rare open tools and methods often fall short on efficiency, robustness, and flexibility. To reduce these gaps, this paper introduces Video Seal, a comprehensive framework for neural video watermarking and a competitive open-sourced model. Our approach jointly trains an embedder and an extractor, while ensuring the watermark robustness by applying transformations in-between, e.g., video codecs. This training is multistage and includes image pre-training, hybrid post-training and extractor fine-tuning. We also introduce temporal watermark propagation, a technique to convert any image watermarking model to an efficient video watermarking model without the need to watermark every high-resolution frame. We present experimental results demonstrating the effectiveness of the approach in terms of speed, imperceptibility, and robustness. Video Seal achieves higher robustness compared to strong baselines especially under challenging distortions combining geometric transformations and video compression. Additionally, we provide new insights such as the impact of video compression during training, and how to compare methods operating on different payloads. Contributions in this work – including the codebase, models, and a public demo – are open-sourced under permissive licenses to foster further research and development in the field.

Correspondence: pfz@meta.com, hadyelsahar@meta.com
Code: https://github.com/facebookresearch/videoseal
Demo: https://aidemos.meta.com/videoseal

Figure 1: Overview of digital video watermarking. A binary message is embedded into an original video (e.g., an AI-generated video), producing an imperceptible change in the pixels. This watermarked video may be compressed or edited when saved or shared online. Despite these transformations, the watermark extraction process should retrieve the embedded message. The two primary challenges in this process are (1) the speed of embedding and extraction, which must be computationally efficient to handle the large number of frames in a video, and (2) robustness to common video codecs that often degrade the watermark to the point of being undetectable.

1 Introduction

Within digital media, video watermarking has always been a very active field of research. The film industry, including Hollywood studios and streaming websites, has been particularly invested in developing robust video watermarking techniques to fight against piracy. However, with the rapid advancement of technology, new challenges and applications have emerged. For instance, the development of generative AI models for images, like DALL·E (Ramesh et al., 2022) or Stable Diffusion (Rombach et al., 2022), and videos like Sora (Brooks et al., 2024) or MovieGen (Polyak et al., 2024), raises concerns about the spread of misinformation and general misuse of such technology. Regulators (Chi, 2023; Eur, 2023; USA, 2023) are now pushing generative model providers to embed watermarks into the generated content to ease detection and attribution of said content. Additionally, they also encourage hardware providers to watermark real data at the physical device level (California State Leg., 2024), which requires fast embedding and detection. All this requires the development of robust and efficient video watermarking techniques that can keep pace with the rapidly evolving landscape of digital media and AI-generated content.

It may seem logical to simply decompose videos into their constituent frames and leverage well-established image watermarking techniques to embed watermarks into each frame separately. This approach, however, is hindered by two significant limitations. Firstly, the computational load of watermarking every frame is prohibitively high, particularly for high-resolution videos with high frame rates. Processing videos as clips (chunks of frames) for embedding or extraction can help with parallelization, but large clips exceed memory limits, while smaller clips introduce synchronization issues, complicating watermark extraction. Secondly, the widespread use of video compression codecs such as AV1 (Alliance for Open Media, 2018) and H.264 (Richardson, 2010) along with the ease of access to free video editing software and social media filters poses a significant challenge to video watermarking. Whenever a video is downloaded, or shared on social media platforms, these codecs are often automatically applied, storing videos as keyframes, intraframes, and optical flows that enable frame decoding through interpolation. This process substantially reduces redundancy in videos, resulting in a strong decrease in the watermark signal. Consequently, even when computational efficiency is no longer a concern, image watermarking models may still struggle to remain effective in the face of these codecs and video editing tools, underscoring the need for video-specific watermarking solutions.

There have been some works on neural video watermarking addressing the aforementioned challenges. For instance, DVMark (Luo et al., 2023) employs a compression network to simulate video compression, while VHNet (Shen et al., 2023) leverages a similar trick for steganography applications. ItoV (Ye et al., 2023) adapts architectures from image models to video watermarking by merging the temporal dimension of the videos with the channel dimension, enabling deep neural networks to treat videos as images. It also employs a straight-through estimator to allow for gradient flow through the compression augmentation (see Sec. 6 for a comprehensive literature review). However, despite these efforts, several limitations persist. Notably, most existing models are restricted to low-resolution videos (e.g., 128×128) or short clips (e.g., 64 frames), rendering them impractical for real-world applications.

Most importantly, there is a lack of reproducibility in existing research on video watermarking. To our knowledge, none of the existing video watermarking models have been publicly released, hindering fair comparisons and reproducibility. This omission not only undermines the validity of the reported results but also stifles progress in the field.

In this paper, we introduce Video Seal, a state-of-the-art video watermarking model that sets a new standard for efficiency and robustness. By leveraging temporal watermark propagation, a novel technique that converts any image watermarking model into an efficient video watermarking model, Video Seal eliminates the need to watermark every frame in a video. Video Seal also employs a multistage training that includes image pre-training, hybrid post-training, and extractor fine-tuning. This training regimen is supplemented with a range of differentiable augmentations, including the popular H.264 codec, allowing Video Seal to withstand common video transformations and high compression rates.

Due to the scarcity of reproducible baselines for video watermarking, we adapt state-of-the-art image watermarking models to create strong baselines using the temporal watermark propagation technique. This adaptation is a significant contribution of this paper, as it provides a much-needed foundation for evaluating and comparing video watermarking techniques. Video Seal outperforms strong image baselines, including MBRS (Jia et al., 2021), TrustMark (Bui et al., 2023) and WAM (Sander et al., 2024), in terms of robustness under basic geometric transformations such as cropping, small rotations, and perspective changes. Although MBRS and TrustMark offer higher message capacities (256 and 100 bits, respectively), their design and training limitations make them vulnerable to degradation under these common transformations, which limits their real-world applicability.

We also conduct ablation studies to investigate the impact of each component of the video inference and of our model training, including multistage training, differentiable compressions, and extractor fine-tuning. Our results show that extractor fine-tuning allows for extra gains in bit accuracy and increased robustness without compromising the quality measure through PSNR. Furthermore, we find that the most effective multistage training involves pre-training on images, followed by video training with the differentiable compression augmentation, which yields significant improvements in bit accuracy, particularly at higher compression rates.

To facilitate future research and development in video watermarking, we release several artifacts under a permissive license: model checkpoints, training and evaluation code, as well as a demo endpoint to test the models in action. We hope that the released models, along with the experiments, insights, and baselines, will serve the community and boost research in video watermarking. As a summary, our contributions are:

  • We introduce Video Seal, an open-source video watermarking model that sets a new standard for efficiency and robustness. Using a novel temporal watermark propagation technique, Video Seal enables fast inference times by eliminating the need to individually watermark each frame in a video.
  • We release a comprehensive and easy-to-use codebase for training and evaluation, as well as a demo that enables effortless testing of our models.
  • We propose a multistage training that includes image pre-training, hybrid post-training, and extractor fine-tuning, supplemented with a range of differentiable augmentations, including multiple video codecs, allowing Video Seal to withstand common video transformations and high compression rates.
  • Through extensive experimentation, we gain valuable insights into the impact of video compression during training, the role of image and video data in training video watermarking models, and other key factors influencing model performance. These findings contribute to a deeper understanding of video watermarking and inform the development of more effective models.

2 Method

We adopt the embedder/extractor framework originally developed for image watermarking by Zhu et al. (2018) and extend it to videos in a similar way as Ye et al. (2023). We focus on speed and practicality. Our approach operates in 2D to ensure streamability, simplify extraction, and maintain flexibility. This design also enables a unified embedder-extractor mechanism for both images and videos. Our models are based on state-of-the-art architectures trained on longer schedules with a comprehensive set of augmentations that include video codecs. They are effective at any resolution and for videos of any length.

2.1 Embedder & extractor architectures

Our architectures are deliberately kept small and efficient to facilitate inference and to possibly run on mobile devices. The embedder is based on an efficient U-Net architecture with 16M parameters in total, while the extractor is based on a vision transformer with 24M parameters. The number of bits $n_\text{bits}$ is set to 96.

2.1.1 Embedder

The embedder takes as input a frame $x \in \mathbb{R}^{3\times 256\times 256}$ and a binary message $m \in \{0,1\}^{n_\text{bits}}$, and outputs a watermarked frame $x_w \in \mathbb{R}^{3\times 256\times 256}$ that slightly differs from the original. Its architecture is detailed in Tab. 1. It is based on a shrunk U-Net architecture (Ronneberger et al., 2015; Bui et al., 2023), with modifications taken from the "Efficient U-Net" of Imagen (Saharia et al., 2022). The message embedding happens in the bottleneck, which operates at a lower resolution. It is done through a binary message lookup table $\mathcal{T}$ structured to facilitate the embedding of binary messages into the latent representation of the frame, as previously presented by San Roman et al. (2024); Sander et al. (2024).

The U-Net consists of an encoder-decoder structure with skip connections, allowing the image information to be preserved throughout the network while most operations are performed at a lower resolution. The encoder path begins with an initial residual block "ResNetBlock" that processes the input image of shape $3 \times 256 \times 256$ into a feature map of shape $d_\text{z}/8 \times 256 \times 256$. This is followed by a series of downsampling blocks "DBlocks", which progressively reduce the spatial dimensions and increase the feature depth, resulting in feature maps of shapes $d_\text{z}/4 \times 128 \times 128$, $d_\text{z}/2 \times 64 \times 64$, and $d_\text{z} \times 32 \times 32$. Each DBlock is made of a bilinear downsampling of factor 2 followed by a ResNet block. The message processor, described in the following paragraph, then integrates the message into the deepest feature map, producing a tensor of shape $(d_\text{z} + d_\text{msg}) \times 32 \times 32$. The bottleneck consists of multiple residual blocks which merge the message and the image representations. The decoder path mirrors the encoder, using "UBlocks" to upsample the feature maps back to the original spatial dimensions, with shapes $d_\text{z}/2 \times 64 \times 64$, $d_\text{z}/4 \times 128 \times 128$, and $d_\text{z}/8 \times 256 \times 256$. In particular, we choose not to use deconvolution layers (ConvTranspose2D) because of the checkerboard patterns they introduce (Odena et al., 2016), and use bilinear interpolation instead. Each UBlock incorporates skip connections from the corresponding encoder layers, preserving information from the original image. The final output is produced by a Conv2D layer, resulting in an image of shape $C \times 256 \times 256$. Each ResNetBlock is composed of two convolutional layers with RMSNorm (Zhang and Sennrich, 2019) and SiLU (Elfwing et al., 2018) activation, and includes a linear residual connection implemented as a Conv2D layer with a kernel size of 1.

The binary message lookup table $\mathcal{T}$ has a shape of $(n_\text{bits}, 2, d_\text{msg})$. The 2 accounts for the binary values (0 or 1) each bit can take, and $d_\text{msg}$ is the dimensionality of the embedding space. For each bit $m_k$ in the message, indexed by $k \in \{1, \ldots, n_\text{bits}\}$, the table maps the bit to an embedding vector $\mathcal{T}(k, m_k, \cdot) \in \mathbb{R}^{d_\text{msg}}$. These embeddings are averaged to produce a single vector of size $d_\text{msg}$, capturing the overall message. This averaged vector is then repeated to match the spatial dimensions of the latent space $(d_\text{msg}, 32, 32)$. The resulting message tensor is concatenated with the latent representation of the frame, yielding an activation tensor of shape $(d_\text{z} + d_\text{msg}) \times 32 \times 32$. For our embedder, we use $d_\text{z} = 128$ and $d_\text{msg} = 192$.
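To make this step concrete, the following is a minimal PyTorch sketch of such a lookup-table message processor with the dimensions used here ($n_\text{bits}=96$, $d_\text{msg}=192$); the class and variable names are illustrative and not taken from the released codebase.

```python
import torch
import torch.nn as nn

class MessageProcessor(nn.Module):
    """Sketch of a lookup-table message processor: maps a binary message to a
    (d_msg, 32, 32) tensor and concatenates it with the frame latents."""

    def __init__(self, nbits: int = 96, d_msg: int = 192):
        super().__init__()
        # Lookup table of shape (nbits, 2, d_msg): one embedding per (bit position, bit value).
        self.table = nn.Parameter(0.02 * torch.randn(nbits, 2, d_msg))

    def forward(self, z: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        # z:   (B, d_z, 32, 32) latent representation of the frame
        # msg: (B, nbits) binary message with entries in {0, 1}
        b, _, h, w = z.shape
        k = torch.arange(self.table.shape[0], device=msg.device)   # bit positions 0..nbits-1
        emb = self.table[k.unsqueeze(0), msg.long()]                # (B, nbits, d_msg)
        z_msg = emb.mean(dim=1)                                     # average over bits -> (B, d_msg)
        z_msg = z_msg[:, :, None, None].expand(-1, -1, h, w)        # repeat spatially -> (B, d_msg, 32, 32)
        return torch.cat([z, z_msg], dim=1)                         # (B, d_z + d_msg, 32, 32)
```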

Table 1: High-level architecture of the encoder and decoder of the watermark embedder.
Encoder path | Bottleneck and decoder path
$x \in \mathbb{R}^{3\times H\times W}$ | $(z, z_\text{msg}) \in \mathbb{R}^{(d_\text{z}+d_\text{msg})\times 32\times 32}$
Interpolate, ResNetBlock $\to \mathbb{R}^{d_\text{z}/8\times 256\times 256}$ | Bottleneck Residual Blocks $\to \mathbb{R}^{d_\text{z}\times 32\times 32}$
DBlock $\to \mathbb{R}^{d_\text{z}/4\times 128\times 128}$ | UBlock $\to \mathbb{R}^{d_\text{z}/2\times 64\times 64}$
DBlock $\to \mathbb{R}^{d_\text{z}/2\times 64\times 64}$ | UBlock $\to \mathbb{R}^{d_\text{z}/4\times 128\times 128}$
DBlock $\to \mathbb{R}^{d_\text{z}\times 32\times 32}$ | UBlock $\to \mathbb{R}^{d_\text{z}/8\times 256\times 256}$
Message embedding, Repeat $\to \mathbb{R}^{d_\text{msg}\times 32\times 32}$ | Final Conv2D $\to \mathbb{R}^{3\times 256\times 256}$
Table 2: High-level architecture of the watermark extractor.
Image encoder (ViT) | Patch decoder (CNN)
$x \in \mathbb{R}^{3\times H\times W}$ | $z \in \mathbb{R}^{d'\times 16\times 16}$
Interpolation $\to \mathbb{R}^{3\times 256\times 256}$ | Residual Block $\to \mathbb{R}^{d'\times 16\times 16}$
Patch Embed (Conv2D), Pos. Embed $\to \mathbb{R}^{d\times 16\times 16}$ | Average pooling $\to \mathbb{R}^{d'}$
$L \times$ {Transformer Block} $\to \mathbb{R}^{d\times 16\times 16}$ | Linear $\to \mathbb{R}^{n_\text{bits}}$
LayerNorm, GELU, Conv2D $\to \mathbb{R}^{d'\times 16\times 16}$ | Sigmoid (optional) $\to \mathbb{R}^{n_\text{bits}}$

2.1.2 Extractor

The extractor takes as input a frame $x \in \mathbb{R}^{3\times 256\times 256}$ and outputs a "soft" message $\tilde{m} \in \mathbb{R}^{n_\text{bits}}$ which can be thresholded to recover a "hard" binary message $\hat{m} \in \{0,1\}^{n_\text{bits}}$ (soft because continuous, hard because binary). Its architecture is detailed in Tab. 2. It is based on a vision transformer (ViT) (Dosovitskiy, 2020) followed by a patch decoder and an average pooling layer that maps to an $n_\text{bits}$-dimensional vector.

The ViT consists of a series of attention blocks to process image patches into a high-dimensional feature space. We use the ViT-Small architecture (Touvron et al., 2021) (22M parameters), with patch size 16 and $d = d' = 384$. The patch embeddings are processed by a residual block, which is made of a Conv2D with a kernel size of 3 and a stride of 1, a LayerNorm, and a GELU activation, with the number of channels equal to the number of input channels. We obtain a latent map of shape $(d', 256, 256)$, which is average-pooled and mapped to $n_\text{bits}$-dimensional pixel features by a linear layer. Finally, a Sigmoid layer scales the outputs to $[0, 1]$ (this is in fact only done at inference time, since the training objective implicitly applies it in PyTorch).
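As an illustration of the patch-decoder side of Tab. 2, here is a minimal PyTorch sketch of an extractor head applied on top of ViT patch tokens. The backbone is assumed to be external, the normalization is a stand-in (GroupNorm in place of a channel-wise LayerNorm on 2D maps), and all names are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ExtractorHead(nn.Module):
    """Sketch of the patch decoder: residual conv block on the patch grid,
    average pooling, and a linear map to n_bits soft-message logits."""

    def __init__(self, d: int = 384, nbits: int = 96):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(d, d, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(1, d),  # stand-in for a channel-wise LayerNorm on 2D maps
            nn.GELU(),
        )
        self.linear = nn.Linear(d, nbits)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 256, d) patch tokens from a ViT with patch size 16 on 256x256 inputs
        b, n, d = tokens.shape
        hw = int(n ** 0.5)
        z = tokens.transpose(1, 2).reshape(b, d, hw, hw)  # (B, d, 16, 16) patch grid
        z = z + self.block(z)                             # residual block
        z = z.mean(dim=(2, 3))                            # average pooling -> (B, d)
        return self.linear(z)                             # (B, nbits) logits; sigmoid at inference
```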

2.2 Video inference

Our embedder and extractor are designed to work on individual frames of fixed resolution ($256 \times 256$). To operate in an efficient manner on videos, we use a few tricks to speed up the embedding and extraction processes. Namely, we downscale frames to the fixed resolution, embed the watermark every $k$ frames, upscale the watermark to the original resolution, and propagate the watermark signal to the $k-1$ neighboring frames. This is illustrated in Fig. 2 and detailed in the following paragraphs.

Figure 2: Illustration of the embedding process for video watermarking including temporal watermark propagation. To minimize computational overhead, the embedder processes every $k$ frames of the video independently, producing a watermark signal that is copied along the temporal axis to the $k$ neighboring frames. Additionally, the embedding is performed on a downscaled version of the video and the watermark is later upscaled to match the original resolution. This approach helps balance efficiency and robustness.

2.2.1 High-resolution and scaling factor

Our embedder and extractor are trained at a fixed resolution of $256 \times 256$. To extend them to higher resolutions, we use the same trick as presented by Bui et al. (2023); Sander et al. (2024).

Given a frame $x$ of size $H \times W$, we first downscale it to $256 \times 256$ using bilinear interpolation. The embedder takes the downsampled frame and the message as input and produces the watermark distortion $w$. We then upscale $w$ to the original resolution – again using bilinear interpolation – and add it to the original frame to obtain the watermarked frame:

$x_w = x + \alpha_w \cdot \text{resize}(w), \quad w = \text{Embedder}(\text{resize}(x), m).$   (1)

$\alpha_w$ is called the scaling factor and controls the strength of the watermark. It may be adjusted at inference time to trade quality for robustness. In the following sections of the paper, we say that $\alpha_w$ is "nominal" at inference when it is set to the same value as during training.

We proceed similarly for extraction and resize all frames to $256 \times 256$ before giving them to the extractor.
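A minimal PyTorch sketch of this resize-embed-upscale step (Eq. 1) is given below; `embedder` is assumed to be a module returning a $3 \times 256 \times 256$ watermark distortion, and the final clamping to a valid pixel range is an added assumption.

```python
import torch
import torch.nn.functional as F

def embed_highres(frame: torch.Tensor, msg: torch.Tensor, embedder: torch.nn.Module,
                  alpha_w: float = 1.0) -> torch.Tensor:
    """Eq. (1): downscale, embed at 256x256, upscale the distortion, add it to the frame.
    frame: (B, 3, H, W) in [0, 1]; msg: (B, nbits) binary message."""
    h, w = frame.shape[-2:]
    small = F.interpolate(frame, size=(256, 256), mode="bilinear", align_corners=False)
    delta = embedder(small, msg)                                     # watermark distortion at 256x256
    delta = F.interpolate(delta, size=(h, w), mode="bilinear", align_corners=False)
    return (frame + alpha_w * delta).clamp(0, 1)                     # watermarked frame x_w
```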

2.2.2 Temporal watermark propagation

Watermarking each frame of a video can be computationally costly. To mitigate this, a trick suggested in the codebase by Xian et al. (2024) is to watermark every $k$ frames instead. However, this approach complicates the extraction process. Indeed, leaving some frames unwatermarked can compromise the robustness of the watermark under temporal editing and video compression algorithms. Even without any video editing, the extractor signal will be mixed with a lot of signal coming from unwatermarked frames, which will reduce the accuracy of the extraction.

In our approach, called temporal watermark propagation, the video is divided into segments of $k$ frames; the first frame of each segment is passed through the embedder to generate a watermark distortion, which is then copied to the $k-1$ subsequent frames within the segment. More rigorously, let $\mathbf{x}_i \in \mathbb{R}^{3\times 256\times 256}$ denote the $i$-th frame of the video, and $\mathbf{w}_i \in \mathbb{R}^{3\times 256\times 256}$ denote the watermark distortion of $\mathbf{x}_i$. Let $m \in \{0,1\}^{n_\text{bits}}$ denote the binary message to be embedded. Temporal watermark propagation can be formulated as follows:

$\mathbf{w}_i = \begin{cases} \text{Embedder}(\mathbf{x}_i, m), & \text{if } i \bmod k = 0,\\ \mathbf{w}_{i-1}, & \text{otherwise.} \end{cases}$   (2)

In practice, if $k$ is set to 1, the watermark is applied to every frame of the video, and temporal watermark propagation is equivalent to watermarking each frame independently. As $k$ increases, the efficiency of the embedding increases. At the same time, it introduces some noise in the extraction process because we approximate the watermark signal in the unwatermarked frames. It may also introduce "shadow" artifacts if the video contains a lot of motion, as the distortion often follows the image content. In practice, $k$ is set small enough for these two reasons; $k = 4$ in this work. Note that this operation is fully differentiable, allowing for the optimization of both imperceptibility and robustness during training.
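The following is a minimal PyTorch sketch of temporal watermark propagation (Eq. 2) on a clip of already-downscaled frames; the function name and the batched call to `embedder` are assumptions.

```python
import torch

def propagate_watermark(frames: torch.Tensor, msg: torch.Tensor,
                        embedder: torch.nn.Module, k: int = 4) -> torch.Tensor:
    """Eq. (2): embed every k-th frame and copy its distortion to the k-1 following frames.
    frames: (T, 3, 256, 256); msg: (1, nbits). Returns per-frame distortions w_i."""
    t = frames.shape[0]
    anchors = frames[::k]                                           # frames with i mod k == 0
    deltas = embedder(anchors, msg.expand(anchors.shape[0], -1))    # one distortion per anchor
    # Repeat each anchor distortion k times along the temporal axis, then trim to T frames.
    return deltas.repeat_interleave(k, dim=0)[:t]
```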

2.2.3 Extraction

The watermark extraction processes each frame $\mathbf{x}_i$ independently before aggregating the soft messages $\tilde{\mathbf{m}}_i$ over the entire video. For aggregation, we simply average the soft messages across all frames, and threshold the average to obtain the hard message contained in the video, $\hat{m}$:

$\hat{m}_k = \begin{cases} 1 & \text{if } \frac{1}{T}\sum_{i=1}^{T} \tilde{\mathbf{m}}_{i,k} > 0 \\ 0 & \text{otherwise} \end{cases}$, with $\hat{m}_k$ the bit at position $k$.   (3)

where $T$ is the number of frames on which the extraction is done. In particular, one may choose to extract the watermark on certain frames – for instance the first ones only or the whole video – to increase robustness or to speed up the extraction process. This aggregation is chosen for simplicity and speed, but more advanced aggregation methods could be used, as studied in Sec. 5.3.
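A minimal PyTorch sketch of this extraction and aggregation (Eq. 3) follows; `extractor` is assumed to return pre-sigmoid logits, so averaging and thresholding at 0 yields the hard message.

```python
import torch
import torch.nn.functional as F

def extract_video_message(frames: torch.Tensor, extractor: torch.nn.Module) -> torch.Tensor:
    """Eq. (3): average per-frame soft messages over T frames and threshold at 0.
    frames: (T, 3, H, W); returns the hard binary message of shape (nbits,)."""
    small = F.interpolate(frames, size=(256, 256), mode="bilinear", align_corners=False)
    soft = extractor(small)          # (T, nbits) soft messages (logits)
    avg = soft.mean(dim=0)           # average over the T frames
    return (avg > 0).int()           # hard message m_hat in {0, 1}^nbits
```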

Figure 3: Detailed optimization pipeline of Video Seal. The embedder takes a batch of input images or a sequence of video frames $x$ and random binary messages $m$, and outputs a batch of watermarked images or frames $x_w$. Differentiable transformations are randomly applied to $x_w$ to simulate real-world transmissions, such as crops, brightness changes, or video compression. The extractor then processes these transformed images to estimate the embedded messages $\tilde{m}$. The watermark embedder and extractor are trained jointly to minimize two objectives: the message reconstruction loss and the mean squared error (MSE) between the original images $x$ and the watermarked images $x_w$. Additionally, they are trained to maximize the adversarial loss against a quality discriminator. In a separate optimization step, the quality discriminator $D_q$ itself is trained to distinguish between the watermarked and original images, while keeping the embedder and extractor parameters fixed.

2.3 Training pipeline

In this section, we describe our method in detail, including image pre-training, mixed training with videos and images, and embedder freezing. Our training pipeline follows the traditional embedder/extractor approach (Zhu et al., 2018), illustrated in Fig. 3. The embedder takes as input a batch of images or video frames and a binary message and produces watermarked images or frames. The extractor then attempts to recover the original message from them. We adopt a multistage training strategy that combines the benefits of image and video training. The following paragraphs detail these stages.

2.3.1 Training objectives

The training process involves minimizing a combination of perceptual losses and an extraction loss. The perceptual losses ensure that the watermark is imperceptible, while the extraction losses ensure that the extractor’s output is close to the original message. The optimizer minimizes the following objective function:

$\mathcal{L} = \lambda_\text{disc} \mathcal{L}_\text{disc} + \lambda_\text{i} \mathcal{L}_\text{i} + \lambda_\text{w} \mathcal{L}_\text{w},$   (4)

where $\lambda_\text{disc}$, $\lambda_\text{i}$, and $\lambda_\text{w}$ are the weights of the discriminative loss, the image perceptual loss, and the watermark extraction loss, defined in the following paragraphs.

Extraction loss.

The watermark extraction loss ensures that the extracted message $\tilde{m}$ is as close as possible to the original message $m$. We use the average binary cross-entropy (BCE) loss:

$\mathcal{L}_\text{w} = -\frac{1}{n_\text{bits}} \sum_{k=1}^{n_\text{bits}} \text{BCE}(m_k, \tilde{m}_k)$, with $\text{BCE}(m_k, \tilde{m}_k) = m_k \log(\tilde{m}_k) + (1 - m_k) \log(1 - \tilde{m}_k)$.   (5)
Perceptual losses.

Additionally, we compute the Mean Squared Error (MSE) between the original image $x$ and the watermarked image $x_w$, given by:

$\mathcal{L}_\text{i} = \frac{1}{N} \sum_{i=1}^{N} (x_i - x_{w,i})^2,$   (6)

where $N$ is the number of pixels in the image. Although we experimented with more advanced perceptual models such as LPIPS (Zhang et al., 2018) and Watson perceptual models (Czolbe et al., 2020), gains were not significant enough to justify their complexity.

Quality discriminator loss.

We use adversarial training with a patch-based discriminator $D$ (Isola et al., 2017; Rombach et al., 2022), and the update rules presented by Lim and Ye (2017).

During the embedder-extractor update, we optimize the adversarial loss to ensure that the watermarked image $x_w$ is indistinguishable from real images. This loss is given by:

$\mathcal{L}_\text{disc} = -D_q(x_w),$

where $D_q(\cdot)$ represents the quality discriminator's output in raw logits.

In a separate optimization step, the quality discriminator itself is optimized by minimizing the dual-hinge discriminator loss $\mathcal{L}_\text{disc'}$, which forces the quality discriminator to correctly classify both original images $x$ and watermarked images $x_w$, and therefore presents a strong challenge to the embedder. This loss is defined as:

$\mathcal{L}_\text{disc'} = \frac{1}{2}\left(\max(0, 1 - D_q(x)) + \max(0, 1 + D_q(x_w))\right),$

where the hinge function $\max(0, 1 - z)$ penalizes incorrect classifications.
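For reference, a minimal PyTorch sketch of these objectives is given below, using `binary_cross_entropy_with_logits` since the sigmoid is only applied at inference; `d_real` and `d_fake` denote the raw discriminator logits on $x$ and $x_w$, and the function names are illustrative rather than those of the released codebase.

```python
import torch
import torch.nn.functional as F

def extraction_loss(logits: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
    """Eq. (5): average BCE between the soft message (logits) and the original bits."""
    return F.binary_cross_entropy_with_logits(logits, msg.float())

def image_loss(x: torch.Tensor, x_w: torch.Tensor) -> torch.Tensor:
    """Eq. (6): MSE between the original and watermarked frames."""
    return F.mse_loss(x_w, x)

def adversarial_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Embedder-side term L_disc = -D_q(x_w), averaged over the batch."""
    return -d_fake.mean()

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Dual-hinge loss minimized in the separate discriminator update step."""
    return 0.5 * (F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean())
```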

Balancer.

To balance the different loss components and to stabilize training, we compute adaptive weights as done in previous works (Défossez et al., 2022; Rombach et al., 2022). Our balancer is based on the norm of the gradients of each loss with respect to the last layer of the embedder (in the case of the U-Net this corresponds to the weights of the final convolution that maps to $\mathbb{R}^{3\times 256\times 256}$). Each loss $\mathcal{L}_k$, where $k \in \{\text{disc}, \text{i}, \text{w}\}$, is rescaled by the norm of its gradient:

$\tilde{\lambda}_k = \frac{\lambda_k}{\sum_{k'} \lambda_{k'}} \cdot \frac{R}{\|\nabla_\theta(\mathcal{L}_k)\| + \epsilon},$   (7)

where $R$ is a constant representing the total gradient norm – set to 1 as in EnCodec (Défossez et al., 2022) –, $\theta$ represents the parameters of the last layer, and $\epsilon$ is a small constant to avoid division by zero. Eventually, we backpropagate $\tilde{\mathcal{L}} = \tilde{\lambda}_\text{disc} \mathcal{L}_\text{disc} + \tilde{\lambda}_\text{i} \mathcal{L}_\text{i} + \tilde{\lambda}_\text{w} \mathcal{L}_\text{w}$ instead of $\mathcal{L}$ in Eq. 4.
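A minimal sketch of such a balancer is shown below; it rescales each loss by the norm of its gradient with respect to the embedder's last layer, following Eq. 7. The exact bookkeeping of the released implementation may differ, and the function name and single-parameter interface are assumptions.

```python
import torch

def balance_losses(losses: dict, lambdas: dict, last_layer: torch.nn.Parameter,
                   R: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (7): rebalance the {'disc', 'i', 'w'} losses by their gradient norms on the
    last layer of the embedder, and return the total loss to backpropagate."""
    total_lambda = sum(lambdas.values())
    total = 0.0
    for key, loss in losses.items():
        # Gradient of this loss w.r.t. the last layer only; keep the graph for the final backward.
        (grad,) = torch.autograd.grad(loss, last_layer, retain_graph=True)
        scale = (lambdas[key] / total_lambda) * (R / (grad.norm() + eps))
        total = total + scale * loss
    return total
```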

2.3.2 Multistage training

Image pre-training and hybrid post-training.

Our approach employs a multistage training strategy, where we first pre-train our model on images and then continue training on a mix of images and videos using a scheduled approach. This approach has several benefits. First, it allows us to leverage the faster training times of image-based models while still adapting to video-specific distortions. Second, as we show in Sec. 5.1, it provides more stable training and yields significant improvements in terms of bit accuracy and robustness to higher compression rates. During the pre-training phase, we train our model solely on images for a specified number of epochs. We then transition to a hybrid training phase, where we alternate between training on images and videos according to a predefined schedule, with the proportion of epochs for each modality fixed in advance.

Embedder freeze and extractor fine-tuning.

To further improve the robustness of our model, we employ a two-stage training process where we first train the entire model to convergence and then fine-tune the extractor while freezing the embedder. This approach allows us to break free from the trade-off between imperceptibility and robustness, as we can focus solely on improving the extractor's performance without affecting the generated watermark. As we show in Sec. 5.2, this yields extra robustness without compromising the watermark imperceptibility.

2.3.3 Transformations

Table 3: List of transformations used during training. A wide range of operations is covered, from valuemetric changes like brightness, contrast and video compressions, to more complex geometric transformations like perspective distortion.
| Transformation | Type | Parameter | Choice at training |
|---|---|---|---|
| Identity | - | - | - |
| Brightness | Valuemetric | from torchvision | Random between 0.5 and 2.0 |
| Contrast | Valuemetric | from torchvision | Random between 0.5 and 2.0 |
| Hue | Valuemetric | from torchvision | Random between -0.5 and 0.5 |
| Saturation | Valuemetric | from torchvision | Random between 0.5 and 2.0 |
| Gaussian blur | Valuemetric | kernel size $k$ | Random odd between 3 and 17 |
| Median filter | Valuemetric | kernel size $k$ | Random odd between 3 and 7 |
| JPEG | Valuemetric | quality $Q$ | Random between 40 and 80 |
| H.264 | Valuemetric | constant rate factor | Random between 9 and 27 |
| Horizontal flip | Geometric | - | - |
| Crop | Geometric | edge size ratio $r$ | Random between 0.7 and 1.0 |
| Resize | Geometric | edge size ratio $r$ | Random between 0.7 and 1.5 |
| Rotation | Geometric | angle $\theta$ | Random between -10 and 10 |
| Perspective | Geometric | distortion scale $d$ | Random between 0.1 and 0.5 |

We use a comprehensive set of transformations during training, detailed in Tab. 3. Most of them are applied at the frame level. We categorize them into two main groups: valuemetric transformations, which change the pixel values, and geometric transformations, which modify the image's geometry; the latter are unfortunately absent from many recent works on both image and video watermarking (Jia et al., 2021; Ma et al., 2022; Ye et al., 2023).

Frame transformations.

We include crop, resize, rotation, perspective, brightness, contrast, hue, saturation, Gaussian blur, median filter, and JPEG compression. The strengths of these transformations are randomly sampled from a predefined range during training, and applied in the same way to all images of the mini-batch. For crop and resize, each new edge size is selected independently, which means that the aspect ratio can change (because the extractor resizes the image). Moreover, an edge size ratio of $0.33$ means that the new area of the image is $0.33^{2}\approx 10\%$ of the original area. For brightness, contrast, saturation, and sharpness, the parameter is the default factor used in the PIL and Torchvision (Marcel and Rodriguez, 2010) libraries. For JPEG, we use the Pillow library.
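As an illustration, frame-level transformations of this kind can be sampled with standard torchvision functionals. The ranges below mirror Tab. 3, while the grouping into a single `random_frame_augment` helper is our own simplification rather than the actual training code.

```python
import random
import torch
import torchvision.transforms.functional as F

def random_frame_augment(x: torch.Tensor) -> torch.Tensor:
    """Apply one randomly chosen valuemetric or geometric transform to a batch
    of frames x of shape (B, 3, H, W) with values in [0, 1]."""
    choice = random.choice(["brightness", "contrast", "saturation", "hue",
                            "blur", "rotation", "crop"])
    if choice == "brightness":
        return F.adjust_brightness(x, random.uniform(0.5, 2.0))
    if choice == "contrast":
        return F.adjust_contrast(x, random.uniform(0.5, 2.0))
    if choice == "saturation":
        return F.adjust_saturation(x, random.uniform(0.5, 2.0))
    if choice == "hue":
        return F.adjust_hue(x, random.uniform(-0.5, 0.5))
    if choice == "blur":
        k = random.choice(range(3, 18, 2))   # random odd kernel size in [3, 17]
        return F.gaussian_blur(x, kernel_size=k)
    if choice == "rotation":
        return F.rotate(x, angle=random.uniform(-10, 10))
    # crop: keep a random fraction of each edge (the extractor later resizes anyway)
    h, w = x.shape[-2:]
    r_h, r_w = random.uniform(0.7, 1.0), random.uniform(0.7, 1.0)
    top = random.randint(0, h - int(r_h * h))
    left = random.randint(0, w - int(r_w * w))
    return F.crop(x, top, left, int(r_h * h), int(r_w * w))
```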

Video transformations.

When applied to videos, frame transformations are applied identically to all frames of the clip. Additionally, we train and evaluate on common video codecs (\eg, H.264, H.265), as implemented in the PyAV wrapper around FFmpeg.
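As a hedged sketch of how such a codec augmentation can be applied with PyAV, the helper below encodes a clip with H.264 at a given CRF and decodes it back. The container format, codec name, and even frame dimensions are assumptions for the example; the \ours codebase may configure PyAV differently.

```python
import io
import av
import numpy as np

def h264_roundtrip(frames: np.ndarray, crf: int = 23, fps: int = 24) -> np.ndarray:
    """Encode a (T, H, W, 3) uint8 clip with H.264 at the given CRF, then decode it back.
    H and W are assumed to be even (required by yuv420p)."""
    buf = io.BytesIO()
    out = av.open(buf, mode="w", format="mp4")
    stream = out.add_stream("libx264", rate=fps)
    stream.height, stream.width = frames.shape[1], frames.shape[2]
    stream.pix_fmt = "yuv420p"
    stream.options = {"crf": str(crf)}
    for frame in frames:
        for packet in stream.encode(av.VideoFrame.from_ndarray(frame, format="rgb24")):
            out.mux(packet)
    for packet in stream.encode():   # flush the encoder
        out.mux(packet)
    out.close()

    buf.seek(0)
    decoded = [f.to_ndarray(format="rgb24") for f in av.open(buf).decode(video=0)]
    return np.stack(decoded)
```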

About non-differentiable transformations.

Non-differentiability, or the lack of backpropagatable implementations in PyTorch, prevents us from backpropagating through video codecs. This poses a challenge since the gradients of the objective function cannot flow through the compression back to the embedder. One common solution is to use a differentiable approximation of the augmentation instead of the real one. For instance, Zhu et al. (2018); Zhang et al. (2023) use a differentiable JPEG compression, and Luo et al. (2023); Shen et al. (2023) use a neural network trained to mimic video codec artifacts. We instead adopt a second option for its ease of implementation and its popularity (Zhang et al., 2021; Ye et al., 2023; Sander et al., 2024): a straight-through estimator that approximates the gradient of the non-differentiable operation with the identity function (Bengio et al., 2013):

$$x_{\textrm{aug}}=x_{w}+\mathrm{nograd}\left(T(x_{w})-x_{w}\right), \qquad (8)$$

where $\mathrm{nograd}$ does not propagate gradients and $T$ is the non-differentiable transformation.
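In PyTorch, this straight-through estimator reduces to a one-liner using `detach()`. The sketch below is a minimal illustration; `h264_compress` stands for any non-differentiable transformation.

```python
def straight_through(x_w, transform):
    """Forward pass applies transform(x_w); backward pass treats it as the identity (Eq. 8)."""
    return x_w + (transform(x_w) - x_w).detach()

# usage: x_aug = straight_through(x_w, h264_compress)
```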

3 Experimental Setup and Implementation Details

3.1 Metrics

Watermarking is subject to a trade-off between imperceptibility, i.e., how much the watermarking degrades the video, and robustness, i.e., how much image or video transformations affect the recovery of the input binary message. We therefore use two main categories of evaluation metrics.

Metrics for image and video quality.

We evaluate the quality of the watermarked videos using per-pixel and perceptual metrics. The PSNR (peak signal-to-noise ratio) measures the difference between the original and watermarked videos in terms of mean squared error (MSE), and is defined as $\mathrm{PSNR}=10\log_{10}\left(255^{2}/\mathrm{MSE}\right)$. SSIM (Wang et al., 2004) (structural similarity index measure) measures the similarity between the original and watermarked videos in terms of luminance, contrast, and structure. LPIPS (Zhang et al., 2018) is better at evaluating how humans perceive similarity; it compares features extracted from the two frames by a pre-trained neural network. On videos, SSIM and LPIPS are computed frame-wise and averaged over the entire video.

The above metrics do not take into account the temporal consistency of the video. VMAF (Netflix, 2016) (video multi-method assessment fusion) is, on the contrary, designed specifically for video quality assessment. It uses a neural network to predict the subjective quality of a video based on various objective metrics such as PSNR, SSIM, and motion vectors.

Metrics for robustness of extraction.

The main metric to evaluate the robustness of the watermarking in a multi-bit setting is the bit accuracy. Given an input message $m\in\{0,1\}^{n_{\text{bits}}}$ and an output message $\hat{m}$, the bit accuracy is defined as the percentage of bits that are correctly decoded, i.e.,

$$\text{bit accuracy}(m,\hat{m})=\frac{1}{n_{\text{bits}}}\sum_{k=1}^{n_{\text{bits}}}\mathbbm{1}_{(m_{k}=\hat{m}_{k})}. \qquad (9)$$

The biggest issue with bit accuracy is that it is agnostic to the number of bits being hidden, and does not account for the total capacity of the watermarking method. For instance, a method with average bit accuracy $p=0.9$ and $n_{\text{bits}}=128$ has a larger total capacity than a method with bit accuracy $p=0.99$ and $n_{\text{bits}}=64$, in an information-theoretic sense (Cover, 1999).²

² Assuming that bit errors are independent Bernoulli variables, a channel with bit accuracy $p$ has per-bit capacity $c(p)=1-\left(-p\log_{2}p-(1-p)\log_{2}(1-p)\right)$ and total capacity $C(p,n_{\text{bits}})=n_{\text{bits}}\cdot c(p)$. For $n_{\text{bits}}=64$ and $p=0.99$, $C(p,n_{\text{bits}})=58.8$; for $n_{\text{bits}}=128$ and $p=0.9$, $C(p,n_{\text{bits}})=68.0$ (see App. 8.1 for more details).

To account for this and to properly compare methods, we thus introduce the $p$-value associated to a given bit accuracy. Given the two messages and the observed $\text{bit accuracy}(m,\hat{m})$, it is defined as the probability of observing, by chance, a bit accuracy greater than or equal to the one obtained. Assuming that the $n_{\text{bits}}$ bits are independent Bernoulli variables with probability $0.5$, it is given by:

$$p\textrm{-value}(m,\hat{m})=\mathbb{P}\big[\text{bit accuracy}(m,m^{\prime})\geq\text{bit accuracy}(m,\hat{m})\mid m^{\prime}\sim\mathcal{B}(0.5)^{n_{\text{bits}}}\big]=\sum_{k\,\geq\, n_{\text{bits}}\cdot\text{bit accuracy}(m,\hat{m})}^{n_{\text{bits}}}\binom{n_{\text{bits}}}{k}\,\frac{1}{2^{n_{\text{bits}}}}. \qquad (10)$$

We report the log $p$-value, denoted $\log_{10}(p)$, which is more interpretable. Given an observed $\text{bit accuracy}(m,\hat{m})$, the $p$-value quantifies how likely such a bit accuracy would be to arise by chance.³ Another way to interpret the $p$-value is to link it to the false positive rate (FPR) when using the watermarking for a detection test. The FPR is the probability of falsely detecting a watermark when there is none. In practice, if we want FPR $<10^{-6}$, we would need to flag the image or video as watermarked only when $\log_{10}(p)<-6$. We refer the interested reader to App. 8.1 for more details.

³ If the $p$-value is $10^{-6}$, it also means that the detection threshold would have to correspond to a false positive rate of $10^{-6}$ in order to flag this image or video as containing a watermark.
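For reference, these metrics can be computed with a few lines of standard-library Python; the helper names below are illustrative, not part of the released codebase.

```python
import math

def bit_accuracy(m, m_hat):
    """Fraction of correctly decoded bits (Eq. 9); m and m_hat are sequences of 0/1."""
    return sum(int(a == b) for a, b in zip(m, m_hat)) / len(m)

def log10_pvalue(acc, n_bits):
    """log10 of the probability of reaching at least `acc` by chance (Eq. 10)."""
    k_min = math.ceil(acc * n_bits)
    tail = sum(math.comb(n_bits, k) for k in range(k_min, n_bits + 1))
    return math.log10(tail) - n_bits * math.log10(2)

def total_capacity(p, n_bits):
    """n_bits * (1 - H(p)) for a memoryless binary channel with bit accuracy p."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return n_bits * (1 - h)

print(round(total_capacity(0.99, 64), 1))   # ~58.8 bits
print(round(total_capacity(0.90, 128), 1))  # ~68.0 bits
```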

3.2 Datasets

We use two main datasets for training and evaluation across video and image domains. For image training, we use the SA-1B dataset (Kirillov et al., 2023), from which we randomly select 500k images resized to $256\times 256$. For evaluation, we use 1k random images at their original resolution (on average $1500\times 2250$). To keep a fair comparison with existing image watermarking models, we also evaluate on 1k images from the COCO validation dataset (Lin et al., 2014), which are of slightly lower resolution.

For video training, we use the SA-V dataset (Ravi et al., 2024), which comprises 51k diverse videos captured across multiple countries, with resolutions ranging from 240p to 4K and an average duration of 14 seconds at 24 fps. We randomly select 1.3-second clips (32 frames) from each video, resized to a resolution of $256\times 256$, while evaluation uses the first 5 seconds at the original resolution, unless stated otherwise.

3.3 Training

We first train the model on 16 GPUs, using the AdamW optimizer (Loshchilov and Hutter, 2018). For the first 800 epochs, we only use images from the SA-1B dataset (see Sec. 3.2 for details on datasets), with a batch size of 16 per GPU and 1500 steps per epoch. The learning rate is linearly increased from $10^{-6}$ to $10^{-5}$ over the first 50 epochs, and then follows a cosine schedule (Loshchilov and Hutter, 2016) down to $10^{-7}$ until epoch 800. For the last 300 epochs, we also use the SA-V dataset, with 200 steps per epoch. We only forward one 32-frame clip per GPU, randomly chosen at every step. The learning rate is linearly increased from $10^{-7}$ to $10^{-6}$ over the first 10 epochs, and then follows a cosine schedule down to $10^{-8}$ until the last epoch. At epoch 250, we freeze the embedder and only optimize the extractor (see Sec. 2.3.2). The objectives are weighted with $\lambda_{\text{w}}=1.0$, $\lambda_{\text{i}}=0.5$, $\lambda_{\text{disc}}=0.1$.
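A minimal sketch of the warmup-then-cosine learning-rate schedule described above, for the image pre-training phase; the exact scheduler implementation used in \ours may differ.

```python
import math

def learning_rate(epoch, warmup=50, total=800, lr_start=1e-6, lr_peak=1e-5, lr_end=1e-7):
    """Linear warmup from lr_start to lr_peak, then cosine decay to lr_end at `total`."""
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))
```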

3.4 Baselines

In the absence of established open-source video watermarking baselines, we leverage state-of-the-art image watermarking models as foundational baselines for video watermarking. HiDDeN (Zhu et al., 2018) is one of the earliest deep-learning watermarking methods; we retrain it on 48 bits with the same augmentations for a fairer comparison. MBRS (Jia et al., 2021) is based on the same architecture, but embeds 256-bit watermarks and is trained with mini-batches of real and simulated JPEG compression. CIN (Ma et al., 2022) combines invertible and non-invertible mechanisms to embed 30-bit watermarks. TrustMark (Bui et al., 2023) also uses a U-Net architecture trained similarly to HiDDeN, embedding 100 bits. Finally, WAM (Sander et al., 2024) embeds 32 bits (in addition to one bit of detection which we do not use in this study), and offers robustness to splicing and inpainting. We use the original open weights for all baselines, except for HiDDeN, for which the authors do not provide weights. \ours operates with $n_{\text{bits}}=96$ and $\alpha_{w}=2.0$, unless stated otherwise.

Inference.

All methods operate at resolution $256\times 256$, except CIN, which operates at $128\times 128$. We extend them to arbitrary resolutions as presented in Sec. 2.2.1 (when a network directly predicts an image $x_{w}$ rather than a watermark distortion $w$, we retrieve the latter as $w=x_{w}-x$). By default, we use the original watermark strength $\alpha_{w}$ of Eq. 1 ($1.0$ for most methods), except in Sec. 4.4 where we study the imperceptibility/robustness trade-off. When evaluating the baselines on videos, we apply the image watermarking model with the same inference strategy as our models, i.e., we compute the watermark distortion every $k=4$ frames and propagate it to the other 3 frames as described in Sec. 2.2.2. For watermark extraction, we aggregate the soft bit predictions across the frames, and average the outputs to retrieve the global message (see Sec. 2.2.3).
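A minimal sketch of this video inference strategy (temporal propagation at embedding, frame-wise averaging at extraction) is given below. The `embedder`/`extractor` interfaces, the bilinear rescaling, and the clamping are assumptions for illustration; they do not reproduce the exact released code.

```python
import torch
import torch.nn.functional as NF

@torch.no_grad()
def embed_video(embedder, video, message, k=4, alpha=1.0, proc_res=256):
    """video: (T, 3, H, W) in [0, 1]. Embed every k-th frame at proc_res,
    upscale the watermark distortion, and reuse it for the next k-1 frames."""
    T, _, H, W = video.shape
    out = video.clone()
    for t in range(0, T, k):
        lowres = NF.interpolate(video[t:t+1], size=(proc_res, proc_res),
                                mode="bilinear", align_corners=False)
        w_low = embedder(lowres, message) - lowres          # watermark distortion
        w = NF.interpolate(w_low, size=(H, W), mode="bilinear", align_corners=False)
        out[t:t+k] = (video[t:t+k] + alpha * w).clamp(0, 1)  # propagate to next frames
    return out

@torch.no_grad()
def extract_video(extractor, video, proc_res=256):
    """Average the soft (logit) predictions over frames, then threshold."""
    lowres = NF.interpolate(video, size=(proc_res, proc_res),
                            mode="bilinear", align_corners=False)
    soft = extractor(lowres)          # assumed to return (T, n_bits) logits
    return (soft.mean(dim=0) > 0).int()
```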

3.5 Evaluated transformations

Figure 4: Examples of transformations used for robustness evaluation, \eg, in Fig. 6 (we show the 20th frame of a 10-second video). Panels from left to right: Original, H.264 (CRF=30), crop (50% area-wise), brightness with factor 0.5, and the combination, chosen as representative of video compression codecs, geometric transformations, and valuemetric transformations, respectively.

We evaluate the robustness of our method to many transformations for different strengths. For simplicity, we aggregate the transformations into five categories: no transformation, geometric, valuemetric, compression, and combined transformations. For instance, geometric transformations include rotations, crops and perspective, while valuemetric transformations include brightness, contrast, and saturation changes, all with different ranges. The combined augmentations are realistic augmentations applied sequentially, \eg, an H.264 compression followed by a crop and a brightness change. We show some examples of these transformations in Fig. 4. Full results and details on which transformations constitute each group are given in App. 9.2.

4 Results

4.1 Robustness

We report in Tab. 4 the robustness of watermark extraction across many transformations and for various models, on the SA-1b and the SA-V datasets. Full results, detailed by transformation type and strength, are available in App. 9.2. We also report results on the COCO dataset, to test the generalization of the models to unseen distributions.

We first observe that many of the image models are already strong baselines for video watermarking (as suggested by Ye et al. (2023), although this seems to be even more the case when working on high-resolution videos). Most of them achieve high bit accuracy both for image and video transformations, even against video codecs. It must be noted that MBRS and CIN were trained with augmentations that do not change the geometry of the image (in particular, the crop considered by MBRS and CIN is simply a black mask applied on the image, keeping the original pixels at their exact location). Therefore, their robustness against valuemetric transformations and video codecs is particularly strong, but their robustness to geometric transformations is particularly weak.

We also observe that \ours is overall the most robust model across transformations, especially against combinations of geometric transformations and video codecs. For instance, under a combined transformation of H.264 compression (CRF=30), brightness adjustment (strength 0.5), and cropping (50% area-wise), \ours achieves $\log_{10}(p)=-6.1$ on average. This means that if one were to use \ours in a detection scenario, most of the transformed watermarked videos would be detected as watermarked, even at low false positive rates ($<10^{-6}$).

Table 4: Evaluation of the watermark robustness for various models. Models hide different numbers of bits; therefore, in addition to the bit accuracy we also report $\log_{10}(p)$, which takes into account $n_{\text{bits}}$ (and reflects that a bit accuracy of $1.0$ for WAM, which hides 32 bits, is different from \ours, which hides 96 bits). Embedding is done either on the SA-1B (image) or the SA-V (video) dataset at their original resolution with the downscaling/upscaling inference trick presented in Sec. 2.2.1. For video, the embedding is done with $k=4$ (see Eq. 2) and extraction is performed on the first 3 s (see Eq. 3). The results are averaged over transformations of different types (more details in App. 9.2).
Each cell reports Bit acc. (higher is better) / $\log_{10}(p)$ (lower is better).

| Dataset | Transformation | HiDDeN | MBRS | CIN | TrustMark | WAM | Video Seal (ours) |
|---|---|---|---|---|---|---|---|
| SA-1B | Identity | 1.00 / -14.2 | 0.99 / -70.6 | 1.00 / -9.0 | 1.00 / -29.9 | 1.00 / -9.6 | 0.99 / -27.3 |
| SA-1B | Valuemetric | 0.88 / -10.8 | 0.95 / -59.8 | 0.91 / -8.1 | 0.98 / -27.4 | 0.95 / -8.7 | 0.93 / -23.4 |
| SA-1B | Geometric | 0.76 / -5.5 | 0.52 / -3.3 | 0.52 / -0.7 | 0.65 / -8.5 | 0.81 / -5.5 | 0.83 / -16.4 |
| SA-1B | Compression | 1.00 / -14.2 | 0.99 / -69.9 | 1.00 / -9.0 | 1.00 / -29.7 | 1.00 / -9.6 | 0.99 / -27.1 |
| SA-1B | Combined | 0.70 / -2.6 | 0.50 / -0.4 | 0.50 / -0.4 | 0.53 / -0.8 | 0.86 / -5.9 | 0.91 / -18.4 |
| SA-V | Identity | 0.99 / -14.0 | 1.00 / -77.1 | 1.00 / -9.0 | 1.00 / -30.1 | 1.00 / -9.6 | 0.99 / -26.8 |
| SA-V | Valuemetric | 0.88 / -9.1 | 0.89 / -54.3 | 0.93 / -7.5 | 0.93 / -24.7 | 0.92 / -7.7 | 0.90 / -19.9 |
| SA-V | Geometric | 0.68 / -2.9 | 0.50 / -0.4 | 0.50 / -0.4 | 0.60 / -5.5 | 0.81 / -5.5 | 0.85 / -17.0 |
| SA-V | Compression | 0.83 / -7.2 | 0.79 / -34.6 | 0.90 / -6.7 | 0.87 / -20.0 | 0.86 / -6.1 | 0.85 / -15.7 |
| SA-V | Combined | 0.61 / -1.3 | 0.50 / -0.4 | 0.49 / -0.4 | 0.51 / -0.5 | 0.55 / -0.8 | 0.73 / -8.1 |

4.2 Imperceptibility

We first show some examples of watermarked images in Fig. 5, and of video frames in App. 9.1. We observe that the watermarks are imperceptible at first glance, but most are visible under close inspection, especially in flat areas, like the skies in both images. Different methods, which employ various perceptual losses and architectures, result in watermarks of distinct characteristics. For instance, MBRS and CIN tend to create grid-like patterns, while TrustMark and \ours tend to create wavier patterns.

Columns, left to right: Original, HiDDeN, MBRS, CIN, TrustMark, WAM, \ours.
Figure 5: Qualitative results for different watermarking methods. Images are from the SA-1b dataset at their original resolution (\approx2k ×\times× 1k), and we show more examples in App. 9.1. Although watermarks are imperceptible at first glance, most are visible under close inspection, especially in the flat areas, like the skies in both images. They are also of very different nature between the methods.

We also quantitatively evaluate the imperceptibility of the watermarking models on the image datasets COCO and SA-1B and on the video dataset SA-V, and report results in Tab. 5. For every baseline, we use its nominal strength (most of the time $\alpha_{w}=1$ in Eq. 1), although it could be adapted to control the imperceptibility/robustness trade-off as done in Sec. 4.4. We report the PSNR, SSIM, and LPIPS between the watermarked and original images of the SA-1B dataset, as well as the same metrics for videos of the SA-V dataset (cut to 5 s), with the addition of VMAF for videos (note that the PSNR is computed over the whole video, and not as an average of the frames as for SSIM and LPIPS). We observe that \ours achieves the highest PSNR and SSIM scores, while MBRS achieves a better VMAF and TrustMark a better LPIPS, closely followed by \ours.

Table 5: Evaluation of the watermark imperceptibility. We report the average PSNR, SSIM, and LPIPS between watermarked and original images of the SA-1b and COCO datasets, as well as the same metrics for videos of the SA-V dataset (cut to 5s), with the addition of VMAF (Netflix, 2016) for videos.
| Metric | HiDDeN | MBRS | CIN | TrustMark | WAM | Video Seal (ours) |
|---|---|---|---|---|---|---|
| Image SSIM (higher is better) | 0.927 | 0.997 | 0.997 | 0.995 | 0.989 | 0.999 |
| Image LPIPS (lower is better) | 0.229 | 0.003 | 0.019 | 0.002 | 0.031 | 0.009 |
| Image PSNR (higher is better) | 30.36 | 45.60 | 44.90 | 42.09 | 39.86 | 47.39 |
| Video SSIM (higher is better) | 0.857 | 0.995 | 0.994 | 0.995 | 0.981 | 0.998 |
| Video LPIPS (lower is better) | 0.362 | 0.008 | 0.032 | 0.003 | 0.047 | 0.013 |
| Video PSNR (higher is better) | 30.19 | 46.55 | 45.80 | 43.07 | 40.72 | 48.02 |
| Video VMAF (higher is better) | 74.61 | 94.10 | 92.93 | 89.36 | 89.78 | 93.77 |

It is important to note that video imperceptibility is not fully captured in these examples and in these metrics. In practice, a watermark that is imperceptible in an image may not necessarily be imperceptible in a video, particularly when the watermark lacks consistency across frames. For instance, we found that TrustMark can produce shadowy artifacts as the watermark tracks the motion of the video, making it more visible. This is less pronounced for \ours, which tends to produce blobs that do not follow objects. However, clear metrics to evaluate this are still lacking, and would require a more comprehensive study on the perception of watermarks in videos. Notably, we observe that even at very high PSNR, SSIM or VMAF, the artifacts produced by \ours may be annoying to the human eye and highly depend on the cover videos.

4.3 Latency

Table 6: Efficiency of watermark embedding and extraction. We report the number of GFlops for embedding and extraction for models at their nominal resolution ($256\times 256$ for all methods but CIN, which is $128\times 128$). Additionally, we report the processing time per second of video for embedding and extraction on CPU and GPU, averaged over 20 videos from the SA-V dataset. We use the video inference framework of Sec. 2.2 to fairly compare all models.
| | HiDDeN (Zhu et al., 2018) | MBRS (Jia et al., 2021) | CIN (Ma et al., 2022) | TrustMark (Bui et al., 2023) | WAM (Sander et al., 2024) | \ours (ours) |
|---|---|---|---|---|---|---|
| Embed GFlops | 22.4 | 32.2 | 16.6 | 10.3 | 42.6 | 42.0 |
| Embed CPU time (s) | 0.67 | 1.99 | 1.04 | 0.64 | 3.64 | 1.14 |
| Embed GPU time (s) | 0.42 | 0.47 | 0.47 | 0.42 | 3.19 | 0.42 |
| Extract GFlops | 39.0 | 27.0 | 17.9 | 4.1 | 68.7 | 3.1 |
| Extract CPU time (s) | 1.64 | 2.31 | 3.49 | 0.41 | 2.52 | 0.69 |
| Extract GPU time (s) | 0.19 | 0.29 | 0.77 | 0.11 | 0.47 | 0.11 |

We evaluate the latency of \ours compared to the image watermarking models repurposed for video watermarking. We use the video inference framework introduced in Sec. 2.2, with the downscale/upscale of the watermark signal and temporal watermark propagation with $k=4$, to ensure a fair comparison and to see whether the inference efficiency generalizes in the same way across models. Each model was compiled using TorchScript to optimize performance. Experiments are conducted on video clips from the SA-V dataset (full length, with durations ranging from 10 to 24 seconds), with 2 Intel(R) Xeon(R) 6230 @ 2.10GHz and 480GB of RAM as CPU, and (optionally) a Tesla V100-SXM2-16GB as GPU. We evaluate the time needed for embedding and extraction in two scenarios: using only the CPU, and using both the CPU and GPU (we do not account for video loading and saving times).

We report the GFlops and time in seconds for both CPU and GPU configurations in Tab. 6. The GFlops required for embedding are consistent across models within a range of 10 to 43, while the GFlops required for extraction vary more widely from 3 to 69. In terms of GPU time, WAM is the slowest at embedding because it uses a heatmap to attenuate the watermark, which is computationally expensive at high resolution (high resolution images are never sent to the GPU to reduce memory constraints, so the compute of the heatmap is done on the CPU). The other models are much faster (around 0.5-2 seconds on CPU), but quite similar to each other. On GPU in particular, the transfer time from CPU to GPU and the CPU operations on high-resolution videos seem to be the bottleneck. For extraction, all the models are in the same ballpark.

4.4 Imperceptibility/Robustness trade-off

We previously reported the robustness and imperceptibility of the watermarking models at their nominal strength. In practice, one may want to adapt the strength $\alpha_{w}$ to control the imperceptibility/robustness trade-off. We investigate this trade-off by varying the strength of the watermark for each model. We report in Fig. 6 the bit accuracy and $\log_{10}(p)$ for various models, under different transformations, against the VMAF between the watermarked and the original videos. This is done on 3-second clips from SA-V. We observe that MBRS and TrustMark obtain higher values of $-\log_{10}(p)$ over a good range of VMAF since they hide more bits (256 and 100, respectively). However, these methods fall short on more challenging transformations, especially when combining geometric transformations and video compression, where \ours achieves higher robustness, in particular at very high PSNR ($>50$ dB) or VMAF ($>94$).

Figure 6: Robustness/quality trade-off across transformations for various models on 5s videos from SA-V. We compare the performance of six watermarking methods under H.264 compression (CRF=30), brightness adjustment (strength 0.5), cropping (50% area-wise), and the combination of the three transformations. (MBRS and CIN are penalized because of their lack of robustness to geometric operations.) For each transformation type, we report (top) the bit accuracy and (bottom) $-\log_{10}(p)$, which accounts for the total number of bits, against the VMAF between the watermarked and the original videos. \ours achieves higher robustness compared to the baselines, especially under challenging transformations combining geometric transformations and video compression.

5 Ablation Studies

5.1 Video training

In this section, we investigate whether training a video watermarking model with frame propagation and differentiable video compression is beneficial, or if applying an image watermarking model to videos during inference is sufficient. We also investigate if it is beneficial to pre-train on images and then to continue training on a mix of images and videos. This could potentially leverage the faster training times of image-based models while adapting to video-specific transformations.

To test this we design three main scenarios:

  1. Image-only training, where the model is trained solely on images;
  2. Video-only training, where the model is trained exclusively on videos;
  3. Mixed training, where the model is first pre-trained on images and then further trained on a mix of images and videos using a scheduled approach.

When video training is activated, we further explore two sub-cases:

  A. With all augmentations, including video compression;
  B. Without video compression augmentations.

This allows us to isolate the impact of video compression on the training process, as opposed to relying solely on differentiable frame propagation of the watermark. We report the mean bit accuracy over different compressions and the PSNR during training, across multiple seeds for each experiment. In this experiment, $n_{\text{bits}}=16$ to facilitate training and focus on the impact of video training. During the video training phase, we employ a balanced schedule, alternating between images and videos with a 1:1 ratio (i.e., one epoch for images followed by one epoch for videos); in our experiments, we found that this helps stabilize training.
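As an illustration, such an alternation can be implemented by simply switching the dataloader per epoch; the ratio, epoch counts, and loader names below are assumptions rather than the exact training configuration.

```python
def pick_modality(epoch, image_pretrain_epochs=800, ratio=(1, 1)):
    """Return "image" during pre-training, then alternate image/video epochs 1:1."""
    if epoch < image_pretrain_epochs:
        return "image"
    cycle = (epoch - image_pretrain_epochs) % sum(ratio)
    return "image" if cycle < ratio[0] else "video"

# training loop sketch:
# for epoch in range(total_epochs):
#     loader = image_loader if pick_modality(epoch) == "image" else video_loader
#     train_one_epoch(model, loader)
```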

Figure 7: Video training with compression augmentation after image pre-training (100-200 epochs) yields the most successful training regimen, rapidly increasing bit accuracy, especially under stronger compression (CRF 50-60), without sacrificing PSNR. This approach outperforms video-only training, which appears insufficient for stable training, demonstrating the effectiveness of our mixed approach with image pre-training for the model optimization.

The results, as shown in Fig. 7, reveal that the most effective combination involves pre-training on images, followed by video training with compression augmentation. This approach yields significant improvements in bit accuracy, particularly at higher compression rates. Notably, when video training commences (epoch 100 or 200) after image pre-training, the bit accuracy increases rapidly, especially for stronger compressions (CRF 40 and 50). This suggests that the incorporation of differentiable compression augmentation provides a robust optimization signal to the model. Furthermore, this improvement in robustness does not come at the cost of lower PSNR values compared to other ablations, underscoring the effectiveness of the proposed approach.

In contrast, video training alone without image pre-training proves ineffective, resulting in a very low bit accuracy. This highlights the importance of the mixed approach, which leverages image pre-training to initialize the network before training on videos. The scheduled training strategy employed in this study demonstrates the benefits of combining the efficiency of image-based models with the adaptability to video-specific transformations afforded by video training.

5.2 Extractor fine-tuning

Figure 8: Extractor fine-tuning results. Fine-tuning boosts the average training bit accuracy (top-left), bit accuracy on H.264 (CRF=30) (top-right), and on a combined augmentation with H.264, crop and brightness change (bottom-left), without influencing the PSNR (bottom-right), as the generated watermark remains unchanged. All models are trained to convergence for 1000 epochs, followed by 200 epochs of fine-tuning (red dotted line).

In this section, we investigate the impact of fine-tuning the watermark extractor while freezing the generator, as a way to break free from the trade-off between imperceptibility and robustness. We expect fine-tuning to provide additional gains in bit accuracy for some models, particularly for augmentations that were not seen enough during training, or for models that have not fully converged. To investigate this, we train multiple models with varying parameters, including the number of bits (64 and 128) and the video training start epoch (200, 500, and 1000). We train all models to convergence for 1000 epochs, then freeze the generator and fine-tune the extractor for an additional 200 epochs. We then compare two scenarios:

  1. Training and fine-tuning with compression augmentations throughout;
  2. Training on lightweight augmentations only, leaving compression augmentations, and thus robustness to compression, to fine-tuning time.

The rationale behind scenario 2 is that compression augmentations introduce instabilities in training, due to the slow compression times and the small batch sizes needed to fit in memory. We therefore investigate the benefits of introducing the compression augmentations only once the embedder is frozen.

Our results, shown in Fig. 8, indicate that fine-tuning yields extra gains in the average bit accuracy overall, without compromising the PSNR. Fine-tuning can therefore be a viable way to enhance the robustness of the extractor without suffering from the imperceptibility/robustness trade-off. Interestingly, our results also show no significant difference between pre-training with or without compression augmentations; in fact, they suggest that it is better to start with all augmentations, including compression, from the beginning.

5.3 Video inference parameters

Step-size at embedding time.

To efficiently embed the watermark in videos, we use the temporal propagation presented in Sec. 2.2.2: we embed the watermark every $k$ frames, where $k$ is the step-size, and copy the watermark distortion onto the next frames. We investigate the impact of the step-size on the watermark robustness and on the embedding speed. We report the bit accuracy on the same combined augmentation as in Fig. 6, i.e., an H.264 compression with CRF=30, a crop removing half of the video, and a brightness change, as well as the time taken to embed the watermark on both CPU and GPU. We observe that increasing the step-size $k$ does not significantly impact the watermark robustness, while it greatly speeds up the embedding. However, it empirically introduces shadowy or flickering artifacts in the video. Therefore, the step-size should be kept small enough that the watermark remains imperceptible when the video moves fast (\eg, $k=4$ in our experiments). We leave the exploration of more advanced temporal propagation techniques for future work.

Number of frames at extraction time.

At extraction time, we predict a soft message for each frame $i\in[1,T]$ and aggregate them into a single message. We investigate the impact of the number of frames $T$ on the watermark extraction performance and on the speed of the extraction, with the same setup as in the previous ablation. As shown in Fig. 9, the number of frames $T$ has a more significant impact on both the extraction performance and the extraction speed. Notably, the bit accuracy increases with the number of frames, as the model has more information to predict the binary message.

(a) Step-size.
(b) Number of frames.
Figure 9: Ablation study on the step-size at embedding time and the number of frames at extraction time. Embedding and extraction are done on 5s clips. The reported bit accuracy is on the same combined augmentation as in Fig. 6, i.e., H.264, crop and brightness change. We observe that the step-size $k$ in the temporal propagation does not significantly impact the watermark robustness, while greatly increasing the speed of the embedding, although it sometimes introduces shadowy or glittery artifacts in the video. The number of frames $T$ at extraction time has a more significant impact on both the watermark extraction performance and the speed of the extraction.
Aggregation at extraction time.

As previously stated, the extractor predicts one soft message $\tilde{\mathbf{m}}_{i}$ per frame $i$, which is aggregated into a single message for the entire video. By default, the aggregation averages all the messages bit-wise, as explained in Eq. 3. We experimentally observed that when the extractor predicts a logit $\tilde{\mathbf{m}}_{i,k}$ for a given frame $i$ and bit $k$, the logit is likely to be higher for the correct bit than for the incorrect ones. We therefore investigate the impact of different aggregation methods on the watermark extraction performance. We define the following ones:

  • Average, the default method, which averages the messages bit-wise: $\tilde{m}_{k}=\frac{1}{T}\sum_{i=1}^{T}\tilde{\mathbf{m}}_{i,k}$.
  • Squared average, which rescales each bit by its absolute value before averaging: $\tilde{m}_{k}=\frac{1}{T}\sum_{i=1}^{T}|\tilde{\mathbf{m}}_{i,k}|\,\tilde{\mathbf{m}}_{i,k}$.
  • L1 average, which rescales each frame's logits by their L1 norm before averaging: $\tilde{m}_{k}=\frac{1}{T}\sum_{i=1}^{T}\|\tilde{\mathbf{m}}_{i}\|_{1}\,\tilde{\mathbf{m}}_{i,k}$.
  • L2 average, which rescales each frame's logits by their L2 norm before averaging: $\tilde{m}_{k}=\frac{1}{T}\sum_{i=1}^{T}\|\tilde{\mathbf{m}}_{i}\|_{2}\,\tilde{\mathbf{m}}_{i,k}$.

The final bit at position $k$ is then obtained by thresholding: $\hat{m}_{k}=\mathbbm{1}_{\tilde{m}_{k}>0}$.
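A compact PyTorch sketch of these aggregation variants, operating on per-frame logits of shape (T, n_bits); the helper is illustrative only.

```python
import torch

def aggregate(logits: torch.Tensor, mode: str = "avg") -> torch.Tensor:
    """logits: (T, n_bits) frame-wise soft messages; returns hard bits of shape (n_bits,)."""
    if mode == "avg":
        soft = logits.mean(dim=0)
    elif mode == "squared_avg":
        soft = (logits.abs() * logits).mean(dim=0)
    elif mode == "l1_avg":
        soft = (logits.norm(p=1, dim=1, keepdim=True) * logits).mean(dim=0)
    elif mode == "l2_avg":
        soft = (logits.norm(p=2, dim=1, keepdim=True) * logits).mean(dim=0)
    else:
        raise ValueError(mode)
    return (soft > 0).int()   # threshold the aggregated soft message
```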

We report in Tab. 7 the bit accuracy and $\log_{10}(p)$ for the different aggregation methods. The experimental setup is the same as in Sec. 4.4, i.e., we watermark 3 s videos from the SA-V dataset and run the extraction on the entire clip. The bit accuracy and $\log_{10}(p)$ are similar across the different methods, with a small improvement for the "L1 average", but not significant enough to justify the increased complexity.

Table 7: Ablation study on the aggregation method for watermark extraction on video. We use the same setup as in Sec. 4.4, i.e., 100 3 s videos from the SA-V dataset. Identity, Valuemetric, Geometric, Compression, and Combined refer to the different types of transformations applied before extraction, over which the bit accuracy and $\log_{10}(p)$ are averaged. We observe that the aggregation method does not significantly impact the watermark extraction performance.
Each cell reports Bit acc. / $\log_{10}(p)$.

| Aggregation | Identity | Valuemetric | Geometric | Compression | Combined |
|---|---|---|---|---|---|
| Avg | 0.992 / -27.6 | 0.904 / -20.6 | 0.863 / -18.3 | 0.837 / -15.3 | 0.730 / -8.6 |
| L1 avg | 0.994 / -27.9 | 0.908 / -21.1 | 0.867 / -18.6 | 0.844 / -16.0 | 0.742 / -9.3 |
| L2 avg | 0.993 / -27.7 | 0.907 / -20.8 | 0.861 / -17.8 | 0.842 / -15.7 | 0.742 / -9.2 |
| Squared avg | 0.989 / -27.2 | 0.906 / -20.5 | 0.857 / -17.5 | 0.843 / -15.6 | 0.742 / -9.2 |

6 Related Work

Traditional video watermarking

operates within the framework of video codecs like H.264/AVC and HEVC which utilize entropy coding and motion estimation as part of their compression techniques. They can be broadly categorized into two main approaches. The first approach involves exploiting the Reversible Variable Length Codes (RVLC), which are a type of entropy coding used in video compression to represent frequently occurring symbols with shorter codes. In RVLC-based watermarking (Biswas et al., 2005; Noorkami and Mersereau, 2007; Mobasseri and Cinalli, 2004), the watermark is embedded by modifying the variable length codes in a way that is reversible, meaning the original video content can be restored after extraction of the watermark. The second approach (Mohaghegh and Fatemi, 2008) focuses on manipulating motion vectors, which are used to describe the movement of objects or blocks between frames in a video sequence. In motion vector-based watermarking, the watermark is embedded by slightly altering the motion vectors, typically those with larger magnitudes, in a way that is imperceptible to the human eye.

Deep-learning-based video watermarking.

Early work on deep learning-based video watermarking models, such as VStegNet (Mishra et al., 2019) and RivaGan (Zhang et al., 2019), have been proposed to address the limitations of traditional methods. VStegNet introduced a deep learning architecture that achieves high embedding capacity and visual quality but lacks robustness to video distortions or compression. In contrast, RivaGan employed a GAN training architecture with an attention-based mechanism and adversarial networks to optimize for robustness. However, its use of 4D video tensors raises concerns about efficiency and usability. To simulate non-differentiable compression algorithms, RivaGan incorporated a noise layer mimicking H.264 compression using Discrete Cosine Transform (DCT). While RivaGan’s open-sourced training code is available, the trained models are not, making comparisons challenging. Weng et al. (2019) is mostly concerned with video steganography. It focuses on hiding data in the less complex inter-frame residuals rather than directly within the more dense video frames. This work also does not consider robustness to distortions.

DVMark (Luo et al., 2023) enhances robustness in video watermarking through a multiscale design in both the encoder and decoder. This approach embeds messages across multiple spatio-temporal scales, resulting in improved robustness compared to single-scale networks. The model operates on 4D video tensors and can support variable resolutions, similar to Zhu et al. (2018), without requiring downsampling or upsampling. However, this raises concerns about its efficiency and usability in practice, particularly for long videos. To address the challenge of compression, DVMark and VHNet (Shen et al., 2023) introduce a trainable CompressionNet that simulates video compression. This allows their networks to be optimized for robustness to compression in a differentiable way. Other approaches include REVMark (Zhang et al., 2023), which also uses a differentiable approximation of H.264 to simulate video compression and achieves robust watermarking for 128$\times$128 videos with a 96-bit payload; the works of Zhang et al. (2024b) and Chang et al. (2024), which apply deep watermarking in the frequency domain using the DCT and the Dual-Tree Complex Wavelet Transform (DT-CWT), respectively; and RIVIE (Jia et al., 2022), which simulates real-world camera imaging distortions and adds temporal loss functions and a distortion network. Lastly, V2A-Mark (Zhang et al., 2024a) embeds two watermarks, one for tamper localization and the other to hide a 32-bit payload, but it does not report any results on geometric transformations.

ItoV (Ye et al., 2023) is the most similar to our work. It adapts image watermarking architectures to process videos by merging the temporal dimension with the channel dimension, allowing 2D CNNs to treat videos as images. This approach aims to reduce computational resources and leverage faster convergence speeds compared to 3D CNNs. However, it still requires feeding the entire video at once, raising questions about its efficiency and ability to handle longer videos. Notably, ItoV employs a skip gradient trick to enable direct training on video codec augmentations, achieving good robustness against H.264 compressions at CRF=22. However, the lack of reproducibility assets limits further assessment of its robustness.

Image watermarking

has also been a long-standing research topic, very much intertwined with video watermarking. Early works date back to the spatial domain methods of Van Schyndel et al. (1994), Nikolaidis and Pitas (1998), Bas et al. (2002), as well as to the ones applying the watermark in a frequency domain such as DFT (Urvoy et al., 2014), QFT (Ouyang et al., 2015), DCT (Bors and Pitas, 1996; Piva et al., 1997; Barni et al., 1998), and DWT (Xia et al., 1998; Barni et al., 2001; Furon and Bas, 2008). The focus has since then shifted towards deep learning, pioneered by HiDDeN (Zhu et al., 2018), which has been extended by the incorporation of adversarial training (Luo et al., 2020), attention filters (Zhang et al., 2020; Yu, 2020), robust optimization (Wen and Aydore, 2019) or invertible networks (Ma et al., 2022; Fang et al., 2023). More recent works include new features such as the option to embed the watermark at any resolution (Bui et al., 2023), robustness to diffusion purification (Pan et al., 2024) or localized extraction of one or several messages from the same image (Sander et al., 2024). A parallel line of research has recently emerged, focusing on watermarking specific to AI-generated content (Yu, 2020; Yu et al., 2021), with notable works including Stable Signature (Fernandez et al., 2023), Tree-Ring (Wen et al., 2023), and their follow-ups (Kim et al., 2023; Hong et al., 2024; Ci et al., 2024). These methods aim to embed watermarks during the generation process, often providing a more robust and/or secure way to track AI-generated content. On the other hand, \ours is post-hoc, meaning that, to apply it to AI-generated content, we would need to watermark after the generation, making it more flexible, but also less secure, \eg, in the case of open-sourcing the generative model.

7 Conclusion

In this paper, we introduce \ours, a comprehensive and efficient framework for video watermarking. Our work addresses the need for robust, efficient, and flexible watermarking solutions that comes with the increasing accessibility of video generative models and sophisticated video editing tools. It provides a strong open foundation for researchers and practitioners to test and iterate on. It also highlights some open challenges of video watermarking, for instance the need for better metrics (Mantiuk et al., 2024) to evaluate imperceptibility and for better training objectives targeting it. Future work could focus on ensuring visual consistency across watermarked frames, embedding in a domain better suited for video compression (\eg, YUV or YCbCr), increasing the payload and the robustness of the watermarks, as well as exploring the security of the framework.

References

  • Chi (2023) Chinese ai governance rules, 2023. http://www.cac.gov.cn/2023-07/13/c_1690898327029107.htm. Accessed on August 29, 2023.
  • Eur (2023) European ai act, 2023. https://artificialintelligenceact.eu/. Accessed on August 29, 2023.
  • Alliance for Open Media (2018) Alliance for Open Media. Av1 bitstream & decoding process specification, 2018. https://aomediacodec.github.io/av1-spec/av1-spec.pdf.
  • Barni et al. (1998) Mauro Barni, Franco Bartolini, Vito Cappellini, and Alessandro Piva. A dct-domain system for robust image watermarking. Signal processing, 66(3):357–372, 1998.
  • Barni et al. (2001) Mauro Barni, Franco Bartolini, and Alessandro Piva. Improved wavelet-based watermarking through pixel-wise masking. IEEE transactions on image processing, 10(5):783–791, 2001.
  • Bas et al. (2002) Patrick Bas, J-M Chassery, and Benoit Macq. Geometrically invariant watermarking using feature points. IEEE transactions on image Processing, 11(9):1014–1028, 2002.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Biswas et al. (2005) Satyendra Biswas, Sunil R Das, and Emil M Petriu. An adaptive compressed mpeg-2 video watermarking scheme. IEEE transactions on Instrumentation and Measurement, 54(5):1853–1861, 2005.
  • Bors and Pitas (1996) Adrian G Bors and Ioannis Pitas. Image watermarking using dct domain constraints. In ICIP, 1996.
  • Brooks et al. (2024) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. https://openai.com/research/video-generation-models-as-world-simulators.
  • Bui et al. (2023) Tu Bui, Shruti Agarwal, and John Collomosse. Trustmark: Universal watermarking for arbitrary resolution images. arXiv preprint arXiv:2311.18297, 2023.
  • California State Leg. (2024) California State Leg. Amendment to california assembly bill 3211. California State Legislature, April 2024. https://legiscan.com/CA/text/AB3211/id/2984195. Amended in Assembly.
  • Chang et al. (2024) Xuanming Chang, Beijing Chen, Weiping Ding, and Xin Liao. A dnn robust video watermarking method in dual-tree complex wavelet transform domain. Journal of Information Security and Applications, 85:103868, 2024.
  • Ci et al. (2024) Hai Ci, Pei Yang, Yiren Song, and Mike Zheng Shou. Ringid: Rethinking tree-ring watermarking for enhanced multi-key identification. arXiv preprint arXiv:2404.14055, 2024.
  • Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • Czolbe et al. (2020) Steffen Czolbe, Oswin Krause, Ingemar Cox, and Christian Igel. A loss function for generative neural networks based on watson’s perceptual model. Advances in Neural Information Processing Systems, 33:2051–2061, 2020.
  • Défossez et al. (2022) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
  • Dosovitskiy (2020) Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Elfwing et al. (2018) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
  • Fang et al. (2023) Han Fang, Yupeng Qiu, Kejiang Chen, Jiyi Zhang, Weiming Zhang, and Ee-Chien Chang. Flow-based robust watermarking with invertible noise layer for black-box distortions. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 5054–5061, 2023.
  • Fernandez et al. (2023) Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. In International Conference on Computer Vision, pages 22466–22477, 2023.
  • Furon and Bas (2008) Teddy Furon and Patrick Bas. Broken arrows. EURASIP Journal on Information Security, 2008:1–13, 2008.
  • Hong et al. (2024) Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, and Se Young Chun. On exact inversion of dpm-solvers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7069–7078, 2024.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • Jia et al. (2022) Jun Jia, Zhongpai Gao, Dandan Zhu, Xiongkuo Min, Menghan Hu, and Guangtao Zhai. Rivie: Robust inherent video information embedding. IEEE Transactions on Multimedia, 25:7364–7377, 2022.
  • Jia et al. (2021) Zhaoyang Jia, Han Fang, and Weiming Zhang. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 41–49, 2021.
  • Kim et al. (2023) Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, and Yezhou Yang. Wouaf: Weight modulation for user attribution and fingerprinting in text-to-image diffusion models. arXiv preprint arXiv:2306.04744, 2023.
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Lim and Ye (2017) Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Loshchilov and Hutter (2018) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
  • Luo et al. (2020) Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, and Peyman Milanfar. Distortion agnostic deep watermarking. In CVPR, 2020.
  • Luo et al. (2023) Xiyang Luo, Yinxiao Li, Huiwen Chang, Ce Liu, Peyman Milanfar, and Feng Yang. DVMark: A deep multiscale framework for video watermarking. IEEE Transactions on Image Processing, 2023.
  • Ma et al. (2022) Rui Ma, Mengxi Guo, Yi Hou, Fan Yang, Yuan Li, Huizhu Jia, and Xiaodong Xie. Towards blind watermarking: Combining invertible and non-invertible mechanisms. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1532–1542, 2022.
  • Mantiuk et al. (2024) Rafal K Mantiuk, Param Hanji, Maliha Ashraf, Yuta Asano, and Alexandre Chapiro. Colorvideovdp: A visual difference predictor for image, video and display distortions. arXiv preprint arXiv:2401.11485, 2024.
  • Marcel and Rodriguez (2010) Sébastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In International Conference on Multimedia. ACM, 2010.
  • Mishra et al. (2019) Aayush Mishra, Suraj Kumar, Aditya Nigam, and Saiful Islam. Vstegnet: Video steganography network using spatio-temporal features and micro-bottleneck. In The British Machine Vision Conference, page 274, 2019.
  • Mobasseri and Cinalli (2004) Bijan G Mobasseri and Domenick Cinalli. Reversible watermarking using two-way decodable codes. In Security, Steganography, and Watermarking of Multimedia Contents VI, volume 5306, pages 397–404. SPIE, 2004.
  • Mohaghegh and Fatemi (2008) Najla Mohaghegh and Omid Fatemi. H.264 copyright protection with motion vector watermarking. In 2008 International Conference on Audio, Language and Image Processing, pages 1384–1389. IEEE, 2008.
  • Netflix (2016) Netflix. Vmaf - video multi-method assessment fusion. https://github.com/Netflix/vmaf, 2016. Accessed: 2024-11-18.
  • Nikolaidis and Pitas (1998) Nikos Nikolaidis and Ioannis Pitas. Robust image watermarking in the spatial domain. Signal processing, 1998.
  • Noorkami and Mersereau (2007) Maneli Noorkami and Russell M Mersereau. A framework for robust watermarking of H.264-encoded video with controllable detection performance. IEEE Transactions on Information Forensics and Security, 2(1):14–23, 2007.
  • Odena et al. (2016) Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
  • Ouyang et al. (2015) Junlin Ouyang, Gouenou Coatrieux, Beijing Chen, and Huazhong Shu. Color image watermarking based on quaternion fourier transform and improved uniform log-polar mapping. Computers & Electrical Engineering, 2015.
  • Pan et al. (2024) Minzhou Pan, Yi Zeng, Xue Lin, Ning Yu, Cho-Jui Hsieh, Peter Henderson, and Ruoxi Jia. Jigmark: A black-box approach for enhancing image watermarks against diffusion model edits. arXiv preprint arXiv:2406.03720, 2024.
  • Piva et al. (1997) Alessandro Piva, Mauro Barni, Franco Bartolini, and Vito Cappellini. DCT-based watermark recovering without resorting to the uncorrupted original image. In Proceedings of International Conference on Image Processing, volume 1, pages 520–523. IEEE, 1997.
  • Polyak et al. (2024) Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Ravi et al. (2024) Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. https://arxiv.org/abs/2408.00714.
  • Richardson (2010) Iain E. Richardson. The H.264 Advanced Video Compression Standard. John Wiley & Sons, 2nd edition, 2010.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • San Roman et al. (2024) Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Défossez, Teddy Furon, and Tuan Tran. Proactive detection of voice cloning with localized watermarking. In International Conference on Machine Learning, volume 235, 2024.
  • Sander et al. (2024) Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, and Matthijs Douze. Watermark anything with localized messages. arXiv preprint arXiv:2411.07231, 2024.
  • Shen et al. (2023) Xiaofeng Shen, Heng Yao, Shunquan Tan, and Chuan Qin. Vhnet: A video hiding network with robustness to video coding. Journal of Information Security and Applications, 75:103515, 2023.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  • Urvoy et al. (2014) Matthieu Urvoy, Dalila Goudia, and Florent Autrusseau. Perceptual DFT watermarking with improved detection and robustness to geometrical distortions. IEEE Transactions on Information Forensics and Security, 2014.
  • USA (2023) USA. Ensuring safe, secure, and trustworthy AI. https://www.whitehouse.gov/wp-content/uploads/2023/07/Ensuring-Safe-Secure-and-Trustworthy-AI.pdf, July 2023. Accessed: July 2023.
  • Van Schyndel et al. (1994) Ron G Van Schyndel, Andrew Z Tirkel, and Charles F Osborne. A digital watermark. In Proceedings of 1st international conference on image processing, volume 2, pages 86–90. IEEE, 1994.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wen and Aydore (2019) Bingyang Wen and Sergul Aydore. Romark: A robust watermarking system using adversarial training. arXiv preprint arXiv:1910.01221, 2019.
  • Wen et al. (2023) Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. arXiv preprint arXiv:2305.20030, 2023.
  • Weng et al. (2019) Xinyu Weng, Yongzhi Li, Lu Chi, and Yadong Mu. High-capacity convolutional video steganography with temporal residual modeling. In Proceedings of the 2019 on international conference on multimedia retrieval, pages 87–95, 2019.
  • Xia et al. (1998) Xiang-Gen Xia, Charles G Boncelet, and Gonzalo R Arce. Wavelet transform based watermark for digital images. Optics Express, 1998.
  • Xian et al. (2024) Xun Xian, Ganghua Wang, Xuan Bi, Jayanth Srinivasa, Ashish Kundu, Mingyi Hong, and Jie Ding. Raw: A robust and agile plug-and-play watermark framework for ai-generated images with provable guarantees. arXiv preprint arXiv:2403.18774, 2024.
  • Ye et al. (2023) Guanhui Ye, Jiashi Gao, Yuchen Wang, Liyan Song, and Xuetao Wei. ItoV: Efficiently adapting deep learning-based image watermarking to video watermarking. In 2023 International Conference on Culture-Oriented Science and Technology (CoST), pages 192–197. IEEE, 2023.
  • Yu (2020) Chong Yu. Attention based data hiding with generative adversarial networks. In AAAI, 2020.
  • Yu et al. (2021) Ning Yu, Vladislav Skripniuk, Dingfan Chen, Larry S Davis, and Mario Fritz. Responsible disclosure of generative models using scalable fingerprinting. In International Conference on Learning Representations, 2021.
  • Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. (2021) Chaoning Zhang, Adil Karjauv, Philipp Benz, and In So Kweon. Towards robust deep hiding under non-differentiable distortions for practical blind watermarking. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5158–5166, 2021.
  • Zhang et al. (2020) Honglei Zhang, Hu Wang, Yuanzhouhan Cao, Chunhua Shen, and Yidong Li. Robust watermarking using inverse gradient attention. arXiv preprint arXiv:2011.10850, 2020.
  • Zhang et al. (2019) Kevin Alex Zhang, Lei Xu, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Robust invisible video watermarking with attention. arXiv preprint arXiv:1909.01285, 2019.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • Zhang et al. (2024a) Xuanyu Zhang, Youmin Xu, Runyi Li, Jiwen Yu, Weiqi Li, Zhipei Xu, and Jian Zhang. V2A-Mark: Versatile deep visual-audio watermarking for manipulation localization and copyright protection. arXiv preprint arXiv:2404.16824, 2024a.
  • Zhang et al. (2023) Yulin Zhang, Jiangqun Ni, Wenkang Su, and Xin Liao. A novel deep video watermarking framework with enhanced robustness to H.264/AVC compression. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8095–8104, 2023.
  • Zhang et al. (2024b) Zhiwei Zhang, Han Wang, Guisong Wang, and Xinxiao Wu. Hide and track: Towards blind video watermarking network in frequency domain. Neurocomputing, 579:127435, 2024b.
  • Zhu et al. (2018) Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 657–672, 2018.
Appendix

8 Theoretical Analyses

8.1 Comparing at different payloads

We consider a binary message $m \in \{0,1\}^{n_\text{bits}}$ and its estimate $\hat{m}$ obtained after watermark embedding, editing, and watermark extraction. The quality of this transmission is measured by the bit accuracy $\text{bit acc.}(m, \hat{m})$, which does not take the payload $n_\text{bits}$ into account. We therefore introduce two metrics that make it possible to compare models operating at different payloads $n_\text{bits}$.

We model each bit as a binary symmetric channel (BSC) with error probability $p$. Its entropy is $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$, and its capacity is $c(p) = 1 - h(p)$. With $n_\text{bits}$ such channels, the total capacity is $c(p) \times n_\text{bits}$. In our case, given an observed bit accuracy $\text{bit acc.}(m, \hat{m})$, we assume that each bit is a BSC with error probability $p = 1 - \text{bit acc.}(m, \hat{m})$. We define the expected capacity as:

C(p)=nbits×(1(plog2p(1p)log2p)),C(p)=n_{\text{bits}}\times\left(1-\left(-p\log_{2}p-(1-p)\log_{2}p\right)% \right),italic_C ( italic_p ) = italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT × ( 1 - ( - italic_p roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_p - ( 1 - italic_p ) roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_p ) ) , (11)

where $p = 1 - \text{bit acc.}(m, \hat{m})$. It represents the number of bits that could theoretically be transmitted, from a Shannon perspective, if the observed bit accuracy were the true per-bit probability of correct transmission.
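As a concrete illustration, the sketch below (our own helper, not part of the released codebase) computes the expected capacity of Eq. (11) from an observed bit accuracy.

```python
import math

def expected_capacity(bit_accuracy: float, n_bits: int) -> float:
    """Expected capacity of Eq. (11): n_bits * (1 - h(p)), with p the per-bit error rate."""
    p = 1.0 - bit_accuracy                 # error probability of the assumed BSC
    if p in (0.0, 1.0):                    # h(0) = h(1) = 0 by convention
        return float(n_bits)
    h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)  # binary entropy
    return n_bits * (1.0 - h)

print(expected_capacity(0.9, 64))          # ~34 effective bits for a 64-bit payload at 90% accuracy
```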

Another way to approach the problem is as a statistical detection test. We consider the null hypothesis $H_0$, under which each bit of the output binary message $\hat{m}$ is independent and distributed as a Bernoulli variable with success probability $0.5$, and the alternative hypothesis $H_1$, under which $\hat{m} = m$. Given an observed bit accuracy $\text{bit acc.}(m, \hat{m})$, the $p$-value is the probability of observing a bit accuracy at least as high under the null hypothesis. It is given by the tail of the binomial distribution:

p-value(m,m^)=knbitspnbits(nbitsk)1/2nbits=I1/2(nbitsp,nbits(1p)+1),p\textrm{-value}(m,\hat{m})=\sum_{k\geq n_{\text{bits}}p}^{n_{\text{bits}}}% \binom{n_{\text{bits}}}{k}1/2^{n_{\text{bits}}}=I_{1/2}(n_{\text{bits}}\,p,n_{% \text{bits}}\,(1-p)+1),italic_p -value ( italic_m , over^ start_ARG italic_m end_ARG ) = ∑ start_POSTSUBSCRIPT italic_k ≥ italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( FRACOP start_ARG italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT end_ARG start_ARG italic_k end_ARG ) 1 / 2 start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = italic_I start_POSTSUBSCRIPT 1 / 2 end_POSTSUBSCRIPT ( italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT italic_p , italic_n start_POSTSUBSCRIPT bits end_POSTSUBSCRIPT ( 1 - italic_p ) + 1 ) , (12)

where $p = \text{bit acc.}(m, \hat{m})$ and $I_x(a, b)$ is the regularized incomplete Beta function, which expresses this binomial tail probability in closed form.
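Since Eq. (12) is simply the tail of a Binomial($n_\text{bits}$, 1/2) distribution, it can be evaluated with the binomial survival function or with the regularized incomplete Beta function directly. A minimal sketch, assuming SciPy is available (the helper name is ours):

```python
import math

from scipy.special import betainc
from scipy.stats import binom

def pvalue(bit_accuracy: float, n_bits: int) -> float:
    """P(X >= k) for X ~ Binomial(n_bits, 1/2), with k the number of correctly decoded bits."""
    k = math.ceil(n_bits * bit_accuracy)
    return binom.sf(k - 1, n_bits, 0.5)    # sf(k-1) = P(X >= k)

# Cross-check against the closed form of Eq. (12), I_{1/2}(k, n_bits - k + 1):
k = math.ceil(64 * 0.9)
assert math.isclose(betainc(k, 64 - k + 1, 0.5), pvalue(0.9, 64))
```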

In Fig. 10, we show the expected capacity and the $\log_2$ of the $p$-value as a function of the number of bits and the bit accuracy. Interestingly, both metrics follow the exact same trend, with discontinuities for the $p$-value due to the discrete nature of the binomial distribution. These plots show, for instance, that a bit accuracy of 0.9 at a payload of 64 bits is approximately equivalent to a bit accuracy of 0.8 at a payload of 128 bits, in terms of either expected capacity or $p$-value.

Figure 10: Expected capacity and $p$-value as a function of the number of bits.

Note that the $p$-value and capacity discussed here are part of a theoretical analysis aimed at comparing methods for binary message transmission. Unlike the traditional $p$-value of statistical hypothesis testing, which assesses the likelihood of observing a bit accuracy as extreme as the observed one under $H_0$, this $p$-value is not directly tied to the outcome of an actual statistical test. It is purely a conceptual tool to analyze and compare different combinations of bit accuracy and payload size.
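For concreteness, a self-contained sketch (our own code, assuming SciPy is available) evaluates both metrics for the two settings mentioned above:

```python
import math
from scipy.stats import binom

def binary_entropy(p: float) -> float:
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for bit_acc, n_bits in [(0.9, 64), (0.8, 128)]:
    capacity = n_bits * (1 - binary_entropy(1 - bit_acc))        # Eq. (11)
    k = math.ceil(n_bits * bit_acc)                              # correctly decoded bits
    log10_p = math.log10(binom.sf(k - 1, n_bits, 0.5))           # Eq. (12)
    print(f"{n_bits} bits @ {bit_acc:.0%}: capacity ≈ {capacity:.1f} bits, log10(p) ≈ {log10_p:.1f}")
```

Both settings yield an expected capacity of roughly 34 to 36 bits and $p$-values within an order of magnitude of each other, consistent with the approximate equivalence read off Fig. 10.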

9 Additional Details and Results

9.1 More qualitative results

We show in Fig. 11 additional examples of watermarked images from SA-1b, and in Fig. 12 watermarked frames from SA-V videos. They extend the results of Fig. 5.

9.2 Full robustness results

We report the robustness of watermark extraction across many transformations, and for various models, on the SA-1b, COCO, and SA-V datasets. For each transformation type, we report the bit accuracy and the $\log_{10}(p)$, which accounts for the total number of bits, against the PSNR between the watermarked and the original content. When averaging over categories of transformations, as done in Tab. 4, we consider the following groups (a minimal evaluation sketch follows the list):

  • Identity: only the identity;

  • Valuemetric: brightness, contrast, hue, saturation, Gaussian blur, median filter;

  • Compression: JPEG (for images), H.264 and H.265 (for videos);

  • Geometric: horizontal flip, crop, resize, rotation, perspective;

  • Combined: Compression (different CRFs) followed by a crop and a brightness change.
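The sketch below illustrates how such category-wise averaging could be implemented (the transform lists are abridged and the helper names are ours; codec transformations such as JPEG and H.264/H.265 are typically applied through external tools like FFmpeg and are omitted here).

```python
import torchvision.transforms.functional as F

# Abridged mapping from category to transformations (each maps a frame tensor to a frame tensor).
CATEGORIES = {
    "identity": [lambda x: x],
    "valuemetric": [
        lambda x: F.adjust_brightness(x, 1.5),
        lambda x: F.adjust_contrast(x, 1.5),
        lambda x: F.adjust_hue(x, 0.1),
        lambda x: F.adjust_saturation(x, 1.5),
        lambda x: F.gaussian_blur(x, kernel_size=9),
    ],
    "geometric": [
        lambda x: F.hflip(x),
        lambda x: F.rotate(x, 10),
        lambda x: F.resize(x, [s // 2 for s in x.shape[-2:]]),
    ],
}

def robustness_per_category(frames, msg, extract_bit_accuracy):
    """Average bit accuracy over the transformations of each category.

    `extract_bit_accuracy(frames, msg)` is assumed to run the watermark extractor
    and compare the decoded message with the embedded one.
    """
    return {
        name: sum(extract_bit_accuracy(t(frames), msg) for t in transforms) / len(transforms)
        for name, transforms in CATEGORIES.items()
    }
```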

Figure 11: Qualitative results for different watermarking methods (columns: Original, HiDDeN, MBRS, CIN, TrustMark, WAM, \ours). Images are from the SA-1b dataset at their original resolution (≈2k × 1k).

Figure 12: Qualitative results for different watermarking methods (rows: Original, HiDDeN, MBRS, CIN, TrustMark, WAM, \ours). Frames are from the SA-V dataset at their original resolution (≈2k × 1k).
Table 8: Full results for the robustness of watermark extraction on the SA-1b dataset.
HiDDeN MBRS CIN TrustMark WAM Video Seal (ours)
Each column reports Bit acc. ($\uparrow$) / $\log_{10}(p)$ ($\downarrow$).
Identity 1.00 -14.2 0.99 -70.6 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
HorizontalFlip 0.69 -2.4 0.50 -0.5 0.50 -0.4 1.00 -29.9 1.00 -9.6 0.99 -27.2
Rotate 5 0.93 -10.1 0.50 -0.4 0.50 -0.4 0.61 -3.4 0.98 -8.7 0.98 -26.0
Rotate 10 0.83 -5.9 0.50 -0.4 0.50 -0.3 0.51 -0.5 0.72 -2.3 0.96 -23.3
Rotate 30 0.55 -0.7 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.58 -1.4
Rotate 45 0.50 -0.4 0.50 -0.4 0.50 -0.3 0.50 -0.4 0.50 -0.4 0.51 -0.5
Rotate 90 0.49 -0.4 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.50 -0.4
Resize 0.32 0.99 -14.0 0.98 -69.1 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.45 1.00 -14.1 0.99 -70.1 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.55 1.00 -14.2 0.99 -70.2 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.63 1.00 -14.2 0.99 -70.4 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.71 1.00 -14.2 0.99 -70.4 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.77 1.00 -14.2 0.99 -70.4 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.84 1.00 -14.2 0.99 -70.5 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.89 1.00 -14.2 0.99 -70.5 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 0.95 1.00 -14.2 0.99 -70.5 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Resize 1.0 1.00 -14.2 0.99 -70.6 1.00 -9.0 1.00 -29.9 1.00 -9.6 0.99 -27.3
Crop 0.32 0.48 -0.3 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.79 -3.8 0.50 -0.4
Crop 0.45 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.50 -0.4 0.94 -7.5