License: arXiv.org perpetual non-exclusive license
arXiv:2410.09472v2 [cs.SD]

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
Qiuqiang Kong and Xie Chen are the corresponding authors.
Codes and models are available at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/drcap_zeroshot_aac.

Xiquan Li1,2, Wenxi Chen1, Ziyang Ma1, Xuenan Xu1, Yuzhe Liang1, Zhisheng Zheng1,
Qiuqiang Kong3†, Xie Chen1†
1MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University, China
2SJTU Paris Elite Institute of Technology, Shanghai Jiao Tong University, China
3Department of Electronics Engineering, The Chinese University of Hong Kong, China
Abstract

While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use both the projection strategy from the encoder side and the retrieval-augmented generation strategy from the decoder side. Specifically, audio embeddings are first projected onto a text embedding support to absorb extensive semantic information within the joint multi-modal space of CLAP. At the same time, similar captions retrieved from a datastore are fed as prompts to instruct the LLM, incorporating external knowledge to take full advantage of its strong generative capability. Conditioned on both the projected CLAP embedding and the retrieved similar captions, the model is able to produce a more accurate and semantically rich textual description. By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner. Experimental results demonstrate that DRCap outperforms all other zero-shot models in in-domain scenarios and achieves state-of-the-art performance in cross-domain scenarios.

Index Terms:
Zero-shot AAC, CLAP, LLM, RAG

I Introduction

Automated audio captioning (AAC) is a cross-modal translation task that seeks to generate natural language descriptions for given audio clips [1]. This process involves detailing the audio in terms of acoustic scenes, temporal relationships, object interactions, and environmental context [2]. Conventional AAC models often employ an encoder-decoder architecture [3], where an audio encoder extracts fine-grained audio features and a text decoder generates captions auto-regressively conditioned on these audio representations. The audio encoders used in previous studies [4, 5, 6, 7, 8] are often pre-trained on tasks such as audio tagging or sound event detection [9, 10, 11], while the text decoders are pre-trained large language models (LLMs) with extensive encyclopedic knowledge, such as BART [12] or GPT-2 [13].

Despite the significant strides made in AAC, most fully supervised models still rely on extensively human-annotated datasets for training. However, data scarcity remains a critical issue for AAC, as annotating audio data demands careful attention and complex analysis. Moreover, given the diversity in audio concepts [6] and annotation styles across different datasets [14], existing fully supervised models often lack the flexibility to generalize to new domains, leading to diminished performance in cross-domain evaluations, where training and test data come from two distinct datasets.

To overcome these challenges, researchers have proposed zero-shot audio captioning frameworks [15, 16, 17, 18, 19], which seek to generate captions without training on costly audio-text pair data. These works typically leverage the multi-modal capabilities of the CLAP model [20, 21, 22]. To bridge the modality gap [23] of CLAP, Deshmukh et al. [15] and Kouzelis et al. [17] injected Gaussian noise into CLAP latents, while Zhang et al. [16] crafted soft and hard prompts. However, adding noise can diminish the rich semantic information within the CLAP multi-modal space, while the fixed-category hard prompt in [16] risks misleading the decoder. Moreover, text decoders in previous works struggle to decode CLAP latents into accurate descriptions containing multiple sound events. A stronger LLM is required to fully leverage this joint multi-modal space. As a result, although existing zero-shot audio captioning models demonstrate strong performance in cross-domain scenarios [16], they still lag significantly behind fully supervised models in in-domain scenarios.


Figure 1: Left: Overview of the CLAP model and the modality gap within its latent space. Right: Overview of the proposed DRCap. Based on the aligned multi-modal space of CLAP [20], during training, DRCap learns to decode the text embedding to reconstruct the original caption. Only the linear mapping network $m$ is trained, while the LLM is fine-tuned using the LoRA [24] method. Similarity selection is employed to prevent the learning collapse caused by text-to-text retrieval. During inference, the audio embedding is first projected onto a text embedding support $\mathcal{S}$, mitigating the modality gap. The top-$k$ most similar captions retrieved from the datastore $\mathcal{DS}$ are used as prompts to instruct the LLM, producing accurate and semantically rich descriptions. Both the text embedding support $\mathcal{S}$ and the datastore $\mathcal{DS}$ can be changed at the inference stage, offering DRCap the flexibility to adapt to new domains.

In this paper, we propose DRCap, a data-efficient and transferable zero-shot audio captioning system that leverages the synergy between CLAP and LLM. Based on the aligned multi-modal space of CLAP, DRCap requires only textual data for training, where a Vicuna-7B [25] is fine-tuned with LoRA [24] to reconstruct the original caption from the CLAP text embedding. During inference, the text encoder is replaced by the audio encoder. To mitigate the modality gap and enhance the quality of the generated caption, we use both the projection strategy [26] from the encoder side and the retrieval-augmented generation [27] strategy from the decoder side. Specifically, during inference, audio embeddings are first projected onto a text embedding support, while semantically similar captions retrieved from an external datastore are fed as prompts to direct the LLM to create more accurate descriptions. Moreover, both the text embedding support and the caption datastore can be customized to match the target domain, providing our model with robust adaptability to new domains. Experimental results demonstrate that DRCap performs comparably to fully supervised methods in in-domain scenarios and achieves state-of-the-art results in cross-domain scenarios.

II Methods

II-A Overview

We leverage the joint multi-modal space of CLAP to perform text-only training and then infer on audio clips in a zero-shot manner. As illustrated in Figure 1 (left), CLAP jointly trains an audio encoder $f_a(\cdot)$ and a text encoder $f_t(\cdot)$ to align semantically similar audio-text pairs in a shared embedding space. After training, $f_a(a) \approx f_t(t)$ holds for any audio-text pair $(a, t)$.

Given a raw caption $t \in \mathcal{T}$, where $\mathcal{T}$ represents a caption corpus, the objective in text-only training is to decode its CLAP text embedding $E_t = f_t(t)$ back into the original caption $t$. To achieve this, we train a lightweight linear mapping network $m$ to align the CLAP latent space with the LLM, producing $e_t = m(E_t)$. The LLM then reconstructs the original caption using $e_t$ along with an additional encoded prompt discussed in Section II-B. During inference, given an audio clip $a$, the text encoder is replaced with the audio encoder of CLAP, extracting the audio embedding $E_a = f_a(a)$. Due to the modality gap between audio and text embeddings, directly feeding $E_a$ to the LLM through $m$ will yield sub-optimal results. To address this issue and enhance the quality of generated captions, both the retrieval-augmented generation strategy and the projection-based decoding strategy are employed, detailed in Section II-B and Section II-C respectively.
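To make the training/inference asymmetry concrete, the following is a minimal PyTorch-style sketch of one text-only training step and one zero-shot inference step. The interfaces (`clap_text_encoder`, `clap_audio_encoder`, `llm`, `project`, `retrieve`) and the embedding sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

d_clap, d_llm = 512, 4096           # assumed embedding dimensions
mapper = nn.Linear(d_clap, d_llm)   # linear mapping network m (the only fully trained module)

def train_step(caption, clap_text_encoder, llm, loss_fn):
    """Text-only training: reconstruct the caption from its CLAP text embedding."""
    with torch.no_grad():
        E_t = clap_text_encoder(caption)    # frozen CLAP text encoder f_t
    e_t = mapper(E_t)                       # align the CLAP latent with the LLM input space
    logits = llm(prefix_embeds=e_t)         # LoRA-tuned LLM; interface is assumed
    return loss_fn(logits, caption)

@torch.no_grad()
def zero_shot_caption(audio, clap_audio_encoder, llm, project, retrieve):
    """Inference: swap in the CLAP audio encoder f_a, project, retrieve, and decode."""
    E_a = clap_audio_encoder(audio)         # CLAP audio embedding
    E_t_proj = project(E_a)                 # projection onto the text embedding support (Sec. II-C)
    prompts = retrieve(E_a)                 # similar captions from the datastore (Sec. II-B)
    return llm.generate(prefix_embeds=mapper(E_t_proj), prompts=prompts)
```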

II-B Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) combines information retrieved from a datastore with a generative model, allowing it to generate more accurate, context-aware outputs by incorporating external knowledge. DRCap leverages the RAG method to take full advantage of the generative abilities of the LLM, bridging the modality gap and improving its ability in describing unseen sound events.

During training, given a raw caption $t$, we use its CLAP text embedding $E_t = f_t(t)$ to retrieve semantically similar captions from the datastore $\mathcal{DS}$. The retrieval process for a candidate caption $t_i \in \mathcal{DS}$ is based on the cosine similarity between their respective CLAP text embeddings, calculated as follows:

S(t,ti)=ft(t)ft(ti)ft(t)ft(ti)S(t,t_{\text{i}})=\frac{f_{t}(t)\cdot f_{t}(t_{\text{i}})}{\|f_{t}(t)\|\cdot\|% f_{t}(t_{\text{i}})\|}italic_S ( italic_t , italic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ) = divide start_ARG italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t ) ⋅ italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ) end_ARG start_ARG ∥ italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t ) ∥ ⋅ ∥ italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ) ∥ end_ARG (1)

We noticed, however, that naively selecting the top $k$ most similar captions can lead the LLM to become lazy, merely reproducing one of the $k$ captions as the output while neglecting $e_t$. To address this issue of learning collapse and improve the model's robustness, we propose a similarity selection strategy, defining a similarity range $[S_{\text{min}}, S_{\text{max}}]$ from which $k$ captions are randomly selected. If fewer than $k$ captions fall within this range, only the qualifying captions are used as input. The effectiveness of our strategy is verified in Section IV-B.
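As a concrete illustration of the similarity selection described above, the sketch below implements the training-time text-to-text retrieval step with a similarity window. The default values of $k$, $S_{\text{min}}$, and $S_{\text{max}}$ follow Section III-C, while the array-based datastore layout is an assumption.

```python
import numpy as np

def retrieve_training_prompts(E_t, ds_embs, ds_caps, k=3, s_min=0.75, s_max=0.85, rng=None):
    """Text-to-text retrieval (Eq. 1) with similarity selection: keep only captions
    whose cosine similarity lies in [s_min, s_max], then randomly sample at most k
    of them to avoid learning collapse."""
    rng = rng or np.random.default_rng()
    q = E_t / np.linalg.norm(E_t)
    D = ds_embs / np.linalg.norm(ds_embs, axis=1, keepdims=True)
    sims = D @ q                                        # cosine similarity to every datastore caption
    idx = np.where((sims >= s_min) & (sims <= s_max))[0]
    if len(idx) > k:
        idx = rng.choice(idx, size=k, replace=False)    # random choice within the window
    return [ds_caps[i] for i in idx]
```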

Additionally, the model is given a fixed prompt (e.g., "Describe the audio you hear") to help the LLM better understand the task. The similar captions and the fixed prompt are encoded using the tokenizer of the LLM. Let $e_s$ and $e_p$ denote the encoded embeddings of the similar captions and the fixed prompt. The model is trained to minimize the cross-entropy loss conditioned on $z = \text{Concat}(e_t, e_s, e_p)$:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{L_t} \sum_{i=1}^{L_t} \log p(t_i \mid z, t_1, \ldots, t_{i-1}) \qquad (2)$$

where $e_t = m(f_t(t))$ is the mapped CLAP text embedding, $L_t$ is the length of the input caption $t$, and $t_i$ is the $i$-th token of $t$. During training, we froze the CLAP encoder and trained only the linear mapping network, while applying LoRA [24] to fine-tune the large language model, which significantly enhanced training efficiency.
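A minimal sketch of the objective in Eq. (2) is given below, assuming a Hugging-Face-style causal LM that accepts `inputs_embeds` and returns `.logits` (an assumption about the wrapper, not the released code). The prefix $z$ is prepended to the caption token embeddings and the loss is computed only over caption tokens; only the LoRA adapters and the mapper would receive gradients.

```python
import torch
import torch.nn.functional as F

def caption_ce_loss(e_t, e_s, e_p, caption_embs, caption_ids, llm):
    """Eq. (2): token-level cross-entropy conditioned on z = [e_t; e_s; e_p].
    e_t: (B, 1, d) mapped CLAP embedding; e_s, e_p: encoded prompts, (B, Ls, d) / (B, Lp, d);
    caption_embs: (B, Lt, d) target caption token embeddings; caption_ids: (B, Lt)."""
    z = torch.cat([e_t, e_s, e_p], dim=1)
    inputs = torch.cat([z, caption_embs], dim=1)        # prefix followed by caption embeddings
    logits = llm(inputs_embeds=inputs).logits           # HF-style causal LM output (assumed interface)
    L_z, L_t = z.size(1), caption_ids.size(1)
    pred = logits[:, L_z - 1 : L_z - 1 + L_t, :]        # positions whose next token is a caption token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))
```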

During inference, given an audio clip $a$, the text-to-text retrieval is replaced with audio-to-text retrieval, where we use the CLAP audio embedding $E_a = f_a(a)$ to retrieve the $k$ most similar captions. The cross-modal similarity for a candidate caption $t_i \in \mathcal{DS}$ is defined as:

S(a,ti)=fa(a)ft(ti)fa(a)ft(ti)S(a,t_{i})=\frac{f_{a}(a)\cdot f_{t}(t_{i})}{\|f_{a}(a)\|\cdot\|f_{t}(t_{i})\|}italic_S ( italic_a , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = divide start_ARG italic_f start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( italic_a ) ⋅ italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG ∥ italic_f start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( italic_a ) ∥ ⋅ ∥ italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∥ end_ARG (3)

The similarity selection is turned off, and instead, we choose the most similar captions to provide the LLM with maximum information.
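For contrast with the training-time retrieval above, a short sketch of the inference-time audio-to-text retrieval of Eq. (3), with similarity selection disabled and the top-$k$ captions kept (same assumed datastore layout as before):

```python
import numpy as np

def retrieve_inference_prompts(E_a, ds_embs, ds_caps, k=3):
    """Audio-to-text retrieval (Eq. 3): rank datastore captions by cross-modal
    cosine similarity with the CLAP audio embedding and keep the top k."""
    q = E_a / np.linalg.norm(E_a)
    D = ds_embs / np.linalg.norm(ds_embs, axis=1, keepdims=True)
    topk = np.argsort(D @ q)[::-1][:k]
    return [ds_caps[i] for i in topk]
```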

TABLE I: Performance comparison of AAC models for in-domain scenarios.

                               Clotho (%)                           AudioCaps (%)
Method                         METEOR  CIDEr  SPICE  SPIDEr  FENSE  METEOR  CIDEr  SPICE  SPIDEr  FENSE
Fully Supervised Audio Captioning
Prefix AAC [4]                 17.0    39.2   11.8   25.5    -      24.0    73.3   17.7   45.5    -
RECAP [6]                      17.7    41.1   12.5   22.4    -      25.6    75.1   18.6   47.1    -
EnCLAP-large [28]              18.2    42.6   12.9   27.8    50.7   25.5    80.3   18.8   49.5    65.5
Zero-shot Audio Captioning
ZerAuCap [18]                  9.4     14.0   5.3    9.7     -      12.3    28.1   8.6    18.3    -
WSAC [17]                      17.4    37.1   12.3   24.7    -      24.1    63.3   17.3   40.3    -
Zhang et al. [16]              17.5    41.1   12.2   26.7    48.8   22.0    64.4   15.6   40.0    -
DRCap (ours)                   18.2    43.8   13.3   28.5    53.0   25.3    70.5   18.0   44.2    66.2
DRCapLAION (ours)              17.9    42.5   12.4   27.5    51.9   25.7    73.7   17.9   45.8    66.4

Note: we evaluated metrics not reported in the original papers using the officially released checkpoint. DRCapLAION: DRCap with LAION-CLAP as the encoder.

II-C Projection-based Decoding

Moreover, during inference, instead of directly feeding the audio embedding $E_a$ to the LLM, we first project it onto the text embedding space of the CLAP model. Suppose that the system is trained on a caption corpus $\mathcal{T} = \{t_1, t_2, \ldots, t_N\}$, where $N$ denotes the size of $\mathcal{T}$. We can accumulate the text embeddings used during training, creating an embedding support $\mathcal{S} = \{E_1, E_2, \ldots, E_N\}$, where $E_i = f_t(t_i)$. For a given audio embedding $E_a$, its corresponding projected text-like embedding $E_t'$ can be obtained by performing a weighted combination of the text embeddings within the support:

$$E_t' = \sum_{i=1}^{N} \frac{\exp\left((E_a^{\top} \cdot E_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left((E_a^{\top} \cdot E_j)/\tau\right)} \cdot E_i \qquad (4)$$

where $\tau$ is a temperature parameter. The projected vector $E_t'$ can capture the extensive semantic information from the support while keeping its original acoustic features. $E_t'$ is then aligned with the LLM through the linear mapper $m$. Conditioned on both the projected CLAP embedding and the encoded similar captions, the LLM is able to generate more refined textual descriptions in a zero-shot manner.
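The projection of Eq. (4) is simply a softmax-weighted average over the text embedding support. A minimal NumPy sketch follows; the temperature value used here is illustrative, not taken from the paper.

```python
import numpy as np

def project_onto_support(E_a, support_embs, tau=0.01):
    """Eq. (4): map the CLAP audio embedding to a text-like embedding via a
    softmax-weighted combination of the support embeddings. tau is the
    temperature; its value here is an assumption for illustration."""
    logits = (support_embs @ E_a) / tau       # (N,) inner products E_a^T E_i / tau
    logits -= logits.max()                    # numerical stability
    w = np.exp(logits)
    w /= w.sum()                              # softmax weights over the support
    return w @ support_embs                   # E_t' = sum_i w_i * E_i
```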

II-D Domain Adaptation

With the assistance of the text embedding support $\mathcal{S}$ and the caption datastore $\mathcal{DS}$, DRCap is capable of generating precise and meaningfully detailed captions. Moreover, the modifiability of both $\mathcal{S}$ and $\mathcal{DS}$ provides DRCap with the flexibility to quickly adapt to new domains. When encountering new sound event domains, relevant captions can be integrated into the text embedding support and the datastore. The multi-modal latent space of CLAP can then provide meaningful projected embeddings to decode, with similar captions guiding the LLM to describe the audio. Notably, no further training is needed for this entire process. The construction of $\mathcal{DS}$ and $\mathcal{S}$ is discussed in Section III-B.
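Because adaptation only touches $\mathcal{S}$ and $\mathcal{DS}$, it reduces to re-encoding target-domain captions. A sketch under the same assumed interfaces as the earlier snippets:

```python
import numpy as np

def adapt_to_new_domain(target_captions, clap_text_encoder):
    """Training-free domain adaptation: rebuild the text embedding support S
    (used by Eq. 4) and the caption datastore DS (used by Eq. 3) from
    target-domain captions. No model weights are updated."""
    embs = np.stack([clap_text_encoder(c) for c in target_captions])
    support = embs                              # S: text embedding support
    datastore = (embs, list(target_captions))   # DS: embeddings paired with their captions
    return support, datastore
```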

III Experimental Settings

III-A Datasets

We train and evaluate DRCap on the two most widely used AAC datasets, AudioCaps [29] and Clotho [30]. AudioCaps is a subset of AudioSet [31] that has been reannotated with caption labels. Each audio clip is annotated with a single caption in the training set and five captions in the validation and test sets. Our downloaded version contains 49274 examples for the training set, 494 for the validation set, and 957 for the test set. Clotho consists of audio clips sourced from Freesound, each labeled with 5 captions. In our experiment, we use version 2.1 of Clotho, which contains 3839 examples in the training set, 1045 in the validation set, and 1045 in the test set.

The frozen CLAP model employed to extract audio and text embeddings is trained on WavCaps [32] and Sound-VECaps [33]. WavCaps comprises approximately 400k audio clips sourced from AudioSet-SL [34], BBC Sound Effects, FreeSound, and SoundBible, while Sound-VECaps contains approximately 1.6M audio clips sourced from AudioSet. Both datasets are weakly annotated with the assistance of ChatGPT [35]. We filtered out the audio clips in the dataset that overlap with AudioCaps or Clotho, assuming that the target-domain audio data are unavailable during training.

We also evaluated the performance of DRCap with the widely used CLAP model LAION-CLAP-630k[20], which is pre-trained on AudioCaps, Clotho and keyword-to-caption augmented AudioSet. Note that this scenario no longer qualifies as strict zero-shot audio captioning, as LAION-CLAP’s pre-training has already utilized human-annotated audio-text pair data from AudioCaps and Clotho. This is also the reason why we opted to re-train a CLAP model using only weakly annotated data as described above.

TABLE II: Performance comparison of AAC models for cross-domain scenarios.

                               AudioCaps → Clotho (%)              Clotho → AudioCaps (%)
Method                         METEOR  CIDEr  SPICE  SPIDEr  FENSE  METEOR  CIDEr  SPICE  SPIDEr  FENSE
Fully Supervised Audio Captioning
EnCLAP-large [28]              11.1    13.8   5.9    9.9     36.1   13.3    17.4   8.0    12.6    38.8
Prefix AAC [4]                 11.2    19.2   7.4    13.3    -      14.4    21.1   8.3    14.7    -
RECAP [6]                      15.7    33.1   10.0   20.9    -      16.9    35.7   11.1   20.4    -
Zero-shot Audio Captioning
WSAC [17]                      12.0    20.6   8.2    14.4    -      17.3    25.6   12.0   18.8    -
Zhang et al. [16]              13.2    24.8   9.3    17.1    -      18.2    33.7   12.4   23.0    52.1
DRCap (ours)                   15.0    33.3   10.4   21.8    52.2   22.9    44.3   17.0   30.6    62.6
DRCapLAION (ours)              17.3    24.8   12.3   18.6    48.2   21.7    45.4   14.7   30.0    60.5

Notes: some metrics were evaluated on our test split using the officially released checkpoint; other results were provided by Zhang et al. [16] based on their re-implementation.

III-B Experimental Setup

To comprehensively evaluate the performance of DRCap, we conduct experiments in both in-domain and cross-domain setups: (1) we train and evaluate the model on the same dataset $\mathcal{D}_{source}$; (2) we train the model on the training set of $\mathcal{D}_{source}$ and evaluate on the test set of another dataset $\mathcal{D}_{target}$. During inference, for scenario (1), we use all text embeddings accumulated in the training stage as the support $\mathcal{S}$ mentioned in Section II-C, which corresponds to the text embeddings of all captions in the training set of $\mathcal{D}_{source}$. For (2), we curate the text embedding support by encoding all captions from the training set of $\mathcal{D}_{target}$. In both settings, we use a caption datastore $\mathcal{DS}$ consisting of 450k captions sourced from WavCaps and the training sets of AudioCaps and Clotho.

III-C Implementation Details

DRCap was trained for 40,000 steps on AudioCaps and 20,000 steps on Clotho, with a peak learning rate of 1e-5 and 1,000 warm-up steps followed by a linear decay. We use the Adam optimizer [36] and a batch size of 4. Validation was performed every 1,000 steps, and the checkpoint with the lowest validation loss was saved for evaluation. The number of captions retrieved is set to $k=3$, and the range of similarity selection is fixed as $S_{min}=0.75$, $S_{max}=0.85$. Our CLAP model, which employs the text encoder RoBERTa [37] and the audio encoder HTS-AT [10], was trained on WavCaps and Sound-VECaps with a batch size of 256 and a peak learning rate of 5e-5 for 15 epochs. Training followed a cosine annealing schedule with a 2-epoch warm-up phase, and the model from the last epoch was used.
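One way to realize the warm-up-then-linear-decay schedule described above is a LambdaLR multiplier in PyTorch; the sketch below is an assumption about the implementation, with the hyperparameters taken from this subsection (AudioCaps setting).

```python
import torch

def build_optimizer(params, total_steps=40_000, warmup_steps=1_000, peak_lr=1e-5):
    """Adam with a 1e-5 peak LR, 1000 warm-up steps, then linear decay to zero
    (one plausible realization of the schedule in Sec. III-C)."""
    optimizer = torch.optim.Adam(params, lr=peak_lr)

    def lr_multiplier(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                          # linear warm-up
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))  # linear decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
    return optimizer, scheduler
```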

IV Experimental Results

IV-A Main Results

Tables I and II present the performance of DRCap in both in-domain and cross-domain settings. We evaluated DRCap with our re-trained CLAP encoder or the widely used LAION-CLAP, denoted as DRCap and DRCapLAION respectively. Results of DRCapLAION are grayed out for reference only, since it cannot be strictly considered as zero-shot audio captioning, as described in Section III-A. We compare DRCap’s performance with fully supervised AAC models: EnCLAP [28], Prefix-AAC [4], RECAP [6], and zero-shot audio captioning models: ZerAuCap [18], WSAC [17] and Zhang et al. [16]. ZerAuCap [18] uses CLAP to guide the LLM to generate descriptions, WSAC [17] trains a text decoder using the prefix language modeling paradigm conditioned on CLAP embeddings, while Zhang et al. [16] crafts soft and hard prompts to bridge the modality gap between audio and text embeddings of CLAP.

Regardless of which CLAP model is employed, DRCap surpasses all competitive zero-shot audio captioning systems in in-domain scenarios by a large margin and is comparable with other fully supervised methods. For cross-domain scenarios, it achieves state-of-the-art results across all metrics, highlighting its robust domain-transfer capability. Furthermore, we found that DRCap outperforms other methods in terms of the FENSE [38] score in both scenarios. We hypothesize that this advantage is due to DRCap's ability to use the multi-modal space of CLAP, allowing it to generate captions of better quality.

IV-B Ablation Study

We conduct a comprehensive ablation study to validate each component of DRCap.

Similarity Selection. We turned off the similarity selection discussed in Section II-B, instead selecting the top $k$ most similar captions during training. Since the ground-truth captions are available in the training stage, the retrieved most similar captions closely match the target in both semantics and vocabulary. This could lead the LLM to simply copy one of the retrieved captions as the output, trivializing the captioning task. However, during inference, without access to textual information, audio-to-text retrieval struggles to match the quality of text-to-text retrieval, and simply copying the retrieved captions hinders the model's performance, as shown in Table III and Table IV. Our proposed similarity selection strategy significantly alleviates the learning collapse and compels the LLM to take into account both the CLAP embedding and the retrieved captions, improving generation quality in both in-domain and cross-domain settings.

Retrieval-Augmented Generation. We dropped all the similar captions in both the training and inference stages to evaluate the impact of RAG. As illustrated in Table III and Table IV, conditioning solely on CLAP embeddings results in inferior performance across all metrics in both scenarios, showing the advantage of similar captions in guiding the LLM to generate more accurate descriptions.

LLM Fine-tuning. We froze the LLM during training to conduct the ablation study on LoRA. Tables III and IV highlight the significance of efficient LLM fine-tuning. Integrating LoRA adapters proved effective in aligning the CLAP latent space with the LLM.

TABLE III: Ablation Study of DRCap for in-domain scenarios.

                               AudioCaps (%)
Main Components                METEOR  CIDEr  SPICE  SPIDEr  FENSE
DRCap                          25.3    70.5   18.0   44.2    66.2
  - w/o SS                     21.8    59.5   15.7   37.6    61.8
  - w/o RAG                    25.0    69.2   18.4   43.7    65.5
  - w/o LoRA                   23.7    64.7   16.4   40.5    64.1
  - w/o PD                     19.9    31.2   13.4   22.3    55.4

Note: SS stands for similarity selection; PD for projection-based decoding.
TABLE IV: Ablation Study of DRCap for cross-domain scenarios.

                               AudioCaps → Clotho (%)
Main Components                METEOR  CIDEr  SPICE  SPIDEr  FENSE
DRCap                          15.0    33.3   10.4   21.8    52.2
  - w/o SS                     13.3    22.3   8.9    15.6    46.0
  - w/o RAG                    14.2    30.1   10.2   20.1    51.3
  - w/o LoRA                   14.1    29.8   9.8    19.8    51.1
  - w/o TD                     13.4    27.5   9.3    18.4    50.6
  - w/o PD                     13.2    22.8   8.6    15.7    46.4

Note: SS denotes similarity selection; TD denotes target domain information; PD denotes projection-based decoding.

Projection-based Decoding. We directly fed the audio embedding $E_a$ to the linear mapping network $m$ without using projection during inference to assess the benefit of projection-based decoding (PD). As illustrated in Table III and Table IV, the modality gap caused a significant drop in performance when $E_a$ was used directly, while PD effectively bridges the discrepancy between audio and text embeddings.

Target Domain Information. We assume that no prior knowledge of the target domain is provided for cross-domain scenarios. Specifically, during inference, we use captions from the training set of $\mathcal{D}_{source}$ to construct $\mathcal{S}$, rather than using captions from the training set of $\mathcal{D}_{target}$. Furthermore, all captions from $\mathcal{D}_{target}$ are excluded from the datastore $\mathcal{DS}$. As shown in Table IV, incorporating domain knowledge greatly improves DRCap's cross-domain performance, demonstrating its adaptability during inference. Moreover, despite the absence of target domain knowledge, DRCap still performs competitively with SOTA methods, as shown in Table II.

V Conclusion and Future Work

We present DRCap, a data-efficient and flexible audio captioning model that requires only textual data for training and can quickly adapt to other domains. Based on the CLAP model and the LLM, DRCap leverages projection-based decoding and retrieval-augmented generation to mitigate the modality gap. Conditioned on both the projected CLAP embedding and the retrieved similar captions, DRCap could produce more accurate and semantically rich descriptions. The replaceability of the text embedding support and the caption datastore guarantees the adaptability of the model. Experimental results show that DRCap outperforms other zero-shot audio captioning models in in-domain scenarios and achieves state-of-the-art performance in cross-domain scenarios.

Acknowledgment

This work was supported by the Science and Technology Innovation (STI) 2030-Major Project (2022ZD0208700), the National Natural Science Foundation of China (No. 62206171 and No. U23B2018), Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102 and the International Cooperation Project of PCL.

References

  • [1] X. Mei, X. Liu, M. D. Plumbley, and W. Wang, “Automated audio captioning: An overview of recent progress and new challenges,” EURASIP journal on audio, speech, and music processing, 2022.
  • [2] X. Xu, Z. Xie, M. Wu, and K. Yu, “Beyond the status quo: A contemporary survey of advances and challenges in audio captioning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Proc. NeurIPS, 2014.
  • [4] M. Kim, K. Sung-Bin, and T.-H. Oh, “Prefix tuning for automated audio captioning,” in Proc. ICASSP, 2023.
  • [5] T. Pellegrini, I. Khalfaoui-Hassani, E. Labbé, and T. Masquelier, “Adapting a ConvNeXt model to audio classification on AudioSet,” arXiv preprint arXiv:2306.00830, 2023.
  • [6] S. Ghosh, S. Kumar, C. K. R. Evuru, R. Duraiswami, and D. Manocha, “Recap: retrieval-augmented audio captioning,” in Proc. ICASSP, 2024.
  • [7] S.-L. Wu, X. Chang, G. Wichern, J.-w. Jung, F. Germain, J. Le Roux, and S. Watanabe, “Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation,” in Proc. ICASSP, 2024.
  • [8] W. Chen, X. Li, Z. Ma, Y. Liang, A. Jiang, Z. Zheng, Y. Qian, P. Fan, W.-Q. Zhang, C. Lu et al., “SJTU-THU automated audio captioning system for DCASE 2024,” DCASE Challenge, Tech. Rep., 2024.
  • [9] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
  • [10] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. ICASSP, 2022.
  • [11] W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “EAT: Self-supervised pre-training with efficient audio transformer,” in Proc. IJCAI, 2024.
  • [12] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
  • [13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, 2019.
  • [14] I. Martin Morato and A. Mesaros, “Diversity and bias in audio captioning datasets,” DCASE, 2021.
  • [15] S. Deshmukh, B. Elizalde, D. Emmanouilidou, B. Raj, R. Singh, and H. Wang, “Training audio captioning models without audio,” in Proc. ICASSP, 2024.
  • [16] Y. Zhang, X. Xu, R. Du, H. Liu, Y. Dong, Z.-H. Tan, W. Wang, and Z. Ma, “Zero-shot audio captioning using soft and hard prompts,” arXiv preprint arXiv:2406.06295, 2024.
  • [17] T. Kouzelis and V. Katsouros, “Weakly-supervised automated audio captioning via text only training,” arXiv preprint arXiv:2309.12242, 2023.
  • [18] L. Salewski, S. Fauth, A. Koepke, and Z. Akata, “Zero-shot audio captioning with audio-language model guidance and audio context keywords,” arXiv preprint arXiv:2311.08396, 2023.
  • [19] T. Shaharabany, A. Shaulov, and L. Wolf, “Zero-shot audio captioning via audibility guidance,” arXiv preprint arXiv:2309.03884, 2023.
  • [20] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. ICASSP, 2023.
  • [21] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in Proc. ICASSP, 2023.
  • [22] B. Elizalde, S. Deshmukh, and H. Wang, “Natural language supervision for general-purpose audio representations,” in Proc. ICASSP, 2024.
  • [23] V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,” Proc. NeurIPS, 2022.
  • [24] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” Proc. ICLR, 2022.
  • [25] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality, march 2023,” URL https://lmsys.org/blog/2023-03-30-vicuna, 2023.
  • [26] W. Li, L. Zhu, L. Wen, and Y. Yang, “DeCap: Decoding clip latents for zero-shot captioning via text-only training,” arXiv preprint arXiv:2303.03032, 2023.
  • [27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Proc. NeurIPS, 2020.
  • [28] J. Kim, J. Jung, J. Lee, and S. H. Woo, “EnCLAP: Combining neural audio codec and audio-text joint embedding for automated audio captioning,” in Proc. ICASSP, 2024.
  • [29] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proc. NAACL, 2019.
  • [30] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. ICASSP, 2020.
  • [31] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017.
  • [32] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
  • [33] Y. Yuan, D. Jia, X. Zhuang, Y. Chen, Z. Liu, Z. Chen, Y. Wang, Y. Wang, X. Liu, M. D. Plumbley et al., “Improving audio generation with visual enhanced caption,” arXiv preprint arXiv:2407.04416, 2024.
  • [34] S. Hershey, D. P. Ellis, E. Fonseca, A. Jansen, C. Liu, R. C. Moore, and M. Plakal, “The benefit of temporally-strong labels in audio event classification,” in Proc. ICASSP, 2021.
  • [35] J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, J. F. C. Uribe, L. Fedus, L. Metz, M. Pokorny et al., “Introducing ChatGPT,” OpenAI Blog, 2022.
  • [36] D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [38] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, “Can audio captions be evaluated with image caption metrics?” in Proc. ICASSP, 2022.