Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed
Alexandre Défossez
Facebook AI Research
INRIA / École Normale Supérieure
PSL Research University
Paris, France
defossez@fb.com

Nicolas Usunier
Facebook AI Research
Paris, France
usunier@fb.com

Léon Bottou
Facebook AI Research
New York, USA
leonb@fb.com

Francis Bach
INRIA / École Normale Supérieure
PSL Research University
Paris, France
francis.bach@ens.fr
Abstract
We study the problem of source separation for music using deep learning with four known sources: drums, bass, vocals and other accompaniments. State-of-the-art approaches predict soft masks over mixture spectrograms, while methods working on the waveform are lagging behind as measured on the standard MusDB [22] benchmark. Our contribution is twofold. (i) We introduce a simple convolutional and recurrent model that outperforms the state-of-the-art model on waveforms, that is, Wave-U-Net [28], by 1.6 points of SDR (signal to distortion ratio). (ii) We propose a new scheme to leverage unlabeled music. We train a first model to extract parts with at least one silent source from unlabeled tracks, for instance parts without bass. We remix such an extract with a bass line taken from the supervised dataset to form a new weakly supervised training example. Combining our architecture and scheme, we show that waveform methods can play in the same ballpark as spectrogram ones.
1 Introduction
Cherry first noticed the "cocktail party effect" [5]: how the human brain is able to separate a single conversation out of the surrounding noise in a room full of people chatting. Bregman later tried to understand how the brain is able to analyse a complex auditory signal and segment it into higher level streams. His framework for auditory scene analysis [4] spawned its computational counterpart, which tries to reproduce or model the accomplishments of the brain with algorithmic means [36].
When producing music, recordings of individual instruments called stems are arranged together and mastered into the final song. The goal of source separation is then to recover those individual stems from the mixed signal. Unlike the cocktail party problem, there is not a single source of interest to differentiate from an unrelated background noise, but instead a wide variety of tones and timbres playing in a coordinated way. As part of the SiSec Mus evaluation campaign for music separation [29], a choice was made to regroup those individual stems into 4 broad categories: (1) drums, (2) bass, (3) other, (4) vocals.
Each source is represented by a waveform $s_i \in \mathbb{R}^{C \times T}$ where $C$ is the number of channels (1 for mono, 2 for stereo) and $T$ the number of samples. We define $S := (s_i)_i$ the concatenation of sources in a tensor of size $4 \times C \times T$ and $x := \sum_i s_i$ the mixture. We aim at training a model $g$ that minimises

$$\min_{\theta} \sum_{x \in D} L(g_\theta(x), S) \qquad (1)$$

for some dataset $D$, reconstruction error $L$, model architecture $g$ with 4 outputs, and model weights $\theta$.
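To make the objective concrete, here is a minimal PyTorch sketch of (1) with an L1 reconstruction error; the model and tensor shapes are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def separation_loss(model, mix, sources):
    """L1 instance of the reconstruction error L in Eq. (1).

    mix:     (batch, channels, time) mixture waveform x
    sources: (batch, 4, channels, time) ground-truth tensor S
    """
    estimates = model(mix)            # (batch, 4, channels, time)
    return F.l1_loss(estimates, sources)

# Hypothetical usage with any model mapping the mixture to 4 source estimates:
# loss = separation_loss(model, mix, sources)
# loss.backward()
```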
As presented in the next section, most methods to solve (1) learn a mask $\sigma_i$ per source on the mixture spectrogram $X := \mathrm{STFT}(x)$. The estimated sources are then $\hat{s}_i := \mathrm{ISTFT}(\sigma_i X)$. The mask $\sigma_i$ can either be a binary mask valued in $\{0, 1\}$ or a soft assignment valued in $[0, 1]$ (a minimal code sketch of this masking pipeline is given after the two points below). Those methods are state-of-the-art and perform very well without requiring large models. However, they come with two limitations:
There is no reason for $\sigma_i X$ to be a real spectrogram (i.e., one obtained from a real signal). In that case the ISTFT step will perform a projection step that is not accounted for in the training loss and could result in artifacts.
Such methods do not try to model the phase but reuse that of the input mixture. Let us imagine that a guitar plays with a singer at the same pitch, but the singer is doing a slight vibrato, i.e., a small modulation of the pitch. This modulation will impact the spectrogram phase, as the derivative of the phase is the instantaneous frequency. Let us say both the singer and the guitar have the same intensity; then the ideal mask would be 0.5 for each. However, as we reuse the original phase for each source, the vibrato from the singer would also be applied to the guitar. While this could be considered a corner case, its existence is a motivation for the search of an alternative.
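For reference, the masking pipeline mentioned before these two points can be sketched as follows; the soft mask built from per-source magnitude estimates is one common choice and only illustrates the STFT/mask/ISTFT structure (and the phase reuse discussed above), not the exact masks of the cited methods.

```python
import torch

def soft_mask_separate(mix, source_mags, n_fft=2048, hop=512):
    """Illustrative spectrogram masking: mask the mixture STFT and invert.

    mix:         (time,) mono mixture waveform
    source_mags: list of (freq, frames) magnitude estimates, one per source
    """
    window = torch.hann_window(n_fft)
    X = torch.stft(mix, n_fft, hop_length=hop, window=window, return_complex=True)
    total = sum(source_mags) + 1e-8
    estimates = []
    for mag in source_mags:
        mask = mag / total                 # soft assignment valued in [0, 1]
        masked = mask * X                  # reuses the mixture phase
        estimates.append(torch.istft(masked, n_fft, hop_length=hop,
                                     window=window, length=mix.shape[-1]))
    return estimates
```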
Learning a model from/to the waveform could allow us to lift some of the aforementioned limitations. Because a waveform is directly generated, the training loss is end-to-end, with no extra synthesis step that could add artifacts, which solves the first point above. As for the second point, it is unknown whether any model could succeed in separating such a pathological case. In the fields of speech or music generation, direct waveform synthesis has replaced spectrogram based methods [34, 20, 7]. When doing generation without an input signal $x$, the first point is more problematic. Indeed, there is no input phase to reuse and the inversion of a power spectrogram will introduce significant artifacts [8]. Those successes were also made possible by the development of very large scale datasets (30 GB for the NSynth dataset [8]). In comparison, the standard MusDB dataset is only a few GB. This explains, at least partially, the worse performance of waveform methods for source separation [29].
In this paper we aim at taking waveform based methods one step closer to spectrogram methods. We contribute a simple model architecture inspired by previous work in source separation from the waveform and audio synthesis. We show that this model outperforms the previous state of the art in the waveform domain. Given the limited data available, we further refine the performance of our model by using a novel semi-supervised data augmentation scheme that allows us to leverage 2,000 unlabeled songs.
2 Related Work
A first category of methods for supervised music source separation works on power spectrograms. They predict a power spectrogram for each source and reuse the phase from the input mixture to synthesise individual waveforms. Traditional methods have mostly focused on blind (unsupervised) source separation. Non-negative matrix factorization techniques [26] model the power spectrum as a weighted sum of a learnt spectral dictionary, whose elements can then be grouped into individual sources. Independent component analysis [12] relies on independence assumptions and multiple microphones to separate the sources. Learning a soft/binary mask over power spectrograms has been done using either HMM-based prediction [25] or segmentation techniques [3].
With the development of deep learning, fully supervised methods have gained momentum. Initial work was performed on speech source separation [9], then for music using simple fully connected networks over a few spectrogram frames [32], LSTMs [33], or multi-scale convolutional / recurrent networks [18, 30, 31]. State-of-the-art performance is obtained with those models when trained with extra labeled data. We show that our model architecture combined with our semi-supervised scheme can provide performance almost on par, while being trained on 5 times less labeled data.
(a) Demucs architecture with the mixture waveform as input and the four source estimates as output. Arrows represent U-Net connections.
(b) Detailed view of the layers: Decoder$_i$ on the top and Encoder$_i$ on the bottom. Arrows represent connections to other parts of the model.
Figure 1: Demucs complete architecture on the left, with detailed representation of the encoder and decoder layers on the right. Key novelties compared to the previous Wave-U-Net are the GLU activation in the encoder and decoder, the bidirectional LSTM in between, and the exponentially growing number of channels, allowed by the stride of 4 in all convolutions.
On the other hand, working directly on the waveform only became possible with deep learning models. A Wavenet-like but regression based approach was first used for speech denoising [23] and then adapted to source separation [19]. Concurrently, a convolutional network with a U-Net structure called Wave-U-Net was used first on spectrograms [14] and then adapted to the waveform domain [28]. Those methods perform significantly worse than the spectrogram ones, as shown in the latest SiSec Mus source separation evaluation campaign [29]. As shown in Section 5, we outperform Wave-U-Net by a large margin with our architecture alone.
In [21], the problem of semi-supervised source separation is tackled for two-source separation where a dataset of mixtures and unaligned isolated examples of source 1, but not source 2, is available. Using specifically crafted adversarial losses, the authors manage to learn a separation model. In [11], the case of blind, i.e., completely unsupervised, source separation is covered, combining adversarial losses with a remixing trick similar in spirit to our unlabeled data remixing presented in Section 4. Both papers differ from our own setup, as they assume that isolated audio is completely missing for some or all sources. Finally, when extra isolated sources are available, previous work showed that it was possible to leverage them with adversarial losses without using them to generate new mixtures [27]. Unfortunately, extra isolated sources are exactly the kind of data that is hard to come by. As far as we know, no previous work tried to leverage unlabeled songs in order to improve supervised source separation performance. Besides, most previous work relied on adversarial losses, which can prove expensive, while our remixing trick allows for direct supervision of the training loss.
3 Model Architecture
Our network architecture is a blend of ideas from the SING architecture [7] developed for music note synthesis and from Wave-U-Net. We reuse the synthesis with large strides and a large number of channels, as well as the combination of an LSTM and convolutional layers from SING, while retaining the U-Net [24] structure of Wave-U-Net. The model is composed of a convolutional encoder, an LSTM and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections. The model takes a stereo mixture $x$ as input and outputs a stereo estimate $\hat{s}_i$ for each source. Similarly to other work in generation in both image [16, 15] and sound [7], we do not use batch normalization [13], as our early experiments showed that it was detrimental to the model performance.
The encoder is composed of $L$ stacked layers numbered from 1 to $L$. Layer $i$ is composed of a convolution with kernel size $K$, stride $S = 4$, $C_{i-1}$ input channels, $C_i$ output channels and ReLU activation, followed by a $1 \times 1$ convolution with GLU activation [6]. As the GLU outputs $C_i$ channels with $2 C_i$ channels as input, we double the number of channels in the $1 \times 1$ convolution. We define $C_0$ as the number of channels in the input mixture and $C_1$ as the initial number of channels for our model. For $i \in \{2, \ldots, L\}$ we take $C_i := 2 C_{i-1}$, so that the final number of channels is $C_L = 2^{L-1} C_1$. We then use a bidirectional LSTM with 2 layers and a hidden size of $C_L$. The LSTM outputs $2 C_L$ channels per time position. We use a $1 \times 1$ convolution with ReLU activation to take that number down to $C_L$.
The decoder is almost the symmetric of the encoder. It is composed of $L$ layers numbered in reverse order from $L$ to 1. The $i$-th layer starts with a convolution with kernel size 3 and stride 1, $C_i$ input/output channels and a ReLU activation. We concatenate its result with the output of the $i$-th layer of the encoder to form a U-Net and take the number of channels back to $C_i$ using a $1 \times 1$ convolution with GLU activation. Finally, we use a transposed convolution with kernel size $K$ and stride $S = 4$, $C_{i-1}$ outputs and ReLU activation. For the final layer, we instead output $4 C_0$ channels (one estimate of $C_0$ channels per source) and do not use any activation function.
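The PyTorch sketch below summarizes the encoder/LSTM/decoder structure just described. The depth, kernel size and channel widths are illustrative placeholders (only the stride of 4 is stated in Figure 1), and the ordering of the skip concatenation inside each decoder layer is slightly simplified.

```python
import math
import torch
from torch import nn

class DemucsLike(nn.Module):
    """Sketch of the described encoder/LSTM/decoder; depth, kernel size and
    channel widths are placeholders, not the paper's exact hyper-parameters."""

    def __init__(self, sources=4, audio_channels=2, channels=64,
                 depth=4, kernel=8, stride=4):
        super().__init__()
        self.depth, self.kernel, self.stride = depth, kernel, stride
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        in_ch, ch = audio_channels, channels
        for i in range(depth):
            # Encoder layer: strided conv + ReLU, then 1x1 conv + GLU (halves channels).
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1)))
            # Decoder layer (built in reverse order): takes the concatenation of the
            # decoder path and the matching skip (2 * ch channels), then upsamples
            # with a transposed convolution; the last layer outputs all sources.
            out_ch = sources * audio_channels if i == 0 else in_ch
            decode = [nn.Conv1d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(2 * ch, 2 * ch, 1), nn.GLU(dim=1),
                      nn.ConvTranspose1d(ch, out_ch, kernel, stride)]
            if i > 0:
                decode.append(nn.ReLU())  # no activation after the final layer
            self.decoder.insert(0, nn.Sequential(*decode))
            in_ch, ch = ch, 2 * ch
        # Bidirectional LSTM on the deepest representation, then 1x1 conv back to in_ch.
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2, bidirectional=True,
                            batch_first=True)
        self.lstm_proj = nn.Sequential(nn.Conv1d(2 * in_ch, in_ch, 1), nn.ReLU())

    def valid_length(self, length):
        """Smallest length >= `length` for which all strided (de)convolutions line up."""
        for _ in range(self.depth):
            length = max(math.ceil((length - self.kernel) / self.stride) + 1, 1)
        for _ in range(self.depth):
            length = (length - 1) * self.stride + self.kernel
        return length

    def forward(self, mix):
        x, skips = mix, []
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)
        x, _ = self.lstm(x.permute(0, 2, 1))
        x = self.lstm_proj(x.permute(0, 2, 1))
        for decode in self.decoder:
            x = decode(torch.cat([x, skips.pop()], dim=1))
        return x.view(mix.shape[0], -1, mix.shape[1], x.shape[-1])
```

For an input mixture of shape (batch, 2, T), padding T with zeros up to `model.valid_length(T)` makes all strided operations line up; the output then has shape (batch, 4, 2, T').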
Weights rescaling The weights of a convolutional layer in a deep learning model are usually initialized in a way that accounts for the number of input channels and the receptive field of the convolution (i.e., the fan in), as introduced by He et al. [10]. The initial weights of a convolution will roughly scale as $1/\sqrt{K C_{in}}$ where $K$ is the kernel size and $C_{in}$ the number of input channels. For instance, the standard deviation after initialization of the weights of the first layer of our encoder is about 0.2, while that of the last layer is 0.01. Modern optimizers such as Adam [17] normalize the gradient update per coordinate so that, on average, each weight will receive updates of the same magnitude. Thus, if we want to take a learning rate large enough to tune the weights of the first layer, it will most likely be too large for the last layer.
In order to remedy this problem, we use a trick that is equivalent to using specific learning rates per layer. Let us denote $w$ the weights at initialization used to compute the convolution $w * x$. We take $\alpha := \mathrm{std}(w) / a$, where $a$ is a reference scale. We replace $w$ by $w' := w / \sqrt{\alpha}$ and the output of the convolution by $\sqrt{\alpha}\,(w' * x)$, so that the output of the layer is unchanged. This is similar to the equalized learning rate trick used for image generation with GANs [16]. We observed both a faster decay of the training loss and convergence to a better optimum when using the weight rescaling trick, see Section 5.3. Optimal performance was obtained for a well-chosen reference level $a$. We also tried rescaling the weights by $\alpha$ rather than $\sqrt{\alpha}$, however this made the training loss diverge.
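A minimal sketch of this rescaling trick, assuming it is applied to each convolution right after initialization; the wrapper name and the default value of the reference scale below are placeholders, not the paper's choices.

```python
import math
import torch
from torch import nn

class RescaledConv1d(nn.Module):
    """Weight rescaling sketch: with alpha = std(w) / a, store w' = w / sqrt(alpha)
    and multiply the output by sqrt(alpha), so the initial layer output is unchanged
    while the optimizer now sees weights of comparable scale in every layer."""

    def __init__(self, conv: nn.Conv1d, reference: float = 0.1):  # placeholder value of a
        super().__init__()
        alpha = conv.weight.std().item() / reference
        self.scale = math.sqrt(alpha)
        with torch.no_grad():
            conv.weight.div_(self.scale)
            if conv.bias is not None:
                conv.bias.div_(self.scale)
        self.conv = conv

    def forward(self, x):
        return self.scale * self.conv(x)

# Example: wrap a freshly initialized convolution.
conv = RescaledConv1d(nn.Conv1d(256, 512, kernel_size=8, stride=4))
```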
Synthesis vs. filtering Let us denote $E_i$ the output of the $i$-th layer of the encoder and $D_i$ the output of the $i$-th layer of the decoder. Wave-U-Net takes $D_{i+1}$ and upsamples it using linear interpolation. It then concatenates it with $E_i$ (with $D_{L+1} := E_L$) and applies a convolution with a stride of 1 to obtain $D_i$. Thus, it works by taking a coarse representation, upsampling it, adding back the fine representation from $E_i$ and filtering it to separate channels.
On the other hand, our model takes $D_{i+1}$, concatenates it with $E_i$ and uses a transposed convolution to obtain $D_i$. A transposed convolution is different from a linear interpolation upsampling. With a sufficient number of input channels, it can generate any signal, while a linear upsampling will generate a signal with a higher sampling rate but no high frequencies. High frequencies are injected using a U-Net skip connection. Separation is performed by applying various filters to the obtained signal (i.e., a convolution with a stride of 1).
Thus, Wave-U-Net generates its output by iteratively upsampling, adding back the high frequency part of the signal from the matching encoder output (or from the input for the last decoder layer) and filtering. On the other hand, our approach consists in a direct synthesis. The main benefit of synthesis is that we can use a relatively large stride in the decoder, thus speeding up the computations and allowing for a larger number of channels. We believe this larger number of channels is one of the reasons for the better performance of our model, as shown in Section 5.
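The toy comparison below illustrates the difference between the two upsampling paths; layer sizes are arbitrary and the snippet only shows the nature of the operations, not the exact layers of either model.

```python
import torch
from torch import nn
import torch.nn.functional as F

coarse = torch.randn(1, 512, 100)   # (batch, channels, time) coarse decoder input

# Wave-U-Net style: linear interpolation (no new frequency content), then a
# stride-1 convolution that filters the upsampled signal into fewer channels.
upsampled = F.interpolate(coarse, scale_factor=4, mode="linear", align_corners=False)
filtered = nn.Conv1d(512, 256, kernel_size=5, padding=2, stride=1)(upsampled)

# Demucs style: a strided transposed convolution synthesises the upsampled signal
# directly, so it can produce high frequencies on its own; the stride of 4 keeps
# the computation cheap even with many channels.
synthesised = nn.ConvTranspose1d(512, 256, kernel_size=8, stride=4)(coarse)

# Output lengths differ slightly at the edges, which does not matter here.
print(filtered.shape, synthesised.shape)
```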
4 Unlabeled Data Remixing
In order to leverage unlabeled songs, we propose to first train a classifier to detect the absence or presence of each source on small time frames, using a supervised train set for which we know the
Figure 2: Overall representation of our unlabeled data remixing pipeline. When we detect an excerpt of at least 5 seconds with one source silent, here the bass, we recombine it with a single bass sample from the training set. We can then provide strong supervision for the silent source, and weak supervision for the other 3, as we only know the ground truth for their sum.
contribution of each source. When we detect an audio excerpt $u$ with at least 5 seconds of silence for source $i$, we add it to a new set $U_i$. We can then mix an example $u \in U_i$ with a single source $s_i$ taken from the supervised train set in order to form a new mixture $x := u + s_i$, denoting $t_j$ the ground truth for this example (potentially unknown to us) for each source $j$. As the source $i$ is silent in $u$, we can provide strong supervision for source $i$, as we have $t_i = s_i$, and weak supervision for the other sources, as we have $\sum_{j \neq i} t_j = u$. The whole pipeline is represented in Figure 2.
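A sketch of the resulting supervision, assuming an L1 reconstruction error and a model returning the four estimates in a fixed order; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def remix_loss(model, excerpt, source, silent_index):
    """Loss for one remixed example x = u + s_i where source `silent_index`
    is silent in the unlabeled excerpt u.

    excerpt: (batch, channels, time) unlabeled excerpt u with source i silent
    source:  (batch, channels, time) isolated source s_i from the supervised set
    """
    mix = excerpt + source
    estimates = model(mix)                                        # (batch, 4, channels, time)
    strong = F.l1_loss(estimates[:, silent_index], source)        # t_i = s_i
    others = [j for j in range(estimates.shape[1]) if j != silent_index]
    weak = F.l1_loss(estimates[:, others].sum(dim=1), excerpt)    # sum_{j != i} t_j = u
    return strong + weak
```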
The motivation for this approach comes from our early experiments with the available supervised data, which showed a clear tendency to overfit when training our separation models. We first tried a completely unsupervised regularization: for instance, given an unlabeled track $x$, we want $\sum_i \hat{s}_i \approx x$ where $\hat{s}_i$ is the estimated source $i$. This proved too weak to improve performance. We then tried to detect passages with a single source present, however this proved too rare an event in Pop/Rock music: for the standard MusDB dataset presented in Section 5.1, the source other is alone 2.6% of the time, while the remaining sources are isolated even more rarely. Accounting for the fact that our model will never reach a perfect recall, this represents too little extractable data to be interesting. On the other hand, a source being silent happens quite often for the drums, the bass or the vocals. This time, other is the source that is silent the least often and the hardest to extract, as noted hereafter.
We first formalize our classification problem and then describe the extraction procedure. The use of the extracted data for training is detailed in Section 5.2.
4.1 Silent source detector
Given a mixture $x$ and its sources $s_i$, we define for all sources the relative volume $V_i := 10 \log_{10}\left(\|s_i\|^2 / \|x\|^2\right)$ and say that source $i$ is silent when $V_i \leq -30$. For instance, having $V_i = -20$ means that source $i$ is 100 times quieter than the mixture. Doing informal testing, we observe that a source with a relative volume between 0 and -10 will be perceived clearly, between -10 and -20 it will feel like a whisper, and it is almost silent between -20 and -30. A source with a relative volume under -30 is perceptually zero.
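A sketch of the relative volume computation as reconstructed above, assuming it is an energy ratio expressed in decibels.

```python
import torch

def relative_volume(source, mix, eps=1e-12):
    """Relative volume V_i in dB of a source with respect to the mixture,
    assuming an energy ratio (a reconstruction; see the definition above)."""
    return 10 * torch.log10(source.pow(2).sum() / (mix.pow(2).sum() + eps) + eps)

def is_silent(source, mix, threshold=-30.0):
    """A source is considered silent when its relative volume is below -30 dB."""
    return relative_volume(source, mix).item() <= threshold
```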
We can then train a classifier to estimate $P(V_i \leq -30 \mid x)$, the probability that source $i$ is silent given the input mixture $x$. Given the limited amount of supervised data, we use a wavelet scattering transform [1] of order two as input features rather than the raw waveform. This transformation is computed using the Kymatio package [2]. The model is then composed of convolutional layers with max pooling and batch normalization, and a final LSTM that produces an estimate for every window of 0.64 seconds with a stride of 64 ms. We detail the architecture in Section 2 of the supplementary material. We have observed that using a downsampled