Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed
Alexandre Défossez
Facebook AI Research
INRIA / École Normale Supérieure
PSL Research University
Paris, France
defossez@fb.com

Nicolas Usunier
Facebook AI Research
Paris, France
usunier@fb.com

Léon Bottou
Facebook AI Research
New York, USA
leonb@fb.com

Francis Bach
INRIA / École Normale Supérieure
PSL Research University
Paris, France
francis.bach@ens.fr
Abstract
We study the problem of source separation for music using deep learning with four known sources: drums, bass, vocals and other accompaniments. State-of-the-art approaches predict soft masks over mixture spectrograms, while methods working on the waveform are lagging behind as measured on the standard MusDB [22] benchmark. Our contribution is twofold. (i) We introduce a simple convolutional and recurrent model that outperforms the state-of-the-art model on waveforms, that is, Wave-U-Net [28], by 1.6 points of SDR (signal to distortion ratio). (ii) We propose a new scheme to leverage unlabeled music. We train a first model to extract parts with at least one silent source from unlabeled tracks, for instance parts without bass. We remix such an extract with a bass line taken from the supervised dataset to form a new weakly supervised training example. Combining our architecture and scheme, we show that waveform methods can play in the same ballpark as spectrogram ones.
1 Introduction
Cherry first noticed the "cocktail party effect" [5]: how the human brain is able to separate a single conversation out of the surrounding noise in a room full of people chatting. Bregman later tried to understand how the brain is able to analyse a complex auditory signal and segment it into higher level streams. His framework for auditory scene analysis [4] spawned its computational counterpart, which tries to reproduce or model the accomplishments of the brain with algorithmic means [36].
When producing music, recordings of individual instruments called stems are arranged together and mastered into the final song. The goal of source separation is then to recover those individual stems from the mixed signal. Unlike the cocktail party problem, there is not a single source of interest to differentiate from an unrelated background noise, but instead a wide variety of tones and timbres playing in a coordinated way. As part of the SiSec Mus evaluation campaign for music separation [29], a choice was made to regroup those individual stems into 4 broad categories: (1) drums, (2) bass, (3) other, (4) vocals.
Each source is represented by a waveform $s_i \in \mathbb{R}^{C \times T}$ where $C$ is the number of channels (1 for mono, 2 for stereo) and $T$ the number of samples. We define $S := (s_i)_i$ the concatenation of sources in a tensor of size $4 \times C \times T$ and $x := \sum_i s_i$ the mixture. We aim at training a model $g$ that minimises

$$\min_{\theta} \sum_{x \in D} L(g_\theta(x), S) \qquad (1)$$

for some dataset $D$, reconstruction error $L$, model architecture $g$ with 4 outputs, and model weights $\theta$.
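To make the objective concrete, here is a minimal PyTorch sketch of (1) with an L1 reconstruction error; the model and tensor shapes are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def separation_loss(model, mix, sources):
    """L1 instance of the reconstruction error L in Eq. (1).

    mix:     (batch, channels, time) mixture waveform x
    sources: (batch, 4, channels, time) ground-truth tensor S
    """
    estimates = model(mix)            # (batch, 4, channels, time)
    return F.l1_loss(estimates, sources)

# Hypothetical usage with any model mapping the mixture to 4 source estimates:
# loss = separation_loss(model, mix, sources)
# loss.backward()
```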
As presented in the next section, most methods to solve (1) learn a mask $\sigma_i$ per source on the mixture spectrogram $X := \mathrm{STFT}(x)$. The estimated sources are then $\hat{s}_i := \mathrm{ISTFT}(\sigma_i X)$. The mask $\sigma_i$ can either be a binary mask valued in $\{0, 1\}$ or a soft assignment valued in $[0, 1]$ (a minimal code sketch of this masking pipeline is given after the two points below). Those methods are state-of-the-art and perform very well without requiring large models. However, they come with two limitations:
There is no reason for $\sigma_i X$ to be a real spectrogram (i.e., one obtained from a real signal). In that case the ISTFT step will perform a projection step that is not accounted for in the training loss and could result in artifacts.
Such methods do not try to model the phase but reuse that of the input mixture. Let us imagine that a guitar plays with a singer at the same pitch, but the singer is doing a slight vibrato, i.e., a small modulation of the pitch. This modulation will impact the spectrogram phase, as the derivative of the phase is the instantaneous frequency. Let us say both the singer and the guitar have the same intensity; then the ideal mask would be 0.5 for each. However, as we reuse the original phase for each source, the vibrato from the singer would also be applied to the guitar. While this could be considered a corner case, its existence is a motivation for the search of an alternative.
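For reference, the masking pipeline mentioned before these two points can be sketched as follows; the soft mask built from per-source magnitude estimates is one common choice and only illustrates the STFT/mask/ISTFT structure (and the phase reuse discussed above), not the exact masks of the cited methods.

```python
import torch

def soft_mask_separate(mix, source_mags, n_fft=2048, hop=512):
    """Illustrative spectrogram masking: mask the mixture STFT and invert.

    mix:         (time,) mono mixture waveform
    source_mags: list of (freq, frames) magnitude estimates, one per source
    """
    window = torch.hann_window(n_fft)
    X = torch.stft(mix, n_fft, hop_length=hop, window=window, return_complex=True)
    total = sum(source_mags) + 1e-8
    estimates = []
    for mag in source_mags:
        mask = mag / total                 # soft assignment valued in [0, 1]
        masked = mask * X                  # reuses the mixture phase
        estimates.append(torch.istft(masked, n_fft, hop_length=hop,
                                     window=window, length=mix.shape[-1]))
    return estimates
```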
Learning a model from/to the waveform could allow us to lift some of the aforementioned limitations. Because a waveform is directly generated, the training loss is end-to-end, with no extra synthesis step that could add artifacts, which solves the first point above. As for the second point, it is unknown whether any model could succeed in separating such a pathological case. In the fields of speech or music generation, direct waveform synthesis has replaced spectrogram based methods [34, 20, 7]. When doing generation without an input signal $x$, the first point is more problematic. Indeed, there is no input phase to reuse and the inversion of a power spectrogram will introduce significant artifacts [8]. Those successes were also made possible by the development of very large scale datasets (30 GB for the NSynth dataset [8]). In comparison, the standard MusDB dataset is only a few GB. This explains, at least partially, the worse performance of waveform methods for source separation [29].
In this paper we aim at taking waveform based methods one step closer to spectrogram methods. We contribute a simple model architecture inspired by previous work in source separation from the waveform and audio synthesis. We show that this model outperforms the previous state of the art in the waveform domain. Given the limited data available, we further refine the performance of our model by using a novel semi-supervised data augmentation scheme that allows us to leverage 2,000 unlabeled songs.
2 Related Work
A first category of methods for supervised music source separation works on power spectrograms. They predict a power spectrogram for each source and reuse the phase from the input mixture to synthesise individual waveforms. Traditional methods have mostly focused on blind (unsupervised) source separation. Non-negative matrix factorization techniques [26] model the power spectrum as a weighted sum of a learnt spectral dictionary, whose elements can then be grouped into individual sources. Independent component analysis [12] relies on independence assumptions and multiple microphones to separate the sources. Learning a soft/binary mask over power spectrograms has been done using either HMM-based prediction [25] or segmentation techniques [3].
With the development of deep learning, fully supervised methods have gained momentum. Initial work was performed on speech source separation [9], then for music using simple fully connected networks over a few spectrogram frames [32], LSTMs [33], or multi-scale convolutional / recurrent networks [18, 30, 31]. State-of-the-art performance is obtained with those models when trained with extra labeled data. We show that our model architecture combined with our semi-supervised scheme can provide performance almost on par, while being trained on 5 times less labeled data.
(a) Demucs architecture with the mixture waveform as input and the four source estimates as output. Arrows represent U-Net connections.
(b) Detailed view of the layers: Decoder$_i$ on the top and Encoder$_i$ on the bottom. Arrows represent connections to other parts of the model.
Figure 1: Demucs complete architecture on the left, with detailed representation of the encoder and decoder layers on the right. Key novelties compared to the previous Wave-U-Net are the GLU activation in the encoder and decoder, the bidirectional LSTM in between, and the exponentially growing number of channels, allowed by the stride of 4 in all convolutions.
On the other hand, working directly on the waveform only became possible with deep learning models. A Wavenet-like but regression based approach was first used for speech denoising [23] and then adapted to source separation [19]. Concurrently, a convolutional network with a U-Net structure called Wave-U-Net was used first on spectrograms [14] and then adapted to the waveform domain [28]. Those methods perform significantly worse than the spectrogram ones, as shown in the latest SiSec Mus source separation evaluation campaign [29]. As shown in Section 5, we outperform Wave-U-Net by a large margin with our architecture alone.
In [21], the problem of semi-supervised source separation is tackled for two-source separation where a dataset of mixtures and unaligned isolated examples of source 1, but not source 2, is available. Using specifically crafted adversarial losses, the authors manage to learn a separation model. In [11], the case of blind, i.e., completely unsupervised, source separation is covered, combining adversarial losses with a remixing trick similar in spirit to our unlabeled data remixing presented in Section 4. Both papers differ from our own setup, as they assume that isolated audio is completely missing for some or all sources. Finally, when extra isolated sources are available, previous work showed that it was possible to leverage them with adversarial losses without using them to generate new mixtures [27]. Unfortunately, extra isolated sources are exactly the kind of data that is hard to come by. As far as we know, no previous work tried to leverage unlabeled songs in order to improve supervised source separation performance. Besides, most previous work relied on adversarial losses, which can prove expensive, while our remixing trick allows for direct supervision of the training loss.
3 Model Architecture
Our network architecture is a blend of ideas from the SING architecture [7] developed for music note synthesis and from Wave-U-Net. We reuse the synthesis with large strides and a large number of channels, as well as the combination of an LSTM and convolutional layers from SING, while retaining the U-Net [24] structure of Wave-U-Net. The model is composed of a convolutional encoder, an LSTM and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections. The model takes a stereo mixture $x$ as input and outputs a stereo estimate $\hat{s}_i$ for each source. Similarly to other work in generation in both image [16, 15] and sound [7], we do not use batch normalization [13], as our early experiments showed that it was detrimental to the model performance.
The encoder is composed of $L$ stacked layers numbered from 1 to $L$. Layer $i$ is composed of a convolution with kernel size $K$, stride $S = 4$, $C_{i-1}$ input channels, $C_i$ output channels and ReLU activation, followed by a $1 \times 1$ convolution with GLU activation [6]. As the GLU outputs $C_i$ channels with $2 C_i$ channels as input, we double the number of channels in the $1 \times 1$ convolution. We define $C_0$ as the number of channels in the input mixture and $C_1$ as the initial number of channels for our model. For $i \in \{2, \ldots, L\}$ we take $C_i := 2 C_{i-1}$, so that the final number of channels is $C_L = 2^{L-1} C_1$. We then use a bidirectional LSTM with 2 layers and a hidden size of $C_L$. The LSTM outputs $2 C_L$ channels per time position. We use a $1 \times 1$ convolution with ReLU activation to take that number down to $C_L$.
The decoder is almost the symmetric of the encoder. It is composed of $L$ layers numbered in reverse order from $L$ to 1. The $i$-th layer starts with a convolution with kernel size 3 and stride 1, $C_i$ input/output channels and a ReLU activation. We concatenate its result with the output of the $i$-th layer of the encoder to form a U-Net and take the number of channels back to $C_i$ using a $1 \times 1$ convolution with GLU activation. Finally, we use a transposed convolution with kernel size $K$ and stride $S = 4$, $C_{i-1}$ outputs and ReLU activation. For the final layer, we instead output $4 C_0$ channels (one estimate of $C_0$ channels per source) and do not use any activation function.
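The PyTorch sketch below summarizes the encoder/LSTM/decoder structure just described. The depth, kernel size and channel widths are illustrative placeholders (only the stride of 4 is stated in Figure 1), and the ordering of the skip concatenation inside each decoder layer is slightly simplified.

```python
import math
import torch
from torch import nn

class DemucsLike(nn.Module):
    """Sketch of the described encoder/LSTM/decoder; depth, kernel size and
    channel widths are placeholders, not the paper's exact hyper-parameters."""

    def __init__(self, sources=4, audio_channels=2, channels=64,
                 depth=4, kernel=8, stride=4):
        super().__init__()
        self.depth, self.kernel, self.stride = depth, kernel, stride
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        in_ch, ch = audio_channels, channels
        for i in range(depth):
            # Encoder layer: strided conv + ReLU, then 1x1 conv + GLU (halves channels).
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1)))
            # Decoder layer (built in reverse order): takes the concatenation of the
            # decoder path and the matching skip (2 * ch channels), then upsamples
            # with a transposed convolution; the last layer outputs all sources.
            out_ch = sources * audio_channels if i == 0 else in_ch
            decode = [nn.Conv1d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(2 * ch, 2 * ch, 1), nn.GLU(dim=1),
                      nn.ConvTranspose1d(ch, out_ch, kernel, stride)]
            if i > 0:
                decode.append(nn.ReLU())  # no activation after the final layer
            self.decoder.insert(0, nn.Sequential(*decode))
            in_ch, ch = ch, 2 * ch
        # Bidirectional LSTM on the deepest representation, then 1x1 conv back to in_ch.
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2, bidirectional=True,
                            batch_first=True)
        self.lstm_proj = nn.Sequential(nn.Conv1d(2 * in_ch, in_ch, 1), nn.ReLU())

    def valid_length(self, length):
        """Smallest length >= `length` for which all strided (de)convolutions line up."""
        for _ in range(self.depth):
            length = max(math.ceil((length - self.kernel) / self.stride) + 1, 1)
        for _ in range(self.depth):
            length = (length - 1) * self.stride + self.kernel
        return length

    def forward(self, mix):
        x, skips = mix, []
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)
        x, _ = self.lstm(x.permute(0, 2, 1))
        x = self.lstm_proj(x.permute(0, 2, 1))
        for decode in self.decoder:
            x = decode(torch.cat([x, skips.pop()], dim=1))
        return x.view(mix.shape[0], -1, mix.shape[1], x.shape[-1])
```

For an input mixture of shape (batch, 2, T), padding T with zeros up to `model.valid_length(T)` makes all strided operations line up; the output then has shape (batch, 4, 2, T').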
Weights rescaling The weights of a convolutional layer in a deep learning model are usually initialized in a way that accounts for the number of input channels and the receptive field of the convolution (i.e., the fan in), as introduced by He et al. [10]. The initial weights of a convolution will roughly scale as $1/\sqrt{K C_{in}}$ where $K$ is the kernel size and $C_{in}$ the number of input channels. For instance, the standard deviation after initialization of the weights of the first layer of our encoder is about 0.2, while that of the last layer is 0.01. Modern optimizers such as Adam [17] normalize the gradient update per coordinate so that, on average, each weight will receive updates of the same magnitude. Thus, if we want to take a learning rate large enough to tune the weights of the first layer, it will most likely be too large for the last layer.
In order to remedy this problem, we use a trick that is equivalent to using specific learning rates per layer. Let us denote $w$ the weights at initialization used to compute the convolution $w * x$. We take $\alpha := \mathrm{std}(w) / a$, where $a$ is a reference scale. We replace $w$ by $w' := w / \sqrt{\alpha}$ and the output of the convolution by $\sqrt{\alpha}\,(w' * x)$, so that the output of the layer is unchanged. This is similar to the equalized learning rate trick used for image generation with GANs [16]. We observed both a faster decay of the training loss and convergence to a better optimum when using the weight rescaling trick, see Section 5.3. Optimal performance was obtained for a well-chosen reference level $a$. We also tried rescaling the weights by $\alpha$ rather than $\sqrt{\alpha}$, however this made the training loss diverge.
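A minimal sketch of this rescaling trick, assuming it is applied to each convolution right after initialization; the wrapper name and the default value of the reference scale below are placeholders, not the paper's choices.

```python
import math
import torch
from torch import nn

class RescaledConv1d(nn.Module):
    """Weight rescaling sketch: with alpha = std(w) / a, store w' = w / sqrt(alpha)
    and multiply the output by sqrt(alpha), so the initial layer output is unchanged
    while the optimizer now sees weights of comparable scale in every layer."""

    def __init__(self, conv: nn.Conv1d, reference: float = 0.1):  # placeholder value of a
        super().__init__()
        alpha = conv.weight.std().item() / reference
        self.scale = math.sqrt(alpha)
        with torch.no_grad():
            conv.weight.div_(self.scale)
            if conv.bias is not None:
                conv.bias.div_(self.scale)
        self.conv = conv

    def forward(self, x):
        return self.scale * self.conv(x)

# Example: wrap a freshly initialized convolution.
conv = RescaledConv1d(nn.Conv1d(256, 512, kernel_size=8, stride=4))
```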
Synthesis vs. filtering Let us denote $E_i$ the output of the $i$-th layer of the encoder and $D_i$ the output of the $i$-th layer of the decoder. Wave-U-Net takes $D_{i+1}$ and upsamples it using linear interpolation. It then concatenates it with $E_i$ (with $D_{L+1} := E_L$) and applies a convolution with a stride of 1 to obtain $D_i$. Thus, it works by taking a coarse representation, upsampling it, adding back the fine representation from $E_i$ and filtering it to separate channels.
On the other hand, our model takes $D_{i+1}$, concatenates it with $E_i$ and uses a transposed convolution to obtain $D_i$. A transposed convolution is different from a linear interpolation upsampling. With a sufficient number of input channels, it can generate any signal, while a linear upsampling will generate a signal with a higher sampling rate but no high frequencies. High frequencies are injected using a U-Net skip connection. Separation is performed by applying various filters to the obtained signal (i.e., a convolution with a stride of 1).
Thus, Wave-U-Net generates its output by iteratively upsampling, adding back the high frequency part of the signal from the matching encoder output (or from the input for the last decoder layer) and filtering. On the other hand, our approach consists in a direct synthesis. The main benefit of synthesis is that we can use a relatively large stride in the decoder, thus speeding up the computations and allowing for a larger number of channels. We believe this larger number of channels is one of the reasons for the better performance of our model, as shown in Section 5.
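The toy comparison below illustrates the difference between the two upsampling paths; layer sizes are arbitrary and the snippet only shows the nature of the operations, not the exact layers of either model.

```python
import torch
from torch import nn
import torch.nn.functional as F

coarse = torch.randn(1, 512, 100)   # (batch, channels, time) coarse decoder input

# Wave-U-Net style: linear interpolation (no new frequency content), then a
# stride-1 convolution that filters the upsampled signal into fewer channels.
upsampled = F.interpolate(coarse, scale_factor=4, mode="linear", align_corners=False)
filtered = nn.Conv1d(512, 256, kernel_size=5, padding=2, stride=1)(upsampled)

# Demucs style: a strided transposed convolution synthesises the upsampled signal
# directly, so it can produce high frequencies on its own; the stride of 4 keeps
# the computation cheap even with many channels.
synthesised = nn.ConvTranspose1d(512, 256, kernel_size=8, stride=4)(coarse)

# Output lengths differ slightly at the edges, which does not matter here.
print(filtered.shape, synthesised.shape)
```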
4 Unlabeled Data Remixing
In order to leverage unlabeled songs, we propose to first train a classifier to detect the absence or presence of each source on small time frames, using a supervised train set for which we know the
Figure 2: Overall representation of our unlabeled data remixing pipeline. When we detect an excerpt of at least 5 seconds with one source silent, here the bass, we recombine it with a single bass sample from the training set. We can then provide strong supervision for the silent source, and weak supervision for the other 3, as we only know the ground truth for their sum.
contribution of each source. When we detect an audio excerpt $u$ with at least 5 seconds of silence for source $i$, we add it to a new set $U_i$. We can then mix an example $u \in U_i$ with a single source $s_i$ taken from the supervised train set in order to form a new mixture $x := u + s_i$, denoting $t_j$ the ground truth for this example (potentially unknown to us) for each source $j$. As the source $i$ is silent in $u$, we can provide strong supervision for source $i$, as we have $t_i = s_i$, and weak supervision for the other sources, as we have $\sum_{j \neq i} t_j = u$. The whole pipeline is represented in Figure 2.
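A sketch of the resulting supervision, assuming an L1 reconstruction error and a model returning the four estimates in a fixed order; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def remix_loss(model, excerpt, source, silent_index):
    """Loss for one remixed example x = u + s_i where source `silent_index`
    is silent in the unlabeled excerpt u.

    excerpt: (batch, channels, time) unlabeled excerpt u with source i silent
    source:  (batch, channels, time) isolated source s_i from the supervised set
    """
    mix = excerpt + source
    estimates = model(mix)                                        # (batch, 4, channels, time)
    strong = F.l1_loss(estimates[:, silent_index], source)        # t_i = s_i
    others = [j for j in range(estimates.shape[1]) if j != silent_index]
    weak = F.l1_loss(estimates[:, others].sum(dim=1), excerpt)    # sum_{j != i} t_j = u
    return strong + weak
```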
The motivation for this approach comes from our early experiments with the available supervised data, which showed a clear tendency to overfit when training our separation models. We first tried a completely unsupervised regularization: for instance, given an unlabeled track $x$, we want $\sum_i \hat{s}_i \approx x$ where $\hat{s}_i$ is the estimated source $i$. This proved too weak to improve performance. We then tried to detect passages with a single source present, however this proved too rare an event in Pop/Rock music: for the standard MusDB dataset presented in Section 5.1, the source other is alone 2.6% of the time, while the remaining sources are isolated even more rarely. Accounting for the fact that our model will never reach a perfect recall, this represents too little extractable data to be interesting. On the other hand, a source being silent happens quite often for the drums, the bass or the vocals. This time, other is the source that is silent the least often and the hardest to extract, as noted hereafter.
We first formalize our classification problem and then describe the extraction procedure. The use of the extracted data for training is detailed in Section 5.2.
4.1 Silent source detector
Given a mixture $x$ and its sources $s_i$, we define for all sources the relative volume $V_i := 10 \log_{10}\left(\|s_i\|^2 / \|x\|^2\right)$ and say that source $i$ is silent when $V_i \leq -30$. For instance, having $V_i = -20$ means that source $i$ is 100 times quieter than the mixture. Doing informal testing, we observe that a source with a relative volume between 0 and -10 will be perceived clearly, between -10 and -20 it will feel like a whisper, and it is almost silent between -20 and -30. A source with a relative volume under -30 is perceptually zero.
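A sketch of the relative volume computation as reconstructed above, assuming it is an energy ratio expressed in decibels.

```python
import torch

def relative_volume(source, mix, eps=1e-12):
    """Relative volume V_i in dB of a source with respect to the mixture,
    assuming an energy ratio (a reconstruction; see the definition above)."""
    return 10 * torch.log10(source.pow(2).sum() / (mix.pow(2).sum() + eps) + eps)

def is_silent(source, mix, threshold=-30.0):
    """A source is considered silent when its relative volume is below -30 dB."""
    return relative_volume(source, mix).item() <= threshold
```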
We can then train a classifier to estimate $P(V_i \leq -30 \mid x)$, the probability that source $i$ is silent given the input mixture $x$. Given the limited amount of supervised data, we use a wavelet scattering transform [1] of order two as input features rather than the raw waveform. This transformation is computed using the Kymatio package [2]. The model is then composed of convolutional layers with max pooling and batch normalization, and a final LSTM that produces an estimate for every window of 0.64 seconds with a stride of 64 ms. We detail the architecture in Section 2 of the supplementary material. We have observed that using a downsampled