NEXT-TDNN: MODERNIZING MULTI-SCALE TEMPORAL CONVOLUTION BACKBONE FOR SPEAKER VERIFICATION
Hyun-Jun Heo, Ui-Hyeop Shin, Ran Lee, YoungJu Cheon, Hyung-Min Park
Department of Electronic Engineering, Sogang University, Seoul, Republic of Korea
Hyundai Motor Company, Seoul, Republic of Korea
Abstract
In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to the Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNNs in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed from two separated sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN) in the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that NeXt-TDNN, with a modernized backbone block, significantly improves performance in speaker verification while reducing parameter size and inference time. We have released our code for future studies.
Index Terms— speaker recognition, speaker verification, TDNN, ConvNeXt, multi-scale
1. INTRODUCTION
With the rise of deep neural networks, the conventional hand-crafted embedding feature vector for speaker identity (i-vector) [1] was rapidly replaced with the DNN-based vector (d-vector) [2] in speaker verification. In particular, the x-vector [3] showed significantly improved performance through the use of the time delay neural network (TDNN), and various improvements on it have been studied [4, 5]. More recently, ECAPA-TDNN [6] was proposed, which improved the TDNN architecture and achieved state-of-the-art performance.
Fig. 1: Block diagram of the proposed NeXt-TDNN architecture.
To capture the spectral structure at multiple scales, it utilized Res2Net [7] with one-dimensional (1D) convolution as backbone layers. Additionally, squeeze-and-excitation (SE) blocks [8] were used for feature gating across global temporal contexts. Finally, it introduced a multi-layer feature aggregation (MFA) structure for utilizing shallow-layer features before temporal pooling. Consequently, ECAPA-TDNN has become a standard model for speaker verification tasks and has been used as a base model in mainstream works [9, 10, 11].
However, speech can also be regarded as a spectral two-dimensional (2D) visual feature. Therefore, for extracting a speaker embedding vector, a simple 2D ResNet-based architecture [12] has often been used, which also exhibited stable results for speaker verification [13, 14, 15, 16]. In traditional vision tasks, on the other hand, ConvNets including the conventional ResNet structure have been repeatedly improved and modernized after the advent of the vision Transformer [17]. ConvNeXt [18] is one of those improvements, which refers to the architecture of the Transformer. In a follow-up study, ConvNeXt-V2 [19] was presented with global response normalization (GRN), which emphasizes the channel dimension in spatially global contexts.
In this paper, we present a modernized backbone block for TDNNs inspired by the recent ConvNeXt structure. As shown in Fig. 1, we have re-designed the ConvNeXt block into a suitable substitute for the SE-Res2Net block, called the two-step ConvNeXt (TS-ConvNeXt) block. The TS-ConvNeXt block is composed of two sub-modules. The first is based on parallel 1D depth-wise convolution (DConv1D) layers with different scales, which we call the multi-scale convolution (MSC) module. Then, a separate feed-forward network (FFN) is placed, as in a Transformer structure. Additionally, we adopt GRN [19] in the FFN to enhance the channel contrast, which may replace the SE module in the SE-Res2Net block of ECAPA-TDNN. Based on the proposed TS-ConvNeXt block, we have developed the modernized NeXt-TDNN for extracting an improved speaker embedding vector.
2. NEXT-TDNN

2.1. MFA layer and ASP as temporal pooling
As shown in Fig. 1, the input feature is given as a mel-spectrogram of shape $F \times T$, where $T$ is the number of frames. It is first processed by a standard convolution layer with a kernel size of 4, which converts the spectral dimension $F$ into a latent channel dimension $C$. The resulting feature is then processed by three stages of TS-ConvNeXt blocks. To aggregate the outputs from all three stages, we utilized an MFA layer similarly to ECAPA-TDNN [6]. Specifically, the three stage outputs are concatenated along the channel dimension and processed by a 1D point-wise convolution (PConv1D) with an output dimension of $D$, followed by Layer Normalization (LN).
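As a rough illustration, the aggregation described above could be sketched in PyTorch as follows; the class name MFALayer and the default choice $D = 3C$ are our own, not the released implementation.

```python
import torch
import torch.nn as nn

class MFALayer(nn.Module):
    """Multi-layer feature aggregation sketch: concatenate the outputs of all
    stages along the channel axis, then mix them with a point-wise 1D
    convolution followed by Layer Normalization."""
    def __init__(self, channels: int, num_stages: int = 3, out_dim: int = None):
        super().__init__()
        out_dim = out_dim or num_stages * channels          # assume D = 3C
        self.pconv = nn.Conv1d(num_stages * channels, out_dim, kernel_size=1)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, stage_outputs):
        # each element of stage_outputs: (batch, C, T)
        h = torch.cat(stage_outputs, dim=1)                  # (batch, 3C, T)
        h = self.pconv(h)                                    # (batch, D, T)
        h = self.norm(h.transpose(1, 2)).transpose(1, 2)     # LN over channels
        return h

# toy usage: three stage outputs with C = 256 channels and T = 300 frames
stages = [torch.randn(2, 256, 300) for _ in range(3)]
print(MFALayer(256)(stages).shape)    # torch.Size([2, 768, 300])
```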
To extract an utterance-level embedding from the frame-level features, we utilized the ASP pooling layer [20] with channel-dependent attention values, following ECAPA-TDNN. When the output of the MFA layer is given as $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_T] \in \mathbb{R}^{D \times T}$, the ASP layer calculates attention values $\boldsymbol{\alpha}_t$ between 0 and 1 from these frame-level features and uses them to form a weighted mean and a weighted standard deviation vector as $\tilde{\boldsymbol{\mu}} = \sum_{t} \boldsymbol{\alpha}_t \odot \mathbf{h}_t$ and $\tilde{\boldsymbol{\sigma}} = \sqrt{\sum_{t} \boldsymbol{\alpha}_t \odot \mathbf{h}_t \odot \mathbf{h}_t - \tilde{\boldsymbol{\mu}} \odot \tilde{\boldsymbol{\mu}}}$, where $\odot$ denotes the Hadamard product. Then, the speaker embedding is extracted from these two statistics using a linear layer.
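The pooling step can be sketched as below; the attention-network width (128) and the final embedding size (192) are illustrative assumptions, while the weighted statistics follow the formulas above.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Channel-dependent attentive statistics pooling (ASP) sketch:
    per-frame, per-channel attention weights produce a weighted mean and
    a weighted standard deviation, which are concatenated."""
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden_dim, in_dim, kernel_size=1),
        )

    def forward(self, h):                                   # h: (batch, D, T)
        alpha = torch.softmax(self.attention(h), dim=2)     # attention over time
        mu = torch.sum(alpha * h, dim=2)                     # weighted mean
        var = torch.sum(alpha * h * h, dim=2) - mu ** 2
        sigma = torch.sqrt(var.clamp(min=1e-8))              # weighted std
        return torch.cat([mu, sigma], dim=1)                 # (batch, 2D)

pooled = AttentiveStatsPooling(768)(torch.randn(2, 768, 300))
embedding = nn.Linear(2 * 768, 192)(pooled)   # speaker embedding (size assumed)
```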
2.2. TS-ConvNeXt block

As shown in Fig. 1, the backbone consists of three stages of TS-ConvNeXt blocks, and the block is repeated $B$ times in each stage. In the $i$-th stage, $i \in \{1, 2, 3\}$, the input representation of the $b$-th block, $\mathbf{x}_b^{(i)}$, is processed as follows:
$$\mathbf{x}_b' = \mathbf{x}_b^{(i)} + \mathrm{MSC}\big(\mathrm{LN}(\mathbf{x}_b^{(i)})\big), \qquad \mathbf{x}_{b+1}^{(i)} = \mathbf{x}_b' + \mathrm{FFN}\big(\mathrm{LN}(\mathbf{x}_b')\big),$$
for $b = 1, \ldots, B$. The output of the $B$-th block becomes the input of the MFA layer as well as of the first block of the next stage: $\mathbf{x}_1^{(i+1)} = \mathbf{x}_{B+1}^{(i)}$. The MSC module is
Fig. 2: Block diagrams of (a) the conventional Res2Conv module in the Res2Net block and (b) the MSC module in the TS-ConvNeXt block.
used to capture inter-frame temporal contexts, while the FFN module processes frame-wise features independently.
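A minimal sketch of this two-step residual structure is given below, assuming the pre-norm placement drawn in Fig. 1 and Fig. 3(d); MSC and FFN are treated as interchangeable modules mapping (batch, C, T) to (batch, C, T), e.g., the sketches that follow.

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel axis of a (batch, C, T) tensor."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
    def forward(self, x):
        return self.norm(x.transpose(1, 2)).transpose(1, 2)

class TSConvNeXtBlock(nn.Module):
    """Two-step block sketch: temporal mixing (MSC) then frame-wise FFN,
    each behind a pre-norm residual connection, analogous to MHSA + FFN
    in a Transformer layer."""
    def __init__(self, channels: int, msc: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.msc = ChannelLayerNorm(channels), msc
        self.norm2, self.ffn = ChannelLayerNorm(channels), ffn
    def forward(self, x):                      # x: (batch, C, T)
        x = x + self.msc(self.norm1(x))        # inter-frame (temporal) context
        x = x + self.ffn(self.norm2(x))        # intra-frame (channel) context
        return x

# toy usage with stand-in sub-modules
blk = TSConvNeXtBlock(256,
                      msc=nn.Conv1d(256, 256, 65, padding=32, groups=256),
                      ffn=nn.Conv1d(256, 256, kernel_size=1))
print(blk(torch.randn(2, 256, 300)).shape)     # torch.Size([2, 256, 300])
```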
To effectively capture multi-scale features, we have replaced the Res2Net module [7] of Fig. 2(a) with $s$ parallel DConv1D layers in the MSC module, each with a different kernel size $k_j$, where $s$ is the scale factor, as shown in Fig. 2(b). For optimized training at each scale, the input feature of the MSC module is projected to a reduced dimension $C'$ before each DConv1D layer. Then, these multi-scale features are concatenated into a dimension of $sC'$ and once again projected to $C$ by a PConv1D after GELU activation. The mechanism of MSC is simple but similar to multi-head self-attention (MHSA) in the Transformer [21]; the only difference is that the attention operation of MHSA is replaced with DConv1D in MSC.
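A possible realization of the MSC module is sketched below; the default branch dimension $C' = C/s$ is an assumption for illustration, and the kernel set {7, 65} only mirrors the setting reported with Table 2.

```python
import torch
import torch.nn as nn

class MSC(nn.Module):
    """Multi-scale convolution sketch: one depth-wise Conv1d branch per
    kernel size, each on a reduced channel dimension C', then concatenate
    and project back to C with a point-wise convolution."""
    def __init__(self, channels: int, kernel_sizes=(7, 65), branch_dim: int = None):
        super().__init__()
        s = len(kernel_sizes)
        branch_dim = branch_dim or channels // s             # assumption: C' = C / s
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, branch_dim, kernel_size=1),         # reduce to C'
                nn.Conv1d(branch_dim, branch_dim, kernel_size=k,
                          padding=k // 2, groups=branch_dim),           # depth-wise, scale k
            )
            for k in kernel_sizes
        )
        self.act = nn.GELU()
        self.proj = nn.Conv1d(s * branch_dim, channels, kernel_size=1)  # back to C

    def forward(self, x):                                    # x: (batch, C, T)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.proj(self.act(y))

print(MSC(256)(torch.randn(2, 256, 300)).shape)   # torch.Size([2, 256, 300])
```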
On the other hand, the FFN includes two PConv1D layers with an expansion factor of 4 and GELU activation. In addition, GRN [19] is adopted in the FFN to enhance the contrast of individual channels. GRN is a simple and efficient method, as it does not require additional parameter layers. Specifically, when the hidden feature after the GELU activation is given as $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_{4C}]$ with channel vectors $\mathbf{z}_c \in \mathbb{R}^{T}$, GRN calculates the L2-norm across the temporal dimension into a vector $\mathbf{g} = [\,\|\mathbf{z}_1\|_2, \ldots, \|\mathbf{z}_{4C}\|_2\,]$ and applies a response normalization function to the aggregated values as $\mathcal{N}(g_c) = g_c / \|\mathbf{g}\|_1$, where $\|\mathbf{g}\|_1$ is the L1-norm of the vector $\mathbf{g}$. Then, the original features are calibrated using the values from the response normalization function with a skip connection, which is calculated as
$$\hat{\mathbf{z}}_c = \gamma \big( \mathbf{z}_c \cdot \mathcal{N}(g_c) \big) + \beta + \mathbf{z}_c,$$
with trainable parameters $\gamma$ and $\beta$ for the affine transformation, whose initial values are set to zero. This allows the GRN to operate as a bypass at the initial step and adapt gradually as the training progresses.
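The GRN operation above translates almost directly into code; the sketch below assumes (batch, channels, time) tensors, per-channel affine parameters, and a small epsilon in the denominator for numerical safety.

```python
import torch
import torch.nn as nn

class GRN1d(nn.Module):
    """Global response normalization for 1D features (sketch).
    Aggregates an L2-norm per channel over time, normalizes it by the
    L1-norm over channels, and rescales the features with a learnable,
    zero-initialized affine transform plus a skip connection."""
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))
        self.eps = eps

    def forward(self, z):                                    # z: (batch, C', T)
        g = z.norm(p=2, dim=2, keepdim=True)                 # L2 over time: (batch, C', 1)
        n = g / (g.sum(dim=1, keepdim=True) + self.eps)      # divide by L1 over channels
        return self.gamma * (z * n) + self.beta + z          # calibrate + skip

print(GRN1d(1024)(torch.randn(2, 1024, 300)).shape)   # torch.Size([2, 1024, 300])
```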
2.3. Rationale of the backbone block design
Fig. 3 shows the block structure designs of the conventional and proposed blocks.
Fig. 3: Block designs of the conventional (a) 1D SE-Res2Net in ECAPA-TDNN, (b) Transformer, and (c) 1D ConvNeXt, and the proposed (d) 1D TS-ConvNeXt in NeXt-TDNN and (e) 1D TS-ConvNeXt-l in NeXt-TDNN-l.
The SE-Res2Net block [7] in Fig. 3(a) follows the conventional ResNet structure and consists of two PConv1D layers to control the channel dimension, each followed by Batch Normalization (BN) and ReLU activation. On the other hand, it replaces the standard convolution in ResNet with the Res2 dilated Conv1D layer of Fig. 2(a) to capture multi-scale contexts and adds an SE block to attend to specific channels. Meanwhile, the traditional Transformer [21] has two separate sub-modules, MHSA and FFN, as shown in Fig. 3(b). The ConvNeXt(-V2) block [18, 19] in Fig. 3(c) was therefore designed by modifying the ResNet block toward the structure of the Transformer. In particular, after changing the standard convolution of ResNet to DConv, the ConvNeXt block moves the DConv with a larger kernel size up front, expecting it to play a role similar to MHSA in the Transformer. It also uses an inverted bottleneck structure with fewer normalizations and activations, as in the FFN of the Transformer. In ConvNeXt-V2 [19], GRN was newly proposed and introduced after the GELU activation in the inverted bottleneck of ConvNeXt to enhance the performance.
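For reference, a 1D adaptation of this generic ConvNeXt block might look as follows; it is only meant to contrast with the two-step TS-ConvNeXt sketch above, and the GRN slot of ConvNeXt-V2 is left as an identity here.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """1D ConvNeXt-style block sketch: a large-kernel depth-wise Conv1d,
    LayerNorm, an inverted-bottleneck FFN (expand x4) with GELU, and a
    single residual connection around the whole block."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.dconv = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=channels)
        self.norm = nn.LayerNorm(channels)
        self.pconv1 = nn.Linear(channels, 4 * channels)
        self.act = nn.GELU()
        self.grn = nn.Identity()    # ConvNeXt-V2 would place GRN here
        self.pconv2 = nn.Linear(4 * channels, channels)

    def forward(self, x):                          # x: (batch, C, T)
        h = self.dconv(x).transpose(1, 2)          # (batch, T, C)
        h = self.pconv2(self.grn(self.act(self.pconv1(self.norm(h)))))
        return x + h.transpose(1, 2)

print(ConvNeXtBlock1d(256)(torch.randn(2, 256, 300)).shape)   # torch.Size([2, 256, 300])
```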
To modernize the backbone block from the SE-Res2Net of ECAPA-TDNN, we designed TS-ConvNeXt from ConvNeXt by considering the structures of SE-Res2Net and the Transformer more proactively, as shown in Fig. 3(d). Specifically, we separated the temporal block (MSC) and the position-wise processing block (FFN) so that they operate in a two-step way, and introduced temporal multi-scale features in the MSC module. This modified design is closer to the Transformer structure and can provide the benefit of more flexible training for inter- and intra-frame contexts. In terms of SE-Res2Net, MSC may replace Res2Net for multi-scale temporal modeling, and the FFN with GRN can be used instead of the SE block for position-wise feature processing. Additionally, we considered the TS-ConvNeXt-l block in Fig. 3(e) as a light version of TS-ConvNeXt: TS-ConvNeXt-l uses a single DConv1D layer instead of the MSC module.
3. EXPERIMENTAL SETUP
3.1. Dataset and metrics
In our experiments, training and evaluation were based on the VoxCeleb1 [22] and VoxCeleb2 [13] datasets. VoxCeleb1 includes two subsets: a development set with 1,211 speakers and an evaluation set with 40 speakers. VoxCeleb2 is likewise divided into development and evaluation sets with 5,994 and 118 speakers, respectively. As the training dataset, we used the development set of VoxCeleb2. For evaluation, we used three trial lists based on the VoxCeleb1 dataset: VoxCeleb1-O, with only the original test set; VoxCeleb1-E, with the entire dataset including the development and test sets; and VoxCeleb1-H, whose trial pairs are selected from VoxCeleb1-E to have the same nationality and gender.
For evaluation, we measured the performance of the models by the equal error rate (EER) and the minimum normalized detection cost function (minDCF). Scores were calculated using the cosine distance of an embedding pair, and score normalization was applied with adaptive s-norm [23]; 6,000 utterances were picked from the training set as an imposter cohort, and the 300 top imposter scores were used for the score normalization.
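A hedged sketch of this scoring procedure is shown below; the function names are ours, and only the cohort size (6,000) and the top-300 selection follow the text.

```python
import torch
import torch.nn.functional as F

def cosine_score(e1, e2):
    """Cosine similarity between two speaker embeddings."""
    return F.cosine_similarity(e1, e2, dim=-1)

def adaptive_snorm(raw, enroll, test, cohort, topk=300):
    """Adaptive s-norm sketch: normalize a raw trial score using the top-k
    imposter scores of each side against a cohort of embeddings."""
    def stats(emb):
        scores = F.cosine_similarity(emb.unsqueeze(0), cohort, dim=-1)
        top = scores.topk(topk).values            # top imposter scores
        return top.mean(), top.std()
    mu_e, sd_e = stats(enroll)
    mu_t, sd_t = stats(test)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

# toy usage with random 192-dim embeddings and a 6,000-utterance cohort
enroll, test = torch.randn(192), torch.randn(192)
cohort = torch.randn(6000, 192)
print(adaptive_snorm(cosine_score(enroll, test), enroll, test, cohort))
```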
3.2. Configurations
As an input feature, 80 log-Mel filterbanks were extracted from a spectrogram computed with an FFT size of 512 using a Hamming analysis window. We trained the network on randomly cropped 3-s segments using AAM-softmax [24] with a margin of 0.3 and a scale of 40. For the proposed NeXt-TDNN, the output dimension $D$ of the MFA layer was set equal to its input dimension, i.e., the $3C$-dimensional concatenation of the three stage outputs. For the MSC module in the TS-ConvNeXt block, the projection dimension $C'$ of each DConv1D branch was set to a value smaller than $C$.
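For illustration, an AAM-softmax head with the stated margin and scale could be written as below; the 192-dimensional embedding is an assumption, the 5,994 classes correspond to the VoxCeleb2 development speakers, and toolkit implementations may add further details (e.g., easy-margin handling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax head (sketch): cosine logits with a
    margin m added to the target-class angle, scaled by s."""
    def __init__(self, emb_dim: int, num_classes: int, margin=0.3, scale=40.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (batch, classes)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)                           # add angular margin
        one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (one_hot * target + (1.0 - one_hot) * cos)
        return F.cross_entropy(logits, labels)

loss = AAMSoftmax(192, 5994)(torch.randn(8, 192), torch.randint(0, 5994, (8,)))
```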
Following ECAPA-TDNN [6], we also utilized the Kaldi recipe [4] for data augmentation with noise sources from MUSAN [25] and RIRs [26]. SpecAugment [27] was applied as well. We set the batch size to 500 for the mobile model size and 300 for the base models, with the learning rates initialized accordingly for each [28]. The models were trained for up to 200 epochs, and the learning rate was multiplied by 0.8 every 10 epochs. As the optimizer, AdamW was used with a weight decay of 0.01, and gradient clipping with a maximum L2-norm of 1 was applied for stable training.
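The stated optimization recipe could be wired up as in the following sketch; the initial learning rate and the stand-in network are assumptions, since the actual values are model-dependent.

```python
import torch
import torch.nn as nn

# stand-in network and loss; the real model would be NeXt-TDNN with an AAM-softmax head
model = nn.Sequential(nn.Linear(80, 192), nn.ReLU(), nn.Linear(192, 5994))
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)    # initial lr assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)  # x0.8 every 10 epochs

train_loader = [(torch.randn(4, 80), torch.randint(0, 5994, (4,)))]  # toy batch

for epoch in range(200):
    for features, labels in train_loader:
        loss = criterion(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # stable training
        optimizer.step()
    scheduler.step()
```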
Table 2: Evaluation of conventional and proposed models on the VoxCeleb1-O, E, and H datasets. As conventional models, we considered Fast ResNet-34 [16], ECAPA-TDNN [6], and EfficientTDNN [9], an improved version of ECAPA-TDNN. The kernel size of the TS-ConvNeXt-l block was set to 65 in NeXt-TDNN-l. For the MSC module of the TS-ConvNeXt block in the NeXt-TDNN model, the scale factor and the set of multi-scale kernel sizes were set to 2 and {7, 65}, respectively. MACs and RTF were measured on 3-s segments; the RTF was measured by sufficiently repeating the inference in a test environment with an Nvidia GeForce GTX 1080 Ti.
Model | Params | MACs | RTF | VoxCeleb1-O EER(%) / minDCF | VoxCeleb1-E EER(%) / minDCF | VoxCeleb1-H EER(%) / minDCF
Fast ResNet-34 [16] | 1.4 M | 0.675 G | 1.67 | 2.08 / 0.2729 | 2.18 / 0.2632 | 4.19 / 0.3797
ECAPA-TDNN [6] | 1.9 M | 0.410 G | 1.60 | 1.56 / 0.1551 | 1.56 / 0.1656 | 2.70 / 0.2512
EfficientTDNN-Mobile [9] | 2.4 M | 0.574 G | 1.20 | 1.41 / 0.1247 | 1.53 / 0.1654 | 2.72 / 0.2526
NeXt-TDNN-l | 1.6 M | 0.417 G | 0.51 | 1.39 / 0.1304 | 1.52 / 0.1497 | 2.49 / 0.2274
NeXt-TDNN-l | 1.6 M | 0.441 G | 0.89 | 1.10 / 0.1079 | 1.24 / 0.1334 | 2.12 / 0.2006
NeXt-TDNN | 1.8 M | 0.478 G | 0.63 | 1.31 / 0.1319 | 1.39 / 0.1409 | 2.28 / 0.2073
NeXt-TDNN | 1.9 M | 0.519 G | 1.29 | 1.03 / 0.0954 | 1.17 / 0.1260 | 1.98 / 0.1903
ECAPA-TDNN [6] | 6.2 M | 1.569 G | 1.80 | 1.13 / 0.1118 | 1.36 / 0.1464 | 2.44 / 0.2368
EfficientTDNN-Base [9] | 5.8 M | 1.450 G | 1.32 | 0.96 / 0.0924 | 1.20 / 0.1296 | 2.17 / 0.2073
NeXt-TDNN-l | 5.9 M | 1.609 G | 0.63 | 1.05 / 0.0957 | 1.18 / 0.1208 | 2.02 / 0.1882
NeXt-TDNN-l | 6.0 M | 1.695 G | 0.88 | 0.81 / 0.0909 | 1.04 / 0.1157 | 1.86 / 0.1844
NeXt-TDNN | 6.7 M |  | 0.71 | 0.93 / 0.0833 | 1.11 / 0.1160 | 1.89 / 0.1758
NeXt-TDNN | 7.1 M | 2.027 G | 1.31 | 0.79 / 0.0865 | 1.04 / 0.1152 | 1.82 / 0.1818
Table 1: Evaluation on VoxCeleb1-O for the proposed NeXt-TDNN depending on the backbone structure. $k$ denotes the kernel size (or the set of sizes) in the DConv1D layer.

Backbone block | k | GRN | Params | EER(%) | minDCF
ConvNeXt | 7 |  | 5.9 M | 1.08 | 0.0956
ConvNeXt | 7 |  | 5.9 M | 1.00 | 0.1153
ConvNeXt | 65 |  | 6.0 M | 1.05 | 0.1283
TS-ConvNeXt-l | 7 |  | 5.9 M | 1.08 | 0.1016
TS-ConvNeXt-l | 7 |  | 5.9 M | 0.96 | 0.1039
TS-ConvNeXt-l | 65 |  | 6.0 M | 0.81 | 0.0910
TS-ConvNeXt | 65 |  | 7.2 M | 0.84 | 0.0884
TS-ConvNeXt |  |  | 7.1 M |  | 0.0865
TS-ConvNeXt |  |  | 7.1 M | 0.84 | 
4. EXPERIMENTAL RESULTS
First, we validated the effectiveness of the TS-ConvNeXt block in Table 1. In general, using GRN improved the performance in terms of the EER by emphasizing the frame-level features based on the global temporal context, similar to the SE block in ECAPA-TDNN. Also, contrary to ConvNeXt, TS-ConvNeXt-l showed its effectiveness with the larger kernel size of 65, which demonstrates that simply dividing ConvNeXt into two sub-modules is appropriate for a large kernel. With the TS-ConvNeXt block, using a multi-scale kernel set combining small and large kernels further improved the results over a single kernel of 65.
In Table 2, we evaluated the proposed NeXt-TDNN and NeXt-TDNN-l in mobile and base model sizes. We also measured the multiply-accumulate operations (MACs) to compare the computational costs and the real-time factors (RTFs) to evaluate the inference speed. In particular, we considered two configurations for the proposed NeXt-TDNN and NeXt-TDNN-l: a faster version with fewer blocks per stage and a deeper one with more blocks, while arranging the corresponding channel dimension $C$ to keep the parameter sizes similar.
Compared to the conventional models, our models consistently achieved improved performance in both the mobile and base model sizes. In particular, our NeXt-TDNN achieved better results at more than two times faster inference than ECAPA-TDNN. Furthermore, NeXt-TDNN with the MSC module generally improved the performance compared to NeXt-TDNN-l. Finally, it is noteworthy that the mobile NeXt-TDNN achieved better results than even the base ECAPA-TDNN while being about three times smaller in both the parameter size and the computational cost (1.9 M vs. 6.2 M and 0.519 G vs. 1.569 G). Because the inference was performed on a GPU, the RTF was not significantly affected by the channel dimension $C$ thanks to parallel computing, but rather by the block repetition $B$.
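As a rough guide, the RTF on 3-s segments could be measured as sketched below; the 10-ms frame shift, warm-up count, and number of repetitions are assumptions, and the toy usage runs on CPU with a stand-in model.

```python
import time
import torch

def measure_rtf(model, segment_seconds=3.0, n_mels=80, frame_shift=0.01,
                repeats=100, device="cpu"):
    """Real-time-factor sketch: average inference time on a 3-s feature
    segment divided by the segment duration (10-ms shift assumed)."""
    def sync():
        if str(device).startswith("cuda"):
            torch.cuda.synchronize()
    model = model.to(device).eval()
    frames = int(segment_seconds / frame_shift)              # e.g. 300 frames
    features = torch.randn(1, n_mels, frames, device=device)
    with torch.no_grad():
        for _ in range(10):                                  # warm-up runs
            model(features)
        sync()
        start = time.perf_counter()
        for _ in range(repeats):
            model(features)
        sync()
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed / segment_seconds

# toy usage with a stand-in 1D-conv model; Table 2 used a GTX 1080 Ti (device="cuda")
print(measure_rtf(torch.nn.Conv1d(80, 192, kernel_size=5, padding=2)))
```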
5. CONCLUSION
We have modernized the block design of TDNNs for speaker verification starting from the popular ECAPA-TDNN. Inspired by the structure of the Transformer and recent ConvNets, we replaced the SE-Res2Net block in ECAPA-TDNN with the novel TS-ConvNeXt block. This block consists of two separated sub-modules, MSC and FFN, which capture multi-scale temporal contexts and position-wise channel contexts, respectively. Additionally, we introduced GRN in the FFN modules to enhance feature contrast. Experimental results showed that NeXt-TDNN with the modernized TS-ConvNeXt block is effective for speaker verification.
6. REFERENCES
[1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE TASLP, vol. 19, no. 4, pp. 788-798, 2011.
[2] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. of ICASSP, 2014, pp. 4052-4056.
[3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. of ICASSP, 2018, pp. 5329-5333.
[4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in Proc. of ICASSP, 2019, pp. 5796-5800.
[5] Daniel Garcia-Romero, Alan McCree, David Snyder, and Gregory Sell, "JHU-HLTCOE system for the VoxSRC speaker recognition challenge," in Proc. of ICASSP, 2020, pp. 7559-7563.
[6] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, 2020, pp. 3830-3834.
[7] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr, "Res2Net: A new multi-scale backbone architecture," IEEE TPAMI, vol. 43, no. 2, pp. 652-662, 2021.
[8] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in Proc. of CVPR, June 2018.
[9] Rui Wang, Zhihua Wei, Haoran Duan, Shouling Ji, Yang Long, and Zhen Hong, "EfficientTDNN: Efficient architecture search for speaker recognition," IEEE/ACM TASLP, vol. 30, pp. 2267-2279, 2022.
[10] Jee-weon Jung, Youjin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, and Joon Son Chung, "Pushing the limits of raw waveform speaker recognition," in Proc. Interspeech, 2022, pp. 2228-2232.
[11] Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, and Helen Meng, "MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification," in Proc. Interspeech, 2022, pp. 306-310.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. of CVPR, June 2016.
[13] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018, pp. 1086-1090.
[14] Weicheng Cai, Jinkun Chen, and Ming Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74-81.
[15] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," 2019.
[16] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee-Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han, "In defence of metric learning for speaker recognition," in Proc. Interspeech, 2020, pp. 2977-2981.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2021.
[18] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie, "A ConvNet for the 2020s," in Proc. of CVPR, June 2022, pp. 11976-11986.
[19] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie, "ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders," in Proc. of CVPR, June 2023, pp. 16133-16142.
[20] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech, 2018, pp. 2252-2256.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30, Curran Associates, Inc., 2017.
[22] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017, pp. 2616-2620.
[23] Pavel Matějka, Ondřej Novotný, Oldřich Plchot, Lukáš Burget, Mireia Diez Sánchez, and Jan Černocký, "Analysis of score normalization in multilingual speaker recognition," in Proc. Interspeech, 2017, pp. 1567-1571.
[24] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. of CVPR, June 2019.
[25] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," 2015.
[26] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 60, no. S1, pp. S9-S9, 2005.
[27] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019, pp. 2613-2617.
[28] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," 2018.
[29] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio, "SpeechBrain: A general-purpose speech toolkit," 2021, arXiv:2106.04624.
This work was supported by Hyundai Motor Company.