Comparison of reference-based and single-ended prediction models for perceived listening effort
Jan Reimes^1, Rainer Huber^2, Jan Rennies^2
^1 HEAD acoustics GmbH, Herzogenrath, Germany
^2 Fraunhofer IDMT, Oldenburg, Germany
Abstract
This study presents a comparative analysis of two prediction models for perceived listening effort: the standardized model according to ETSI TS 103558 [1], referred to as Assessment of Binaural Listening Effort (ABLE), and a novel single-ended model, Listening Effort Prediction from Acoustic Parameters (LEAP). Although the two models were developed for different use cases, several overlapping scenarios are applicable to both.
ABLE utilizes both a binaural recording and the original reference signal as input for the prediction. It was trained on a comprehensive set of listening test databases and was formally validated across multiple applications. In contrast, LEAP operates without the reference signal, relying solely on the degraded recording. It employs a deep neural network-based automatic speech recognition engine to calculate phoneme posterior probabilities and to quantify their temporal smearing.
Listening effort predictions are compared to auditory data that is unseen to both models and contains recordings from typical real-world scenarios. Besides considering consistency, accuracy, and robustness, this analysis aims to highlight the trade-offs in performance between reference-based and single-ended approaches, as well as to provide insights into their respective strengths and possible limitations.
1 Introduction
Modern speech communication systems are increasingly challenged by complex acoustic environments, where users face cognitive load due to noise, reverberation, and/or speech distortions. These degradations can originate from environmental factors (e.g., background noise, room reverberation), network impairments, or far-end processing artifacts. Traditional metrics such as intelligibility tests, while widely adopted, fail to fully capture the comprehension range of the listeners and typically saturate already at medium-to-high Signal-to-Noise Ratios (SNRs), as illustrated in Figure 1.
Several studies, e.g., [2-4], have demonstrated that Listening Effort (LE) provides a more sensitive measure of the perceived speech, as finer-grained assessments across a broader SNR range can be obtained.
The latest developments and trends in speech communication have intensified the need for objective prediction models. Subjective evaluations, though reliable, are resource-intensive and impractical for large-scale applications. Instead, instrumental methods that account for diverse
Figure 1: Speech Intelligibility (SI) and LE versus SNR. Colored areas indicate the non-saturated operational range per measure, e.g., for possible improvements/testing.
acoustic paths, user behaviors, and real-world impairments are essential. Such models must balance accuracy, robustness and versatility to address scenarios ranging from controlled laboratory settings to dynamic live environments.
2 Prediction models
2.1 ABLE
Assessment of Binaural Listening Effort (ABLE) is a prediction model for the perceived binaural listening effort and is specified in ETSI TS 103558 [1]. Similar to commonly used speech quality prediction models (e.g., [5] or [6]), the reference signal is required as an input to the model. As illustrated in the overview in Figure 2, it consists of several stages:
- Pre-processing: Delay compensation between reference and degraded signals, and internal signal calibration.
Figure 2: Overview of the ABLE prediction model
Table 1: Listening-effort scale of the ACR test according to ITU-T P.800 [7]

Value  Category
5      Complete relaxation possible; No effort required
4      Attention necessary; no appreciable effort required
3      Moderate effort required
2      Considerable effort required
1      No meaning understood with any feasible effort
- Separation into speech and noise components.
- Binaural processing: A short-time equalization-cancellation model is applied.
- Metric calculations: Several independent measures are extracted from the hearing model spectra.
- Metric combination: The estimated LE is obtained from the measures by applying a Random-Forest Regression (see the sketch following this list).
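The final combination stage can be pictured as a regression from a handful of per-stimulus measures to a single $\mathrm{MOS}_{\mathrm{LE}}$ value. The following is a minimal, illustrative sketch of such a stage, assuming scikit-learn; the measures, training data, and forest parameters below are placeholders and do not reproduce the standardized model from [1].

```python
# Illustrative sketch of ABLE's metric-combination stage (Section 2.1):
# several independent measures extracted from the hearing-model spectra are
# mapped to a MOS_LE estimate with a random-forest regression. Feature values,
# labels, and hyper-parameters are placeholders, not the standardized model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-in training data: one row per stimulus, one column per extracted
# measure (e.g., SNR-, loudness-, and distortion-related metrics -- hypothetical).
X_train = rng.random((500, 6))
y_train = 1.0 + 4.0 * rng.random(500)   # stand-in auditory MOS_LE labels (1..5)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Predict MOS_LE for unseen stimuli and keep the result within the ACR range.
X_test = rng.random((3, 6))
mos_le = np.clip(forest.predict(X_test), 1.0, 5.0)
print(mos_le)
```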
The standard also specifies the corresponding listening test design that is modeled by the predictor. The attributes of the Absolute Category Rating (ACR) test according to ITU-T P.800 [7] are provided in Table 1; results are thus reported in terms of the Mean Opinion Score (MOS), i.e., $\mathrm{MOS}_{\mathrm{LE}}$.
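As a brief, hypothetical worked example (the votes are invented for illustration): if five listeners rate a condition with 4, 5, 3, 4, and 4 on the scale of Table 1, the reported score is

$$\mathrm{MOS}_{\mathrm{LE}} = \frac{4 + 5 + 3 + 4 + 4}{5} = 4.0 .$$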
The model is trained on a comprehensive set of auditory databases (about 10 k samples). Scope, usage scenarios and performance were formally and independently validated within ETSI.
2.2 LEAP
Listening Effort Prediction from Acoustic Parameters (LEAP) is the result of various contributions based on work such as [9-14] and others. The single-ended model leverages an Automatic Speech Recognition (ASR) back-end to predict LE without the use of reference signals. Its core innovation lies in quantifying the temporal smearing of phoneme posteriorgrams. An algorithmic overview is illustrated in Figure 3:
- Feature extraction: Log-Mel spectrograms are fed into a Deep Neural Network (DNN), trained on about 8 k hours of German speech, to estimate phoneme probabilities.
- Temporal smearing metric: The Mean Temporal Distance (MTD) [9] measures the dispersion of phoneme probabilities over time, which correlates with effort (lower MTD = higher smearing = higher effort).
- Mapping to Effort Scale Categorial Unit (ESCU): A linear function converts the MTD, also denoted as M-measure ($\bar{M}$), to ESCU [8], ranging from 1 (no effort) to 14 (no target speech audible), as shown in Table 2. A minimal sketch of this back-end is given below.

Figure 3: Overview of the LEAP prediction model
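The back-end of this pipeline can be sketched in a few lines. The following is a minimal, illustrative implementation assuming NumPy: the divergence measure (symmetric Kullback-Leibler), the lag range, and the mapping coefficients A_SLOPE and B_OFFSET are assumptions for illustration and are not the published LEAP parameters.

```python
# Illustrative sketch of the LEAP back-end (Section 2.2): given a phoneme
# posteriorgram from an ASR acoustic model, compute the Mean Temporal Distance
# (M-measure) and map it linearly to the ESCU scale. Divergence choice, lag
# range, and mapping coefficients are placeholders for illustration only.
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def mean_temporal_distance(posteriorgram, max_lag_frames=50):
    """M-measure: average divergence between posteriors at increasing lags.

    posteriorgram: array of shape (T, n_phonemes); each row holds the
    frame-wise phoneme posterior probabilities (rows sum to 1).
    """
    T = posteriorgram.shape[0]
    per_lag = []
    for lag in range(1, min(max_lag_frames, T - 1) + 1):
        dists = [symmetric_kl(posteriorgram[t], posteriorgram[t + lag])
                 for t in range(T - lag)]
        per_lag.append(np.mean(dists))
    return float(np.mean(per_lag))   # \bar{M}: mean over all considered lags

# Hypothetical linear mapping to ESCU (coefficients are placeholders; the
# actual values are obtained by fitting to auditory data).
A_SLOPE, B_OFFSET = -2.0, 15.0

def map_to_escu(m_bar):
    """Lower M-measure -> more temporal smearing -> higher listening effort."""
    return float(np.clip(A_SLOPE * m_bar + B_OFFSET, 1.0, 14.0))

# Example with a random posteriorgram standing in for real ASR output:
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=300)   # 300 frames, 40 phoneme classes
print(map_to_escu(mean_temporal_distance(post)))
```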
3 Auditory data for comparison
To ensure a fair comparison between ABLE and LEAP, auditory data that is unseen to both models is required. For the prediction with ABLE, the reference signals also need to be available. Suitable auditory databases meeting both requirements can be found in Annex D of ETSI TS 103558 [1], which was used to validate ABLE for the applications listed therein. The data consists of different heterogeneous real-world scenarios, which include various stimuli of binaurally presented speech under noisy conditions. As detailed in Table 3, the evaluation covered four scenarios:
- Active Noise Cancellation (ANC) headsets: typical ambient noise scenarios, speech presented via headset or external sound source.
- In-car Communication (ICC): low to high driving conditions, various settings of speech processing, different talker/listener positions.
- Handset (HS): mobile phone mounted to the ear, multiple typical ambient noises, speech of different audio bandwidths/codecs in the down-link.
- Handheld Hands-free (HHHF): mobile phone in loudspeaker mode, multiple typical ambient noises and reverberation, speech of different audio bandwidths/codecs in the down-link.
See Annex D of [1] for a detailed description of the four application scenarios and the conducted auditory experiments.