Abstract
Human listeners often have difficulty understanding speech in the presence of background noise in everyday communication environments. Recently, deep neural network (DNN)-based techniques have been successfully applied to speech enhancement and have achieved significant improvements over conventional approaches. However, existing DNN-based methods usually minimize a log-power-spectral or masking-based mean squared error (MSE) between the enhanced output and the training target (e.g., the ideal ratio mask (IRM) of the clean speech), a criterion that is not closely related to human auditory perception. In this letter, a modified Bark spectral distortion loss function, which can be regarded as an auditory-perception-based MSE, is proposed to replace the conventional MSE in DNN-based speech enhancement and thereby further improve objective perceptual quality. Experimental results show that, compared with DNN-based methods using the conventional MSE criterion, the proposed method achieves improved speech enhancement performance in all experimental settings, especially in terms of objective perceptual quality.
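To make the idea concrete, the following is a minimal NumPy sketch of a Bark-spectral-distortion-style training loss: frame power spectra are aggregated into Bark bands and compared after a loudness-like compression. The Traunmüller Bark approximation, the rectangular band aggregation, the 24-band count, and the 0.23 loudness exponent are illustrative assumptions, not the exact formulation used in the letter.

```python
import numpy as np

def bark_filterbank(n_fft_bins, sr, n_bark=24):
    """Build a simple rectangular Bark-band aggregation matrix.
    Band edges follow the Traunmüller approximation of the Bark
    scale; this is an illustrative choice, not the paper's exact
    filterbank."""
    freqs = np.linspace(0, sr / 2, n_fft_bins)
    bark = 26.81 * freqs / (1960.0 + freqs) - 0.53  # Traunmüller formula
    bark = np.maximum(bark, 0.0)                    # clamp DC region to band 0
    fb = np.zeros((n_bark, n_fft_bins))
    for b in range(n_bark):
        fb[b, (bark >= b) & (bark < b + 1)] = 1.0
    return fb

def bsd_loss(clean_power, enhanced_power, fb, eps=1e-10):
    """Bark-spectral-distortion-style loss: MSE between Bark-band
    loudness spectra. Inputs are (frames, fft_bins) power spectra;
    power ** 0.23 roughly approximates Zwicker loudness."""
    clean_loud = (fb @ clean_power.T + eps) ** 0.23
    enh_loud = (fb @ enhanced_power.T + eps) ** 0.23
    return np.mean((clean_loud - enh_loud) ** 2)
```

In a DNN training setting this computation would be expressed in an autograd framework so gradients can flow through the filterbank and compression back to the network's enhanced spectrum; the NumPy version above only illustrates the forward computation of the perceptual distance.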
Acknowledgements
Funding was provided by National Natural Science Foundation of China (Grant No. 61501072).
Cite this article
Shu, X., Zhou, Y., Liu, H. et al. A Human Auditory Perception Loss Function Using Modified Bark Spectral Distortion for Speech Enhancement. Neural Process Lett 51, 2945–2957 (2020). https://doi.org/10.1007/s11063-020-10212-z