Abstract
Human listeners often have difficulty understanding speech in the presence of background noise in everyday communication environments. Recently, deep neural network (DNN)-based techniques have been successfully applied to speech enhancement and have achieved significant improvements over conventional approaches. However, existing DNN-based methods usually minimize a log-power-spectral or masking-based mean squared error (MSE) between the enhanced output and the training target (e.g., the ideal ratio mask (IRM) of the clean speech), a criterion that is not closely related to human auditory perception. In this letter, a modified Bark spectral distortion loss function, which can be regarded as an auditory-perception-based MSE, is proposed to replace the conventional MSE in DNN-based speech enhancement and thereby further improve objective perceptual quality. Experimental results show that, compared with DNN-based methods using the conventional MSE criterion, the proposed method achieves improved speech enhancement performance in all experimental settings, especially in terms of objective perceptual quality.
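To make the idea concrete, the following is a minimal NumPy sketch of a Bark-spectral-distortion-style training loss: frame power spectra are aggregated into Bark bands and compared after a loudness-like compression. The Traunmüller Bark approximation, the rectangular band aggregation, the 24-band count, and the 0.23 loudness exponent are illustrative assumptions, not the exact formulation used in the letter.

```python
import numpy as np

def bark_filterbank(n_fft_bins, sr, n_bark=24):
    """Build a simple rectangular Bark-band aggregation matrix.
    Band edges follow the Traunmüller approximation of the Bark
    scale; this is an illustrative choice, not the paper's exact
    filterbank."""
    freqs = np.linspace(0, sr / 2, n_fft_bins)
    bark = 26.81 * freqs / (1960.0 + freqs) - 0.53  # Traunmüller formula
    bark = np.maximum(bark, 0.0)                    # clamp DC region to band 0
    fb = np.zeros((n_bark, n_fft_bins))
    for b in range(n_bark):
        fb[b, (bark >= b) & (bark < b + 1)] = 1.0
    return fb

def bsd_loss(clean_power, enhanced_power, fb, eps=1e-10):
    """Bark-spectral-distortion-style loss: MSE between Bark-band
    loudness spectra. Inputs are (frames, fft_bins) power spectra;
    power ** 0.23 roughly approximates Zwicker loudness."""
    clean_loud = (fb @ clean_power.T + eps) ** 0.23
    enh_loud = (fb @ enhanced_power.T + eps) ** 0.23
    return np.mean((clean_loud - enh_loud) ** 2)
```

In a DNN training setting this computation would be expressed in an autograd framework so gradients can flow through the filterbank and compression back to the network's enhanced spectrum; the NumPy version above only illustrates the forward computation of the perceptual distance.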
Acknowledgements
Funding was provided by National Natural Science Foundation of China (Grant No. 61501072).
Cite this article
Shu, X., Zhou, Y., Liu, H. et al. A Human Auditory Perception Loss Function Using Modified Bark Spectral Distortion for Speech Enhancement. Neural Process Lett 51, 2945–2957 (2020). https://doi.org/10.1007/s11063-020-10212-z