
A Human Auditory Perception Loss Function Using Modified Bark Spectral Distortion for Speech Enhancement

Neural Processing Letters

Abstract

Human listeners often have difficulty understanding speech in the presence of background noise in everyday communication environments. Recently, deep neural network (DNN)-based techniques have been successfully applied to speech enhancement, achieving significant improvements over conventional approaches. However, existing DNN-based methods usually minimize a log-power-spectral or masking-based mean squared error (MSE) between the enhanced output and the training target (e.g., the ideal ratio mask (IRM) of the clean speech), a criterion that is only loosely related to human auditory perception. In this letter, a modified Bark spectral distortion loss function, which can be regarded as an auditory-perception-based MSE, is proposed to replace the conventional MSE in DNN-based speech enhancement and thereby improve objective perceptual quality. Experimental results show that the proposed method improves speech enhancement performance, especially objective perceptual quality, in all experimental settings when compared with DNN-based methods trained with the conventional MSE criterion.
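To make the stated idea concrete, the following is a minimal NumPy sketch of a generic Bark spectral distortion (BSD) loss: power spectra are pooled into critical (Bark) bands via Zwicker's Hz-to-Bark mapping, compressed by a power-law loudness transform, and the mean squared error is then taken in that loudness domain. This is an illustration under stated assumptions, not the authors' exact modified BSD; the rectangular filterbank, the default of 21 bands, and the loudness exponent alpha = 0.23 are assumptions.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker's Hz-to-Bark (critical-band rate) mapping."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filterbank(n_fft, sr, n_bands=21):
    """Pool FFT bins into Bark bands with a rectangular filterbank
    (the band shape and count are assumptions, not taken from the paper)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    barks = hz_to_bark(freqs)
    edges = np.linspace(barks[0], barks[-1], n_bands + 1)
    fb = np.zeros((n_bands, freqs.size))
    for b in range(n_bands):
        fb[b, (barks >= edges[b]) & (barks < edges[b + 1])] = 1.0
    fb[-1, barks >= edges[-2]] = 1.0  # make sure the top bin is covered
    return fb

def bsd_loss(clean_mag, enhanced_mag, fb, alpha=0.23, eps=1e-10):
    """Bark spectral distortion between clean and enhanced magnitude
    spectrograms of shape (frames, freq_bins).

    alpha = 0.23 is the standard power-law loudness exponent; treat it
    as a tunable assumption."""
    # Band powers -> perceived loudness via power-law compression.
    loud_clean = (clean_mag ** 2 @ fb.T + eps) ** alpha
    loud_enh = (enhanced_mag ** 2 @ fb.T + eps) ** alpha
    # Squared loudness error per frame, normalized by the clean loudness
    # energy (the classic BSD normalization), averaged over frames.
    num = np.sum((loud_clean - loud_enh) ** 2, axis=1)
    den = np.sum(loud_clean ** 2, axis=1) + eps
    return float(np.mean(num / den))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_fft, sr = 512, 16000
    fb = bark_filterbank(n_fft, sr)
    clean = rng.random((100, n_fft // 2 + 1))               # 100 frames of |STFT|
    noisy = clean + 0.1 * rng.random((100, n_fft // 2 + 1)) # lightly distorted
    print("BSD(clean, noisy) =", bsd_loss(clean, noisy, fb))
```

To use such a measure as a training loss for a DNN, the same computation would be written with differentiable tensor operations (e.g., in PyTorch or TensorFlow) so that gradients can propagate from the Bark-loudness error back to the enhancement network, in place of the conventional MSE.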



Acknowledgements

Funding was provided by the National Natural Science Foundation of China (Grant No. 61501072).

Author information


Corresponding author

Correspondence to Yi Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Shu, X., Zhou, Y., Liu, H. et al. A Human Auditory Perception Loss Function Using Modified Bark Spectral Distortion for Speech Enhancement. Neural Process Lett 51, 2945–2957 (2020). https://doi.org/10.1007/s11063-020-10212-z

