A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement

Multimedia Tools and Applications

Abstract

Environmental noise degrades speech intelligibility during phone calls. Even when the phone delivers a clean signal, the listener may still struggle to understand it. Intelligibility enhancement (IENH) is a perceptual enhancement technique for clean speech rendered in noisy environments. This study focuses on IENH by normal-to-Lombard speech conversion, which is inspired by the Lombard reflex. The key step in this conversion is mapping the spectral tilt of normal speech (normal style) to that of Lombard speech (Lombard style). For this purpose, we propose a mapping model that combines linear-prediction-based mapping networks with tilt modification. In contrast to previous studies, we use deep neural networks (DNNs) instead of Gaussian-based models to support higher-dimensional mapping, and we add a tilt modification module to further reduce the mapping errors of formant magnitudes. We use the AVS-M codec and two datasets as the benchmark platform. The evaluation shows that our method outperforms the reference methods in both objective and subjective experiments.
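To make the notion of spectral tilt conversion more concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes the tilt of a frame can be summarized by a first-order linear-prediction (all-pole) filter, and it replaces the paper's DNN mapping network with a fixed, hypothetical target coefficient. The function names (`lpc`, `retilt`, `high_band_ratio`) and the value `-0.5` are illustrative assumptions only.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LP coefficients a[0..order], with a[0] = 1."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    r[0] += 1e-9  # guard against an all-zero frame
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for step i of the Levinson-Durbin recursion
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def retilt(frame, a_src, a_tgt):
    """Flatten the source tilt with |A_src| and impose the target tilt 1/|A_tgt|."""
    n = len(frame)
    spec = np.fft.rfft(frame)
    gain = np.abs(np.fft.rfft(a_src, n)) / (np.abs(np.fft.rfft(a_tgt, n)) + 1e-12)
    out = np.fft.irfft(spec * gain, n)
    # Keep the frame energy unchanged so only the spectral balance shifts
    out *= np.sqrt(np.sum(frame ** 2) / (np.sum(out ** 2) + 1e-12))
    return out

def high_band_ratio(frame):
    """Fraction of frame energy in the upper half of the spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    return power[len(power) // 2:].sum() / power.sum()

# Synthetic stand-in for a "normal style" frame: noise with a steep downward tilt.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
normal = np.zeros_like(noise)
for n in range(1, len(noise)):
    normal[n] = 0.9 * normal[n - 1] + noise[n]

a_src = lpc(normal, order=1)
# Hypothetical flatter target tilt; in the paper this would come from the mapping network.
a_tgt = np.array([1.0, -0.5])
lombard_like = retilt(normal, a_src, a_tgt)

print(high_band_ratio(normal), high_band_ratio(lombard_like))
```

Because the output is rescaled to the input energy, the only change is that energy moves toward higher frequencies, which is the qualitative effect associated with Lombard speech. A real system would operate frame by frame on recorded speech and would predict the target filter instead of fixing it.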

Notes

  1. [25] is the original study by Etienne Lombard, who discovered the Lombard reflex. It is a French-language reference with no electronic version available.

  2. https://tensorflow.google.cn/lite/

  3. https://developer.android.google.cn/training/articles/perf-jni?hl=en

References

  1. Alghamdi N, Maddock S, Marxer R, Barker J, Brown GJ (2018) A corpus of audio-visual Lombard speech with frontal and profile views. J Acoust Soc Am 143:EL523–EL529. [Available]: https://datashare.is.ed.ac.uk/handle/10283/347


  2. ANSI (1997) American national standard methods for calculation of the speech intelligibility index. American National Standards Institute, ANSI S3.5-1997

  3. AVS (2010) Information technology - Advanced coding of audio and video - Part 10: Mobile speech and audio (GB/T20090.10-2013). National Standards of the People’s Republic of China

  4. Chen J, Benesty J, Huang Y, Doclo S (2006) New insights into the noise reduction Wiener filter. IEEE/ACM Trans Audio Speech Language Process 14(4):1218–1234

  5. Cooke M, King S, Garnier M, Aubanel V (2014) The listening talker: a review of human and algorithmic context-induced modifications of speech. Comput Speech Lang 28(2, SI):543–571


  6. Cooke M, Mayo C, Valentini-Botinhao C (2013) Intelligibility-enhancing speech modifications: the hurricane challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3552–3556

  7. Deng L, Yu D (2014) Deep learning: methods and applications. Now Publishers Inc., Boston


  8. Ellis D (2003) Dynamic time warp (DTW) in MATLAB. [Available]: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/

  9. Gao L, Hu R, Yang Y (2014) A spatial priority based scalable audio coding. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 3670–3674

  10. Garnier M, Henrich N (2014) Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Comput Speech Lang 28(2, SI):580–597

  11. Huber R, Ooster J, Meyer BT (2018) Single-ended speech quality prediction based on automatic speech recognition. J Audio Eng Soc 66(10):759–769


  12. ITU-T (1996) Recommendation P.800: Methods for subjective determination of transmission quality

  13. Jensen TL, Giacobello D, van Waterschoot T, Christensen MG (2016) Fast algorithms for high-order sparse linear prediction with applications to speech processing. Speech Comm 76:143–156


  14. Jokinen E, Alku P (2017) Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network. J Acoust Soc Am 141(4):EL327–EL330


  15. Jokinen E, Remes U, Alku P (2015) Comparison of Gaussian process regression and Gaussian mixture models in spectral tilt modelling for intelligibility enhancement of telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 85–89

  16. Jokinen E, Remes U, Alku P (2017) Intelligibility enhancement of telephone speech using Gaussian process regression for normal-to-Lombard spectral tilt conversion. IEEE/ACM Trans Audio Speech Language Process 25(10):1985–1996


  17. Jokinen E, Remes U, Takanen M, Palomäki K, Kurimo M, Alku P (2014) Spectral tilt modelling with GMMs for intelligibility enhancement of narrowband telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 2036–2040

  18. Junqua JC (1991) The influence of psychoacoustic and psycholinguistic factors on listener judgments of intelligibility of normal and Lombard speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1, pp 361–364

  19. Junqua JC, Fincke S, Field K (1999) The Lombard effect: A reflex to better communicate with others in noise. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 2083–2086

  20. Kakouros S, Räsänen O, Alku P (2018) Comparison of spectral tilt measures for sentence prominence in speech: effects of dimensionality and adverse noise conditions. Speech Comm 103:11–26

  21. Kleijn WB, Crespo JB, Hendriks RC, Petkov PN, Sauert B, Vary P (2015) Optimizing speech intelligibility in a noisy environment: a unified view. IEEE Signal Proc Mag 32(2):43–54


  22. Kodrasi I, Cauchi B, Goetze S, Doclo S (2017) Instrumental and perceptual evaluation of dereverberation techniques based on robust acoustic multichannel equalization. J Audio Eng Soc 65(1/2):117–129


  23. Koutsogiannaki M, Francois H, Choo K, Oh E (2017) Real-time modulation enhancement of temporal envelopes for increasing speech intelligibility. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1973–1977

  24. Koutsogiannaki M, Stylianou Y (2014) Simple and artefact-free spectral modifications for enhancing the intelligibility of casual speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

  25. Lombard E (1911) Le signe de l’élévation de la voix. Ann Mal de L’Oreille et du Larynx, pp 101–119

  26. López AR, Seshadri S, Juvela L, Räsänen O, Alku P (2017) Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1363–1367

  27. Lu Y, Cooke M (2009) The contribution of changes in f0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Comm 51(12):1253–1262


  28. Petkov PN, Kleijn WB (2015) Spectral dynamics recovery for enhanced speech intelligibility in noise. IEEE/ACM Trans Audio Speech Language Process 23(2):327–338

  29. Rabiner LR, Schafer RW (2011) Theory and applications of digital speech processing. Pearson, Upper Saddle River


  30. Schepker H, Rennies J, Doclo S (2013) Improving speech intelligibility in noise by SII-dependent preprocessing using frequency-dependent amplification and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3577–3581

  31. Sołoducha M, Raake A, Kettler F, Voigt P (2016) Lombard speech database for German language. In: Proceedings of German Annual Conference on Acoustics (DAGA). [Available]: http://spandh.dcs.shef.ac.uk/avlombard/

  32. Taal CH, Hendriks RC, Heusdens R (2014) Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput Speech Lang 28(4):858–872


  33. Taal CH, Jensen J (2013) SII-Based speech preprocessing for intelligibility improvement in noise. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3582–3586

  34. Varga A, Steeneken H (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm 12(3):247–251


  35. Wang X, Wang Y, Hang B (2013) Application of AVS-p10 mobile speech and audio coding in social multimedia. In: Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS), pp 101–104

  36. Wen Z, Tao Z, Liang Z, Hai Z (2010) Performance analysis and evaluation of AVS-m audio coding. In: Proceedings of the International Conference on Audio, Language and Image Processing, Proceedings (ICALIP), pp 31–36

  37. Zhang R, Hu R, Li G, Wang X (2019) Spectral tilt estimation for speech intelligibility enhancement using RNN based on all-pole model. In: Proceedings of the International Conference on Multimedia Modeling (MMM), pp 144–156

  38. Zorilă TC, Kandia V, Stylianou Y (2012) Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 634–637

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61801334 and U1736206) and the National Key Research and Development Program of China (Grant No. 2017YFB1002803).

Author information

Corresponding author

Correspondence to Ruimin Hu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, G., Hu, R., Zhang, R. et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement. Multimed Tools Appl 79, 19471–19491 (2020). https://doi.org/10.1007/s11042-020-08838-1

  • DOI: https://doi.org/10.1007/s11042-020-08838-1
