Abstract
Environmental noise degrades speech intelligibility during phone calls. Even when the far-end speech signal is clean, a listener in a noisy environment may still struggle to understand it. Intelligibility enhancement (IENH) is a class of perceptual enhancement techniques for clean speech rendered in noisy environments. This study focuses on IENH through normal-to-Lombard speech conversion, which is inspired by the Lombard reflex. The key step in this conversion is mapping the spectral tilt of normal speech (normal style) to that of Lombard speech (Lombard style). To map the spectral tilt, we propose a model that combines linear-prediction-based mapping networks with tilt modification. In contrast to previous studies, we use deep neural networks (DNNs) instead of Gaussian-based models for higher-dimensional mapping, and we add a tilt modification module to further reduce the mapping errors of formant magnitudes. We use the AVS-M codec and two datasets as the benchmark platform. The evaluation shows that our method outperforms the reference methods in both objective and subjective experiments.
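To make the notion of a linear-prediction-based spectral tilt concrete, the following is a minimal sketch, not the method of this paper. It assumes the tilt is summarised by a low-order all-pole fit obtained with the autocorrelation method and the Levinson-Durbin recursion. The function names (`lp_coefficients`, `envelope_db`, `synthetic_frame`), the model order, and the synthetic AR(1) test signal are all illustrative assumptions; the abstract does not specify them, and the DNN mapping and tilt modification stages are not reproduced here.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin.

    Returns a with a[0] == 1, so that A(z) = sum_k a[k] z^-k and the
    all-pole envelope is 1 / A(z). The order is an assumption of this
    sketch, not a value taken from the paper.
    """
    windowed = frame * np.hamming(len(frame))
    n = len(windowed)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(windowed[:n - k], windowed[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def envelope_db(a, n_points=256):
    """Magnitude response in dB of 1 / A(z) on frequencies in [0, pi]."""
    w = np.linspace(0.0, np.pi, n_points)
    k = np.arange(len(a))
    A = np.exp(-1j * np.outer(w, k)) @ a
    return 20.0 * np.log10(1.0 / np.abs(A))


def synthetic_frame(pole, n=4000, seed=0):
    """Toy AR(1) signal x[n] = pole * x[n-1] + e[n], standing in for speech."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = pole * x[i - 1] + e[i]
    return x


if __name__ == "__main__":
    steep = lp_coefficients(synthetic_frame(0.9), order=1)
    flat = lp_coefficients(synthetic_frame(0.5), order=1)
    print("steeper tilt, a1 =", steep[1])
    print("flatter tilt, a1 =", flat[1])
```

In this toy setting, a first-order coefficient closer to zero corresponds to a flatter envelope, which is the direction in which Lombard speech is commonly reported to differ from normal speech. A normal-to-Lombard mapping would then operate on such coefficients, or on a higher-order representation, rather than on the waveform directly.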
Notes
Reference [25] is the original study by Etienne Lombard, who discovered the Lombard reflex. It is a French-language reference, and no electronic version is available.
References
Alghamdi N, Maddock S, Marxer R, Barker J, Brown GJ (2018) A corpus of audio-visual Lombard speech with frontal and profile views. J Acoust Soc Am 143:EL523–EL529. [Available]: https://datashare.is.ed.ac.uk/handle/10283/347
ANSI (1997) American national standard methods for calculation of the speech intelligibility index. American National Standards Institute, S3.5-1997
AVS (2010) Information technology - Advanced coding of audio and video - Part 10: Mobile speech and audio (GB/T20090.10-2013). National Standards of the People’s Republic of China
Chen J, Benesty J, Huang Y, Doclo S (2006) New insights into the noise reduction Wiener filter. IEEE/ACM Trans Audio Speech Language Process 14(4):1218–1234
Cooke M, King S, Garnier M, Aubanel V (2014) The listening talker: a review of human and algorithmic context-induced modifications of speech. Comput Speech Lang 28(2, SI):543–571
Cooke M, Mayo C, Valentini-Botinhao C (2013) Intelligibility-enhancing speech modifications: the hurricane challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3552–3556
Deng L, Yu D (2014) Deep learning: methods and applications. Now Publishers Inc., Boston
Ellis D (2003) Dynamic time warp (DTW) in MATLAB. [Available]: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
Gao L, Hu R, Yang Y (2014) A spatial priority based scalable audio coding. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 3670–3674
Garnier M, Henrich N (2014) Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Comput Speech Lang 28(2, SI):580–597
Huber R, Ooster J, Meyer BT (2018) Single-ended speech quality prediction based on automatic speech recognition. J Audio Eng Soc 66(10):759–769
ITU-T (1996) Recommendation P.800: Methods for subjective determination of transmission quality
Jensen TL, Giacobello D, van Waterschoot T, Christensen MG (2016) Fast algorithms for high-order sparse linear prediction with applications to speech processing. Speech Comm 76:143–156
Jokinen E, Alku P (2017) Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network. J Acoust Soc Am 141(4):EL327–EL330
Jokinen E, Remes U, Alku P (2015) Comparison of Gaussian process regression and Gaussian mixture models in spectral tilt modelling for intelligibility enhancement of telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 85–89
Jokinen E, Remes U, Alku P (2017) Intelligibility enhancement of telephone speech using Gaussian process regression for normal-to-Lombard spectral tilt conversion. IEEE/ACM Trans Audio Speech Language Process 25(10):1985–1996
Jokinen E, Remes U, Takanen M, Palomäki K, Kurimo M, Alku P (2014) Spectral tilt modelling with GMMs for intelligibility enhancement of narrowband telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 2036–2040
Junqua JC (1991) The influence of psychoacoustic and psycholinguistic factors on listener judgments of intelligibility of normal and Lombard speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1, pp 361–364
Junqua JC, Fincke S, Field K (1999) The Lombard effect: A reflex to better communicate with others in noise. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 2083–2086
Kakouros S, Räsänen O, Alku P (2018) Comparison of spectral tilt measures for sentence prominence in speech-effects of dimensionality and adverse noise conditions. Speech Comm 103:11–26
Kleijn WB, Crespo JB, Hendriks RC, Petkov PN, Sauert B, Vary P (2015) Optimizing speech intelligibility in a noisy environment: a unified view. IEEE Signal Proc Mag 32(2):43–54
Kodrasi I, Cauchi B, Goetze S, Doclo S (2017) Instrumental and perceptual evaluation of dereverberation techniques based on robust acoustic multichannel equalization. J Audio Eng Soc 65(1/2):117–129
Koutsogiannaki M, Francois H, Choo K, Oh E (2017) Real-time modulation enhancement of temporal envelopes for increasing speech intelligibility. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1973–1977
Koutsogiannaki M, Stylianou Y (2014) Simple and artefact-free spectral modifications for enhancing the intelligibility of casual speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Lombard E (1911) Le signe de l’élévation de la voix. Ann Mal de L’Oreille et du Larynx, pp 101–119
López AR, Seshadri S, Juvela L, Räsänen O, Alku P (2017) Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1363–1367
Lu Y, Cooke M (2009) The contribution of changes in f0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Comm 51(12):1253–1262
Petkov PN, Kleijn WB (2015) Spectral dynamics recovery for enhanced speech intelligibility in noise. IEEE/ACM Trans Audio Speech Language Process 23(2):327–338
Rabiner LR, Schafer RW (2011) Theory and applications of digital speech processing. Pearson, Upper Saddle River
Schepker H, Rennies J, Doclo S (2013) Improving speech intelligibility in noise by SII-dependent preprocessing using frequency-dependent amplification and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3577–3581
Sołoducha M, Raake A, Kettler F, Voigt P (2016) Lombard speech database for German language. In: Proceedings of German Annual Conference on Acoustics (DAGA). [Available]: http://spandh.dcs.shef.ac.uk/avlombard/
Taal CH, Hendriks RC, Heusdens R (2014) Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput Speech Lang 28(4):858–872
Taal CH, Jensen J (2013) SII-Based speech preprocessing for intelligibility improvement in noise. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3582–3586
Varga A, Steeneken H (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm 12(3):247–251
Wang X, Wang Y, Hang B (2013) Application of AVS-p10 mobile speech and audio coding in social multimedia. In: Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS), pp 101–104
Wen Z, Tao Z, Liang Z, Hai Z (2010) Performance analysis and evaluation of AVS-m audio coding. In: Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP), pp 31–36
Zhang R, Hu R, Li G, Wang X (2019) Spectral tilt estimation for speech intelligibility enhancement using RNN based on all-pole model. In: Proceedings of the International Conference on Multimedia Modeling (MMM), pp 144–156
Zorilă TC, Kandia V, Stylianou Y (2012) Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 634–637
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61801334 and U1736206) and the National Key Research and Development Program of China (Grant No. 2017YFB1002803).
Cite this article
Li, G., Hu, R., Zhang, R. et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement. Multimed Tools Appl 79, 19471–19491 (2020). https://doi.org/10.1007/s11042-020-08838-1
DOI: https://doi.org/10.1007/s11042-020-08838-1