A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement

Li, Gang; Hu, Ruimin; Zhang, Rui; Wang, Xiaochen

doi:10.1007/s11042-020-08838-1

Published: 24 March 2020

Volume 79, pages 19471–19491, (2020)

Multimedia Tools and Applications

Abstract

Environmental noise degrades speech intelligibility during phone calls. Even though the phone receives a clean speech signal, a listener in a noisy environment still struggles to understand it. Intelligibility enhancement (IENH) is a perceptual enhancement technique for clean speech rendered in noisy environments. This study focuses on IENH by normal-to-Lombard speech conversion, which is inspired by the Lombard reflex. The key step in this conversion is mapping the spectral tilt of normal speech (normal style) to that of Lombard speech (Lombard style). To map the spectral tilt, we propose a model that combines linear-prediction-based mapping networks with tilt modification. Unlike previous studies, we use deep neural networks (DNNs) instead of Gaussian-based models for higher-dimensional mapping, and we add a tilt modification module to further reduce the mapping errors of formant magnitudes. We use the AVS-M codec and two datasets as the benchmark platform. The evaluation shows that our method outperforms the reference methods in both objective and subjective experiments.
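As a rough illustration of what "spectral tilt" and "tilt modification" mean in a linear-prediction setting, the sketch below estimates the tilt of a frame with a first-order linear predictor and then re-shapes the frame toward a different tilt. It is not the model proposed in the paper: the DNN mapping networks are replaced by a hand-picked target coefficient, and the function names `tilt_coeff` and `retilt` as well as the values 0.9 and 0.5 are assumptions chosen only for this example.

```python
def autocorr(frame, lag):
    """Autocorrelation of one frame at a single lag (no normalisation)."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def tilt_coeff(frame):
    """First-order linear-prediction coefficient a1 = r[1] / r[0].

    A value near 1 indicates a steep low-pass tilt; a value near 0
    indicates a flat spectrum. Returns 0.0 for a silent frame.
    """
    r0 = autocorr(frame, 0)
    if r0 == 0.0:
        return 0.0
    return autocorr(frame, 1) / r0


def retilt(frame, a_src, a_tgt):
    """Filter with H(z) = (1 - a_src z^-1) / (1 - a_tgt z^-1).

    The numerator removes the estimated source tilt and the denominator
    imposes the target tilt. In a real system a_tgt would come from a
    trained mapping model; here it is a hypothetical constant.
    """
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for sample in frame:
        y = sample - a_src * prev_in + a_tgt * prev_out
        out.append(y)
        prev_in, prev_out = sample, y
    return out


if __name__ == "__main__":
    # Synthetic frame with a steep tilt: impulse response of 1 / (1 - 0.9 z^-1).
    normal = [0.9 ** n for n in range(200)]
    a_normal = tilt_coeff(normal)
    # Hypothetical flatter target tilt, standing in for a predicted Lombard tilt.
    lombard_like = retilt(normal, a_normal, 0.5)
    print("source tilt coefficient:", a_normal)
    print("converted tilt coefficient:", tilt_coeff(lombard_like))
```

A first-order predictor captures only the overall spectral slope. The abstract refers to higher-dimensional mapping, which would involve more coefficients than this single-parameter sketch.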

Notes

[25] is the original study by Etienne Lombard, the discoverer of the Lombard reflex. It is a French-language reference with no electronic version.
https://tensorflow.google.cn/lite/
https://developer.android.google.cn/training/articles/perf-jni?hl=en

References

Alghamdi N, Maddock S, Marxer R, Barker J, Brown GJ (2018) A corpus of audio-visual Lombard speech with frontal and profile views. J Acoust Soc Am 143:EL523–EL529. [Available]: https://datashare.is.ed.ac.uk/handle/10283/347
ANSI (1997) American national standard methods for calculation of the speech intelligibility index. American National Standards Institute, S3.5-1997
AVS (2010) Information technology - Advanced coding of audio and video - Part 10: Mobile speech and audio (GB/T20090.10-2013). National Standards of the People’s Republic of China
Chen J, Benesty J, Huang Y, Doclo S (2006) New insights into the noise reduction Wiener filter. IEEE/ACM Trans Audio Speech Language Process 14(4):1218–1234
Cooke M, King S, Garnier M, Aubanel V (2014) The listening talker: a review of human and algorithmic context-induced modifications of speech. Comput Speech Lang 28(2, SI):543–571
Cooke M, Mayo C, Valentini-Botinhao C (2013) Intelligibility-enhancing speech modifications: the hurricane challenge. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3552–3556
Deng L, Yu D (2014) Deep learning: methods and applications. Now Publishers Inc., Boston
Ellis D (2003) Dynamic time warp (DTW) in MATLAB. [Available]: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
Gao L, Hu R, Yang Y (2014) A spatial priority based scalable audio coding. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 3670–3674
Garnier M, Henrich N (2014) Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Comput Speech Lang 28(2, SI):580–597
Huber R, Ooster J, Meyer BT (2018) Single-ended speech quality prediction based on automatic speech recognition. J Audio Eng Soc 66(10):759–769
ITU-T (1996) Recommendation P.800: Methods for subjective determination of transmission quality
Jensen TL, Giacobello D, van Waterschoot T, Christensen MG (2016) Fast algorithms for high-order sparse linear prediction with applications to speech processing. Speech Comm 76:143–156
Jokinen E, Alku P (2017) Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network. J Acoust Soc Am 141(4):EL327–EL330
Jokinen E, Remes U, Alku P (2015) Comparison of Gaussian process regression and Gaussian mixture models in spectral tilt modelling for intelligibility enhancement of telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 85–89
Jokinen E, Remes U, Alku P (2017) Intelligibility enhancement of telephone speech using Gaussian process regression for normal-to-Lombard spectral tilt conversion. IEEE/ACM Trans Audio Speech Language Process 25(10):1985–1996
Jokinen E, Remes U, Takanen M, Palomäki K, Kurimo M, Alku P (2014) Spectral tilt modelling with GMMs for intelligibility enhancement of narrowband telephone speech. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 2036–2040
Junqua JC (1991) The influence of psychoacoustic and psycholinguistic factors on listener judgments of intelligibility of normal and Lombard speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 1, pp 361–364
Junqua JC, Fincke S, Field K (1999) The Lombard effect: A reflex to better communicate with others in noise. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp 2083–2086
Kakouros S, Räsänen O, Alku P (2018) Comparison of spectral tilt measures for sentence prominence in speech-effects of dimensionality and adverse noise conditions. Speech Comm 103:11–26
Kleijn WB, Crespo JB, Hendriks RC, Petkov PN, Sauert B, Vary P (2015) Optimizing speech intelligibility in a noisy environment: a unified view. IEEE Signal Proc Mag 32(2):43–54
Kodrasi I, Cauchi B, Goetze S, Doclo S (2017) Instrumental and perceptual evaluation of dereverberation techniques based on robust acoustic multichannel equalization. J Audio Eng Soc 65(1/2):117–129
Koutsogiannaki M, Francois H, Choo K, Oh E (2017) Real-time modulation enhancement of temporal envelopes for increasing speech intelligibility. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1973–1977
Koutsogiannaki M, Stylianou Y (2014) Simple and artefact-free spectral modifications for enhancing the intelligibility of casual speech. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Lombard E (1911) Le signe de l’élévation de la voix. Ann Mal de L’Oreille et du Larynx, pp 101–119
López AR, Seshadri S, Juvela L, Räsänen O, Alku P (2017) Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1363–1367
Lu Y, Cooke M (2009) The contribution of changes in f0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Comm 51(12):1253–1262
Petkov PN, Kleijn WB (2015) Spectral dynamics recovery for enhanced speech intelligibility in noise. IEEE/ACM Trans Audio Speech Language Process 23(2):327–338
Rabiner LR, Schafer RW (2011) Theory and applications of digital speech processing. Pearson, Upper Saddle River
Schepker H, Rennies J, Doclo S (2013) Improving speech intelligibility in noise by SII-dependent preprocessing using frequency-dependent amplification and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3577–3581
Sołoducha M, Raake A, Kettler F, Voigt P (2016) Lombard speech database for German language. In: Proceedings of German Annual Conference on Acoustics (DAGA). [Available]: http://spandh.dcs.shef.ac.uk/avlombard/
Taal CH, Hendriks RC, Heusdens R (2014) Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput Speech Lang 28(4):858–872
Taal CH, Jensen J (2013) SII-Based speech preprocessing for intelligibility improvement in noise. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 3582–3586
Varga A, Steeneken H (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm 12(3):247–251
Wang X, Wang Y, Hang B (2013) Application of AVS-p10 mobile speech and audio coding in social multimedia. In: Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS), pp 101–104
Wen Z, Tao Z, Liang Z, Hai Z (2010) Performance analysis and evaluation of AVS-m audio coding. In: Proceedings of the International Conference on Audio, Language and Image Processing, Proceedings (ICALIP), pp 31–36
Zhang R, Hu R, Li G, Wang X (2019) Spectral tilt estimation for speech intelligibility enhancement using RNN based on all-pole model. In: Proceedings of the International Conference on Multimedia Modeling (MMM), pp 144–156
Zorilă TC, Kandia V, Stylianou Y (2012) Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 634–637

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61801334 and U1736206) and the National Key Research and Development Program of China (Grant No. 2017YFB1002803).

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, China
Gang Li, Ruimin Hu, Rui Zhang & Xiaochen Wang
Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, 430072, China
Ruimin Hu
Collaborative Innovation Center of Geospatial Technology, Wuhan, 430079, China
Xiaochen Wang

Corresponding author

Correspondence to Ruimin Hu.

About this article

Cite this article

Li, G., Hu, R., Zhang, R. et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement. Multimed Tools Appl 79, 19471–19491 (2020). https://doi.org/10.1007/s11042-020-08838-1

Received: 29 April 2019
Revised: 08 January 2020
Accepted: 09 March 2020
Published: 24 March 2020
Issue Date: July 2020
DOI: https://doi.org/10.1007/s11042-020-08838-1
