Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise

Manzini, Thomas; Black, Alan

doi:10.1007/978-3-319-99579-3_39

Thomas Manzini¹⁶ &
Alan Black¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11096))

Included in the following conference series:

International Conference on Speech and Computer

1386 Accesses

Abstract

This paper explores how different synthetic speech systems can be understood in a noisy environment that resembles radio noise. This work is motivated by a need for intelligible speech in noisy environments such as emergency response and disaster notification. We discuss prior work done on listening tasks as well as speech in noise. We analyze three different speech synthesizers in three different noise settings. We measure quantitatively the intelligibility of each synthesizer in each noise setting based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we present how word level and sentence level input choices can lead to increased or decreased listener error rates for synthesized speech.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Black, A.W., Lenzo, K.A.: Flite: a small fast run-time synthesis engine. In: 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis (2001)
Google Scholar
Cooke, M.: A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. 119(3), 1562–1573 (2006)
Article Google Scholar
Dau, T., Püschel, D., Kohlrausch, A.: A quantitative model of the “effective” signal processing in the auditory system. i. model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996)
Article Google Scholar
Davies, M.: The corpus of contemporary American English (Coca): 450 million words, 1990–2012. Brigham Young University (2002)
Google Scholar
Duddington, J.: eSpeak text to speech (2012)
Google Scholar
Durette, P.N.: gTTS: a python interface for google’s text to speech api (2017). https://github.com/pndurette/gTTS. Accessed 15 Apr 2018
Fiedrich, F., Burghardt, P.: Agent-based systems for disaster management. Commun. ACM 50(3), 41–42 (2007)
Article Google Scholar
Imran, M., Castillo, C., Lucas, J., Meier, P., Vieweg, S.: AIDR: Artificial intelligence for disaster response. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 159–162. ACM (2014)
Google Scholar
Kamath, S., Loizou, P.: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In: ICASSP, vol. 4, pp. 44164–44164. Citeseer (2002)
Google Scholar
Killion, M.C., Niquette, P.A., Gudmundsen, G.I., Revit, L.J., Banerjee, S.: Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 116(4), 2395–2405 (2004)
Article Google Scholar
McAulay, R., Malpass, M.: Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. 28(2), 137–145 (1980)
Article Google Scholar
Park, Y., Patwardhan, S., Visweswariah, K., Gates, S.C.: An empirical analysis of word error rate and keyword error rate. In: Ninth Annual Conference of the International Speech Communication Association (2008)
Google Scholar
Pichora-Fuller, M.K., Schneider, B.A., Daneman, M.: How young and old adults listen to and remember speech in noise. J. Acoust. Soc. Am. 97(1), 593–608 (1995)
Article Google Scholar
Ravichander, A., Manzini, T., Grabmair, M., Neubig, G., Francis, J., Nyberg, E.: How would you say it? eliciting lexically diverse dialogue for supervised semantic parsing. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 374–383 (2017)
Google Scholar
Schmidt-Nielsen, A.: Intelligibility and acceptability testing for speech technology. Technical report, Naval Research Lab, Washington DC (1992)
Google Scholar
Valentini-Botinhao, C., Yamagishi, J., King, S.: Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In: Twelfth Annual Conference of the International Speech Communication Association (2011)
Google Scholar
Valentini-Botinhao, C., Yamagishi, J., King, S.: Evaluation of objective measures for intelligibility prediction of hmm-based synthetic speech in noise. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5112–5115. IEEE (2011)
Google Scholar
Varga, A., Steeneken, H.J.: Assessment for automatic speech recognition: Ii. noisex-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)
Article Google Scholar
Wang, Y.Y., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003, pp. 577–582. IEEE (2003)
Google Scholar

Download references

Acknowledgments

We would like to acknowledge several people for their help and support on this work. Particularly Carolyn Penstein, Rajat Kulshreshtha, Abhilasha Ravichander, and the officers of CMU EMS. As well as the several people who helped edit this work, especially Elise Romberger. Finally, thank you to reviewers reading and examining our experiments, methodology, and submission.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Thomas Manzini & Alan Black

Authors

Thomas Manzini
View author publications
You can also search for this author in PubMed Google Scholar
Alan Black
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Thomas Manzini or Alan Black .

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Manzini, T., Black, A. (2018). Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science(), vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_39

Download citation

DOI: https://doi.org/10.1007/978-3-319-99579-3_39
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics