Abstract
This paper explores how intelligible different synthetic speech systems are in a noisy environment that resembles radio noise. The work is motivated by the need for intelligible speech in noisy settings such as emergency response and disaster notification. We discuss prior work on listening tasks and on speech in noise. We analyze three speech synthesizers in three noise settings and quantitatively measure the intelligibility of each synthesizer in each setting, based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we show how word-level and sentence-level input choices can increase or decrease listener error rates for synthesized speech.
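The abstract does not say which error metric underlies the listener error rates. As a minimal sketch, assuming a word error rate (WER) style measure, a listener's transcription can be scored against the sentence given to the synthesizer as shown here. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative assumption only: the paper's exact scoring procedure
    is not given in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missed by the listener
            insertion = curr[j - 1] + 1  # extra word written by the listener
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical prompt and listener transcription.
    prompt = "send help to the north bridge"
    heard = "send help to north ridge"
    print(f"WER: {word_error_rate(prompt, heard):.3f}")
```

Averaging such scores over listeners for each synthesizer and noise setting would give one intelligibility figure per condition, which is the kind of comparison the abstract describes.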
Acknowledgments
We would like to acknowledge several people for their help and support on this work, particularly Carolyn Penstein, Rajat Kulshreshtha, Abhilasha Ravichander, and the officers of CMU EMS. We also thank the several people who helped edit this work, especially Elise Romberger. Finally, we thank the reviewers for reading and examining our experiments, methodology, and submission.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Manzini, T., Black, A. (2018). Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_39
DOI: https://doi.org/10.1007/978-3-319-99579-3_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)