Abstract
This paper presents easy-to-use modifications to the unit-selection speech-synthesis algorithm for voices built from audiobooks. Audiobooks are a very good source of large amounts of high-quality audio data for speech synthesis; however, they usually do not meet the basic requirements of standard unit-selection synthesis: "neutral" speech with no expressive or spontaneous expressions, stable prosodic patterns, careful pronunciation, and a consistent voice style throughout the recording. If these conditions are taken into consideration, however, a few modifications can be made to adjust the general unit-selection algorithm and make it more robust for synthesis from such audiobook data. A listening test shows that these adjustments increased perceived speech quality and acceptability compared to a baseline TTS system. The modifications presented here also make it possible to exploit the variability of the audio data to control the pitch and tempo of the synthesized speech.
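To make the underlying mechanism concrete, the following is a minimal, illustrative sketch of generic unit selection (not the authors' exact method): each target slot has several candidate units, and a Viterbi search picks the sequence minimizing a target cost (distance from the requested pitch and duration) plus a concatenation cost (pitch discontinuity at each join). The unit fields (`f0`, `dur`) and the cost weights are assumptions for the example; biasing the target-cost weights is one simple way variability in the data can be exploited to steer pitch and tempo.

```python
# Illustrative unit-selection sketch (assumed data layout, not the paper's method).

def target_cost(unit, spec, w_f0=1.0, w_dur=1.0):
    # Penalize deviation from the requested pitch (f0) and tempo (duration);
    # raising w_f0 / w_dur biases selection toward the desired prosody.
    return w_f0 * abs(unit["f0"] - spec["f0"]) + w_dur * abs(unit["dur"] - spec["dur"])

def join_cost(prev_unit, unit):
    # Simple concatenation cost: pitch discontinuity at the unit boundary.
    return abs(prev_unit["f0"] - unit["f0"])

def select_units(candidates, specs):
    """candidates[i] = list of candidate units for slot i; specs[i] = desired prosody."""
    n = len(specs)
    # cost[i][j] = best cumulative cost of a path ending in candidates[i][j]
    cost = [[target_cost(u, specs[0]) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for u in candidates[i]:
            tc = target_cost(u, specs[i])
            best = min(range(len(candidates[i - 1])),
                       key=lambda j: cost[i - 1][j] + join_cost(candidates[i - 1][j], u))
            row.append(cost[i - 1][best] + join_cost(candidates[i - 1][best], u) + tc)
            brow.append(best)
        cost.append(row)
        back.append(brow)
    # Backtrack along the cheapest path.
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

With a low target f0 requested, the search prefers low-pitched candidates that also join smoothly; the same machinery accommodates the pitch/tempo control the abstract mentions by adjusting the requested specs.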
The work has been supported by the grant of the University of West Bohemia, project No. SGS-2016-039, and by the Technology Agency of the Czech Republic, project No. TA01011264.
Notes
1. In this paper, neutral speech means a news-broadcasting style, which is very often used by modern commercial TTS systems.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Vít, J., Matoušek, J. (2016). Unit-Selection Speech Synthesis Adjustments for Audiobook-Based Voices. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science(), vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5