Abstract
This paper examines the use of models with a variable number of states in HMM-based speech synthesis. It also gives implementation details for using such models in systems built on the HTS toolkit, which cannot directly handle models with unequal numbers of states; a workaround that enables this functionality is proposed. A data-driven method for determining the optimal number of states for individual models is proposed and experimentally tested on 4 large speech corpora. In a preference listening test focused on local differences, listeners preferred the proposed system over the traditional system with 5-state models, even though the proposed system is smaller (it has a lower total number of states).
This research was supported by the Czech Science Foundation (GA CR), project No. GA16-04420S. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme CESNET LM2015042, is greatly appreciated.
Notes
1. Praat: doing phonetics by computer, www.praat.org.
2. HMM-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp.
3. The detailed scheme of the training procedure is more complex, e.g. the reestimation and clustering of models are usually repeated twice.
4. A bug in the HTS toolkit ver. 2.2 (file HFB.c) had to be fixed to allow the use of 1-state models; otherwise, it did not work properly.
5. Names of HTS tools are stated here to specify the point of transition to 1-state models as precisely as possible.
6. However, the proposed methods are certainly not language-dependent.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hanzlíček, Z. (2017). Optimal Number of States in HMM-Based Speech Synthesis. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_40
DOI: https://doi.org/10.1007/978-3-319-64206-2_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science (R0)