Abstract
Deep neural networks (DNNs) have recently been introduced in speech synthesis. This paper presents an investigation into the importance of input features and training data for speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are investigated, and several training sets of different sizes (13.5, 3.6, and 1.5 h of speech) are evaluated.
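In SD DNN-based synthesis of the kind the abstract describes, a feedforward network performs frame-level regression from linguistic context features to acoustic parameters. The sketch below is purely illustrative: the layer sizes, activation choice, and feature dimensions are hypothetical assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    # Hidden layers use tanh; the output layer is linear, since the
    # network performs regression to continuous acoustic parameters.
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.tanh(h)
    return h

# Hypothetical dimensions: 300 linguistic input features per frame,
# two hidden layers, 127 acoustic output parameters.
dims = [300, 512, 512, 127]
layers = [init_layer(a, b) for a, b in zip(dims[:-1], dims[1:])]

x = rng.normal(size=300)   # one frame's linguistic features
y = forward(x, layers)     # predicted acoustic features
print(y.shape)             # (127,)
```

At synthesis time, one such prediction is made per frame, and the resulting acoustic trajectories are passed to a vocoder to generate the waveform.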
Acknowledgements
This work received funding from the Swiss National Science Foundation under the SIWIS project and was supported by the Eurostars Programme, powered by Eurostars and the European Community, under the project “D-Box: A generic dialog box for multi-lingual conversational applications”.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Lazaridis, A., Potard, B., Garner, P.N. (2015). DNN-Based Speech Synthesis: Importance of Input Features and Training Data. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science(), vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_24
Print ISBN: 978-3-319-23131-0
Online ISBN: 978-3-319-23132-7