Abstract
Deep neural networks (DNNs) have recently been introduced in speech synthesis. This paper presents an investigation into the importance of input features and training data for speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are investigated, and several training sets of different sizes (13.5, 3.6, and 1.5 h of speech) are evaluated.
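In SD DNN-based synthesis of the kind the abstract describes, a feedforward network performs frame-level regression from linguistic context features to acoustic parameters. The sketch below is purely illustrative: the layer sizes, activation choice, and feature dimensions are hypothetical assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    # Hidden layers use tanh; the output layer is linear, since the
    # network performs regression to continuous acoustic parameters.
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.tanh(h)
    return h

# Hypothetical dimensions: 300 linguistic input features per frame,
# two hidden layers, 127 acoustic output parameters.
dims = [300, 512, 512, 127]
layers = [init_layer(a, b) for a, b in zip(dims[:-1], dims[1:])]

x = rng.normal(size=300)   # one frame's linguistic features
y = forward(x, layers)     # predicted acoustic features
print(y.shape)             # (127,)
```

At synthesis time, one such prediction is made per frame, and the resulting acoustic trajectories are passed to a vocoder to generate the waveform.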
Acknowledgements
This work received funding from the Swiss National Science Foundation under the SIWIS project and was supported by the Eurostars Programme, powered by Eurostars and the European Community, under the project “D-Box: A generic dialog box for multi-lingual conversational applications”.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Lazaridis, A., Potard, B., Garner, P.N. (2015). DNN-Based Speech Synthesis: Importance of Input Features and Training Data. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science(), vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_24
Print ISBN: 978-3-319-23131-0
Online ISBN: 978-3-319-23132-7