DNN-Based Speech Synthesis: Importance of Input Features and Training Data

Conference paper, Speech and Computer (SPECOM 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9319)

Abstract

Deep neural networks (DNNs) have recently been introduced in speech synthesis. This paper presents an investigation of the importance of input features and training data in speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are examined, and several training sets of different sizes (13.5, 3.6 and 1.5 h of speech) are evaluated.
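The core model class investigated here maps frame-level linguistic input features to acoustic (vocoder) output features with a feedforward DNN. A minimal NumPy sketch of such a forward pass is given below; the layer sizes, feature dimensions and activation functions are hypothetical placeholders, since the page does not reproduce the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: binary/numeric linguistic features in,
# vocoder parameters (e.g. spectral coefficients, F0, voicing) out.
N_IN, N_HID, N_OUT = 305, 512, 127

def layer(n_in, n_out):
    """Randomly initialised weights and zero bias for one layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

W1, b1 = layer(N_IN, N_HID)
W2, b2 = layer(N_HID, N_HID)
W3, b3 = layer(N_HID, N_OUT)

def forward(x):
    """Map a batch of frame-level linguistic features to acoustic features."""
    h1 = np.tanh(x @ W1 + b1)    # hidden layer 1
    h2 = np.tanh(h1 @ W2 + b2)   # hidden layer 2
    return h2 @ W3 + b3          # linear output layer (regression)

X = rng.normal(size=(4, N_IN))   # a mini-batch of 4 frames
Y = forward(X)
print(Y.shape)                   # (4, 127)
```

In SD training, such a network would be fit per speaker by minimising a mean-squared error between predicted and natural acoustic frames; varying the amount of training speech (as in the 13.5/3.6/1.5 h sets studied here) changes how well the mapping generalises.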



Acknowledgements

This work has received funding from the Swiss National Science Foundation under the SIWIS project and was supported by Eurostars Programme powered by Eurostars and the European Community under the project “D-Box: A generic dialog box for multi-lingual conversational applications”.

Author information

Correspondence to Alexandros Lazaridis.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Lazaridis, A., Potard, B., Garner, P.N. (2015). DNN-Based Speech Synthesis: Importance of Input Features and Training Data. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science(), vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_24

  • DOI: https://doi.org/10.1007/978-3-319-23132-7_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23131-0

  • Online ISBN: 978-3-319-23132-7

  • eBook Packages: Computer Science, Computer Science (R0)
