Fine Vocoder Tuning for HMM-Based Speech Synthesis: Effect of the Analysis Window Length

  • Conference paper
Advances in Speech and Language Technologies for Iberian Languages

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8854)

Abstract

This paper studies how the length of the window used during spectral envelope estimation affects the perceptual quality of HMM-based speech synthesis. We show that the acoustic differences caused by varying the window length are audible. The experiments reveal an overall preference for short analysis windows, although longer windows seem to alleviate some artifacts related to training data scarcity.
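To make the window-length trade-off concrete, the Python sketch below estimates a smoothed log-magnitude envelope from one frame of a synthetic voiced signal, once with a short (10 ms) and once with a long (40 ms) Hann window. This page does not describe the authors' envelope estimator, so plain cepstral liftering is used here as a swapped-in stand-in; the synthetic signal, the 16 kHz sampling rate, the cepstral order, and the helper names `synthetic_voiced` and `cepstral_envelope` are all hypothetical choices made for illustration.

```python
import numpy as np

FS = 16000  # sampling rate in Hz (assumed for illustration)


def synthetic_voiced(f0=120.0, dur=1.0, fs=FS):
    """Hypothetical test signal: all harmonics of f0 below Nyquist, amplitude 1/k."""
    t = np.arange(int(dur * fs)) / fs
    n_harm = int((fs / 2) // f0)
    return sum(np.cos(2 * np.pi * k * f0 * t) / k for k in range(1, n_harm + 1))


def cepstral_envelope(x, center, win_len, n_fft=2048, n_ceps=30):
    """Cepstrally smoothed log-magnitude spectrum of one Hann-windowed frame.

    Stand-in estimator, not the paper's method. win_len is in samples.
    """
    half = win_len // 2
    frame = x[center - half:center - half + win_len] * np.hanning(win_len)
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
    ceps = np.fft.irfft(log_mag, n_fft)   # real cepstrum (symmetric)
    lifter = np.zeros(n_fft)
    lifter[:n_ceps] = 1.0                 # keep low quefrencies...
    lifter[n_fft - n_ceps + 1:] = 1.0     # ...and their mirrored counterparts
    return np.fft.rfft(ceps * lifter).real  # smoothed log-magnitude envelope


x = synthetic_voiced()
env_short = cepstral_envelope(x, center=8000, win_len=160)  # 10 ms at 16 kHz
env_long = cepstral_envelope(x, center=8000, win_len=640)   # 40 ms at 16 kHz
print("max envelope difference (log units):", np.max(np.abs(env_short - env_long)))
```

With a 120 Hz fundamental, the 10 ms window spans less than two pitch periods, so the harmonics are not resolved and the frame spectrum is smeared; the 40 ms window resolves them but averages over a longer stretch of signal, which in real speech contains more temporal variation. The sketch only shows that the two settings yield different envelopes; it makes no claim about which one sounds better.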

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Alonso, A., Erro, D., Navas, E., Hernaez, I. (2014). Fine Vocoder Tuning for HMM-Based Speech Synthesis: Effect of the Analysis Window Length. In: Navarro Mesa, J.L., et al. Advances in Speech and Language Technologies for Iberian Languages. Lecture Notes in Computer Science, vol 8854. Springer, Cham. https://doi.org/10.1007/978-3-319-13623-3_3

  • DOI: https://doi.org/10.1007/978-3-319-13623-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13622-6

  • Online ISBN: 978-3-319-13623-3

  • eBook Packages: Computer Science, Computer Science (R0)