
Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model

Journal of Signal Processing Systems

Abstract

This paper addresses intonation synthesis that combines statistical and generative models to manipulate fundamental frequency (F0) contours within the framework of HMM-based speech synthesis. In light of the Fujisaki model, an F0 contour is represented on a logarithmic scale as a superposition of micro, accent, and register components. The three component sets are extracted from a speech corpus by a pitch decomposition algorithm based on a functional F0 model, and a separate context-dependent HMM (CDHMM) is trained for each component. At synthesis time, the CDHMM-generated micro, accent, and register components are superimposed to form the F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, the method improves naturalness by achieving better local and global F0 behavior. It also exhibits a link between phonology and phonetics, which makes it possible to control intonation flexibly by using given marking information to manipulate the parameters of the functional F0 model on the fly.
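The superposition step described in the abstract can be illustrated with a minimal Python sketch. It assumes only what the abstract states: the three components are added in the logarithmic domain. The function name `superimpose_log_f0`, the natural-log convention, and the toy component values are hypothetical choices made for illustration; they are not the paper's functional F0 model or its decomposition algorithm.

```python
import math


def superimpose_log_f0(micro, accent, register):
    """Sum three per-frame log-F0 component tracks and return F0 in Hz.

    Assumption: each track holds natural-log values, one per frame.
    """
    if not (len(micro) == len(accent) == len(register)):
        raise ValueError("component tracks must have the same number of frames")
    # Addition in the log domain corresponds to multiplication in Hz.
    return [math.exp(m + a + r) for m, a, r in zip(micro, accent, register)]


# Hypothetical toy components over five frames (illustrative values only).
register = [math.log(120.0)] * 5        # flat baseline level at 120 Hz
accent = [0.0, 0.1, 0.2, 0.1, 0.0]      # local rise and fall
micro = [0.0, -0.01, 0.0, 0.01, 0.0]    # small segmental perturbation

f0_hz = superimpose_log_f0(micro, accent, register)
```

Because the components are summed in the log domain, scaling one of them (for example the accent track) changes the contour multiplicatively in Hz, which is one way to picture how component-wise control of intonation could work.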



Author information


Correspondence to Jinfu Ni.


About this article


Cite this article

Ni, J., Shiga, Y. & Hori, C. Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model. J Sign Process Syst 82, 273–286 (2016). https://doi.org/10.1007/s11265-015-1011-7


  • DOI: https://doi.org/10.1007/s11265-015-1011-7
