Towards Realizing Mandarin-Tibetan Bi-lingual Emotional Speech Synthesis with Mandarin Emotional Training Corpus

Wu, Peiwen; Yang, Hongwu; Gan, Zhenye

doi:10.1007/978-981-10-6388-6_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 728)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

This paper presents a hidden Markov model (HMM)-based method for Mandarin-Tibetan bi-lingual emotional speech synthesis that uses speaker adaptive training with a Mandarin emotional speech corpus. First, a one-speaker Tibetan neutral speech corpus, a multi-speaker Mandarin neutral speech corpus, and a multi-speaker Mandarin emotional speech corpus are used to train a set of mixed-language average acoustic models of the target emotion by speaker adaptive training. Next, a one-speaker Mandarin neutral speech corpus or a one-speaker Tibetan neutral speech corpus is used to obtain a set of speaker-dependent acoustic models of the target emotion through the speaker adaptation transformation. Finally, Mandarin or Tibetan emotional speech is synthesized from the corresponding speaker-dependent acoustic models of the target emotion. In subjective tests, the average emotional mean opinion score is 4.14 for Tibetan and 4.26 for Mandarin, the average mean opinion score is 4.16 for Tibetan and 4.28 for Mandarin, and the average degradation opinion score is 4.28 for Tibetan and 4.24 for Mandarin. These results indicate that the proposed method can synthesize both Tibetan and Mandarin speech with high naturalness and emotional expression using only a Mandarin emotional training corpus.
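
The abstract reports its subjective results as averaged opinion scores. As a rough illustration of how such a figure is commonly computed, the sketch below takes listener ratings on the conventional 1 to 5 scale and returns their arithmetic mean with a normal-approximation 95% confidence half-width. The function name `mean_opinion_score`, the 1 to 5 scale, and the example ratings are assumptions made for illustration; the abstract does not describe the authors' scoring procedure, and the ratings are placeholders rather than data from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for listener ratings on a 1-5 scale.

    Hypothetical helper: the 1-5 scale and the normal-approximation
    interval are assumptions, not details taken from the paper.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    score = mean(ratings)
    # The sample standard deviation needs at least two ratings;
    # report a zero-width interval when only one rating is given.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return score, half_width


# Placeholder ratings for one synthesized utterance set (NOT from the paper).
example_ratings = [4, 5, 4, 4, 3, 5, 4, 4]
score, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

The same helper could be applied separately to naturalness, emotional similarity, and degradation ratings, one list per language, to produce per-language averages of the kind quoted above.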

Acknowledgments

The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant No. 11664036, 61263036) and Natural Science Foundation of Gansu (Grant No. 1506RJYA126).

Author information

Authors and Affiliations

College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou, 730070, China
Peiwen Wu, Hongwu Yang & Zhenye Gan

Corresponding author

Correspondence to Hongwu Yang.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Beiji Zou
Harbin Engineering University, Harbin, China
Qilong Han
Harbin University of Science and Technology, Harbin, China
Guanglu Sun
Northeast Forestry University, Harbin, China
Weipeng Jing
Huaihua University, Huaihua, Hunan, China
Xiaoning Peng
Sciences of Country Tripod Institute of Data Science, Harbin, China
Zeguang Lu

About this paper

Cite this paper

Wu, P., Yang, H., Gan, Z. (2017). Towards Realizing Mandarin-Tibetan Bi-lingual Emotional Speech Synthesis with Mandarin Emotional Training Corpus. In: Zou, B., Han, Q., Sun, G., Jing, W., Peng, X., Lu, Z. (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, vol 728. Springer, Singapore. https://doi.org/10.1007/978-981-10-6388-6_11

DOI: https://doi.org/10.1007/978-981-10-6388-6_11
Published: 16 September 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6387-9
Online ISBN: 978-981-10-6388-6
eBook Packages: Computer Science; Computer Science (R0)
