
An Emotional Text-Driven 3D Visual Pronunciation System for Mandarin Chinese

  • Conference paper
  • Pattern Recognition (CCPR 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 662)

Abstract

This paper proposes an emotional text-driven 3D visual pronunciation system for Mandarin Chinese. First, based on an articulatory speech corpus collected by electromagnetic articulography (EMA), hidden Markov models (HMMs) are trained on the articulatory features, with fully context-dependent modeling achieved by exploiting the rich linguistic features of the input text. Second, since emotion is expressed more markedly in the articulatory domain, where the articulators can be manipulated independently, the differences between articulatory movements under different emotions are investigated. Third, emotional speech is generated by adjusting speech parameters such as fundamental frequency (F0), duration, and intensity in Praat. Finally, as the generated emotional speech is played, the corresponding articulatory movements are synthesized simultaneously from the trained HMMs and used to drive a 3D head mesh model in synchrony with the speech. Experiments demonstrate that the system synthesizes emotional speech with accurately synchronized articulator animation at the phoneme level.
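As a concrete illustration of the Praat-based adjustment step, the minimal sketch below modifies the F0, duration, and intensity of a neutral utterance. It uses parselmouth, a Python interface to Praat, rather than the Praat application the paper refers to; the function name `emotionalize`, the file names, and the scaling factors are illustrative assumptions, not values from the paper.

```python
# A sketch of emotion-dependent prosody modification via Praat commands,
# assuming the parselmouth package (pip install praat-parselmouth).
# The scaling factors below are illustrative, not the paper's values.
import parselmouth
from parselmouth.praat import call

def emotionalize(wav_path, f0_factor=1.2, duration_factor=0.9, intensity_db=72.0):
    """Resynthesize a neutral utterance with adjusted F0, duration, intensity."""
    snd = parselmouth.Sound(wav_path)

    # Build a Manipulation object (time step, pitch floor/ceiling in Hz).
    manipulation = call(snd, "To Manipulation", 0.01, 75, 600)

    # Scale F0: extract the pitch tier, multiply, and put it back.
    pitch_tier = call(manipulation, "Extract pitch tier")
    call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, f0_factor)
    call([pitch_tier, manipulation], "Replace pitch tier")

    # Scale duration with a flat duration tier (< 1 speeds speech up).
    duration_tier = call("Create DurationTier", "dur", snd.xmin, snd.xmax)
    call(duration_tier, "Add point", snd.xmin, duration_factor)
    call([duration_tier, manipulation], "Replace duration tier")

    # Overlap-add resynthesis, then set the overall intensity level (dB).
    out = call(manipulation, "Get resynthesis (overlap-add)")
    call(out, "Scale intensity", intensity_db)
    return out

# Example usage (hypothetical input/output files):
# emotionalize("neutral.wav").save("happy.wav", "WAV")
```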



Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61572450 and No. 61303150), the Open Project Program of the State Key Lab of CAD & CG, Zhejiang University (No. A1501), the Fundamental Research Funds for the Central Universities (WK2350000002), and the Open Funding Project of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (No. BUAA-VR-16KF-12).

Author information

Correspondence to Jun Yu.


Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yu, L., Luo, C., Yu, J. (2016). An Emotional Text-Driven 3D Visual Pronunciation System for Mandarin Chinese. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 662. Springer, Singapore. https://doi.org/10.1007/978-981-10-3002-4_8

  • DOI: https://doi.org/10.1007/978-981-10-3002-4_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3001-7

  • Online ISBN: 978-981-10-3002-4

  • eBook Packages: Computer Science; Computer Science (R0)
