Acoustic to articulatory mapping with deep neural network

Published in Multimedia Tools and Applications

Abstract

Synthetic talking avatars have been shown to be very useful in human-computer interaction. In this paper, we address the problem of acoustic-to-articulatory mapping and explore several models of the mapping function: the general linear model (GLM), the Gaussian mixture model (GMM), the artificial neural network (ANN) and the deep neural network (DNN). Because the prediction stage of a neural network can be completed in a very short time (i.e. in real time), we develop a real-time speech-driven talking avatar system based on the DNN. The input of the system is acoustic speech and the output is the corresponding articulatory movements, synchronized with the input speech and rendered on a three-dimensional avatar. Experiments compare the performance of the GLM, GMM, ANN and DNN on the well-known acoustic-articulatory English speech corpus MNGU0. The results demonstrate that the proposed DNN-based acoustic-to-articulatory mapping achieves the best performance.
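
To make the frame-level regression concrete, the sketch below shows, in Python with PyTorch, a feed-forward network that maps stacked acoustic feature frames to articulator (EMA) coordinates under a mean-squared-error criterion. This is an illustrative approximation rather than the authors' implementation: the 440-dimensional input (40-dimensional acoustic frames with +/-5 frames of context), the 12-dimensional output, the sigmoid hidden layers and the plain backpropagation training (without RBM-based pretraining) are all assumptions made for the example.

# Minimal sketch (not the authors' code) of a feed-forward DNN that regresses
# articulatory (EMA) trajectories from acoustic features, using PyTorch.
# Assumed dimensions: 440-dim stacked acoustic input, 12-dim EMA output.
import torch
import torch.nn as nn

class AcousticToArticulatoryDNN(nn.Module):
    def __init__(self, in_dim=440, hidden_dim=512, out_dim=12, num_hidden=3):
        super().__init__()
        layers, prev = [], in_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(prev, hidden_dim), nn.Sigmoid()]
            prev = hidden_dim
        layers.append(nn.Linear(prev, out_dim))  # linear output for regression
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, acoustic_batch, ema_batch):
    """One mean-squared-error update on a batch of aligned frame pairs."""
    optimizer.zero_grad()
    pred = model(acoustic_batch)
    loss = nn.functional.mse_loss(pred, ema_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data; real training would use aligned
# acoustic/EMA frames from a corpus such as MNGU0.
model = AcousticToArticulatoryDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
acoustic = torch.randn(32, 440)   # 32 frames of stacked acoustic features
ema = torch.randn(32, 12)         # corresponding articulator coordinates
print(train_step(model, optimizer, acoustic, ema))

A linear output layer is used because articulator coordinates are unbounded continuous values; a GMM or GLM baseline would fit the same aligned frame pairs and predict via a conditional mean rather than a learned nonlinear regressor.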

Acknowledgements

This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). This work is also partially supported by the Hong Kong SAR Government’s Research Grants Council (N-CUHK414/09), the National Natural Science Foundation of China (61375027, 61370023 and 60805008), the National Social Science Foundation Major Project (13&ZD189) and Guangdong Provincial Science and Technology Program (2012A011100008).

Author information

Correspondence to Xixin Wu.

Cite this article

Wu, Z., Zhao, K., Wu, X. et al. Acoustic to articulatory mapping with deep neural network. Multimed Tools Appl 74, 9889–9907 (2015). https://doi.org/10.1007/s11042-014-2183-z
