Acoustic to articulatory mapping with deep neural network

Published in Multimedia Tools and Applications

Abstract

Synthetic talking avatars have been shown to be very useful in human-computer interaction. In this paper, we address the problem of acoustic-to-articulatory mapping and explore several models of the mapping function: the general linear model (GLM), the Gaussian mixture model (GMM), the artificial neural network (ANN) and the deep neural network (DNN). Because the prediction stage of a neural network can be completed in a very short time (i.e. in real time), we develop a real-time speech-driven talking avatar system based on the DNN. The input of the system is acoustic speech and the output is the corresponding articulatory movements, synchronized with the input speech and rendered on a three-dimensional avatar. Experiments compare the performance of the GLM, GMM, ANN and DNN on the well-known acoustic-articulatory English speech corpus MNGU0. The results demonstrate that the proposed DNN-based acoustic-to-articulatory mapping achieves the best performance.
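
To make the frame-level regression concrete, the sketch below shows, in Python with PyTorch, a feed-forward network that maps stacked acoustic feature frames to articulator (EMA) coordinates under a mean-squared-error criterion. This is an illustrative approximation rather than the authors' implementation: the 440-dimensional input (40-dimensional acoustic frames with +/-5 frames of context), the 12-dimensional output, the sigmoid hidden layers and the plain backpropagation training (without RBM-based pretraining) are all assumptions made for the example.

# Minimal sketch (not the authors' code) of a feed-forward DNN that regresses
# articulatory (EMA) trajectories from acoustic features, using PyTorch.
# Assumed dimensions: 440-dim stacked acoustic input, 12-dim EMA output.
import torch
import torch.nn as nn

class AcousticToArticulatoryDNN(nn.Module):
    def __init__(self, in_dim=440, hidden_dim=512, out_dim=12, num_hidden=3):
        super().__init__()
        layers, prev = [], in_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(prev, hidden_dim), nn.Sigmoid()]
            prev = hidden_dim
        layers.append(nn.Linear(prev, out_dim))  # linear output for regression
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, acoustic_batch, ema_batch):
    """One mean-squared-error update on a batch of aligned frame pairs."""
    optimizer.zero_grad()
    pred = model(acoustic_batch)
    loss = nn.functional.mse_loss(pred, ema_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data; real training would use aligned
# acoustic/EMA frames from a corpus such as MNGU0.
model = AcousticToArticulatoryDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
acoustic = torch.randn(32, 440)   # 32 frames of stacked acoustic features
ema = torch.randn(32, 12)         # corresponding articulator coordinates
print(train_step(model, optimizer, acoustic, ema))

A linear output layer is used because articulator coordinates are unbounded continuous values; a GMM or GLM baseline would fit the same aligned frame pairs and predict via a conditional mean rather than a learned nonlinear regressor.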

Acknowledgements

This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). This work is also partially supported by the Hong Kong SAR Government’s Research Grants Council (N-CUHK414/09), the National Natural Science Foundation of China (61375027, 61370023 and 60805008), the National Social Science Foundation Major Project (13&ZD189) and Guangdong Provincial Science and Technology Program (2012A011100008).

Author information

Correspondence to Xixin Wu.

Cite this article

Wu, Z., Zhao, K., Wu, X. et al. Acoustic to articulatory mapping with deep neural network. Multimed Tools Appl 74, 9889–9907 (2015). https://doi.org/10.1007/s11042-014-2183-z
