
Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework

Multimedia Tools and Applications

Abstract

Constructing a mapping between articulatory movements and the corresponding speech could significantly facilitate speech training and the development of speech aids for patients with voice disorders. In this paper, we propose a deep learning framework for building a bidirectional mapping between articulatory information and synchronized speech recorded with an ultrasound system. We created a dataset of six Chinese vowels and employed a bimodal deep autoencoder based on the Restricted Boltzmann Machine (RBM) to learn the correlation between speech and ultrasound images of the tongue, obtaining the weight matrices of the learned data representations. Speech and ultrasound images were then reconstructed from the extracted features. The reconstruction error for the ultrasound images was lower with our method than with an approach based on Principal Component Analysis (PCA), and the reconstructed speech approximated the original closely, as indicated by a small mean formant error (MFE). After obtaining the shared representations with the RBM-based deep autoencoder, we performed the mapping between ultrasound tongue images and the corresponding acoustic signals within a Deep Neural Network (DNN) framework using revised deep denoising autoencoders. The results indicate that the proposed method outperforms a Gaussian Mixture Model (GMM)-based method to which it was compared.
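
For concreteness, the sketch below illustrates the general idea of a bimodal autoencoder that fuses ultrasound tongue images and acoustic features into a shared representation and reconstructs both modalities from it. It is a minimal PyTorch sketch, not the authors' implementation: the feature dimensions, layer sizes, and end-to-end backpropagation training are assumptions for illustration, whereas the paper pretrains the network with RBMs.

# Hypothetical sketch of a bimodal autoencoder for ultrasound tongue images
# and speech features. Dimensions, layer sizes, and end-to-end backprop
# training are illustrative assumptions; the paper uses RBM-based pretraining.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, img_dim=4096, speech_dim=39, shared_dim=256):
        super().__init__()
        # Modality-specific encoders
        self.enc_img = nn.Sequential(nn.Linear(img_dim, 1024), nn.Sigmoid(),
                                     nn.Linear(1024, 256), nn.Sigmoid())
        self.enc_sp = nn.Sequential(nn.Linear(speech_dim, 128), nn.Sigmoid(),
                                    nn.Linear(128, 64), nn.Sigmoid())
        # Fusion layer producing the shared representation
        self.fuse = nn.Sequential(nn.Linear(256 + 64, shared_dim), nn.Sigmoid())
        # Decoders reconstruct each modality from the shared code
        self.dec_img = nn.Sequential(nn.Linear(shared_dim, 1024), nn.Sigmoid(),
                                     nn.Linear(1024, img_dim))
        self.dec_sp = nn.Sequential(nn.Linear(shared_dim, 128), nn.Sigmoid(),
                                    nn.Linear(128, speech_dim))

    def forward(self, img, speech):
        h = self.fuse(torch.cat([self.enc_img(img), self.enc_sp(speech)], dim=1))
        return self.dec_img(h), self.dec_sp(h)

# Toy training loop on random data (placeholders for real ultrasound frames
# and acoustic feature vectors).
model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
img = torch.rand(32, 4096)    # e.g. flattened 64x64 ultrasound frames
speech = torch.rand(32, 39)   # e.g. MFCC-like acoustic features
for step in range(100):
    rec_img, rec_sp = model(img, speech)
    loss = (nn.functional.mse_loss(rec_img, img)
            + nn.functional.mse_loss(rec_sp, speech))
    opt.zero_grad()
    loss.backward()
    opt.step()

The same shared code can serve either mapping direction: zeroing out one modality's input at training time (in the spirit of a denoising autoencoder) forces the network to predict it from the other, which is roughly how a bidirectional articulatory-acoustic mapping can be obtained from such an architecture.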





Acknowledgments

This work was supported in part by the National Basic Research Program of China (No. 2013CB329305), and in part by grants from the National Natural Science Foundation of China (No. 61175016, No. 61304250).

Author information

Corresponding author

Correspondence to Wenhuan Lu.


About this article


Cite this article

Wei, J., Fang, Q., Zheng, X. et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework. Multimed Tools Appl 75, 5223–5245 (2016). https://doi.org/10.1007/s11042-015-3038-y

