Abstract
This paper presents a real-time speech-driven talking face system that offers low computational complexity and visually smooth results. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table. Building on the observation that many visemes are visually ambiguous, the table is constructed by estimating viseme similarity with a histogram distance and then grouping phonemes using the Houtgast similarity approach; the resulting table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time speech-driven talking face system consists of three stages: 1) speech signal processing, which applies SNR-aware speech enhancement for noise reduction and ICA-based feature extraction for robust acoustic feature vectors; 2) recognition network processing, which combines a hidden Markov model (HMM) with a multi-class support vector machine (MCSVM) for phoneme recognition and viseme classification, since the HMM handles sequential inputs well while the MCSVM classifies with good generalization, especially from limited samples; here, the phoneme-viseme mapping table tells the MCSVM which viseme class the HMM's observation sequence belongs to; and 3) visual processing, which arranges the lip-shape images of the visemes in time sequence and adds realism through dynamic alpha blending with varying alpha settings. Experiments show that the adopted speech signal processing, applied to noisy speech and compared against clean speech, yields improvements of 1.1 % (16.7 % to 15.6 %) and 4.8 % (30.4 % to 35.2 %) in PER and WER, respectively. For viseme classification, the error rate decreases from 19.22 % to 9.37 %. Finally, we simulated GSM communication between a mobile phone and a PC and evaluated visual quality and the speech-driven impression with mean opinion scores. In sum, our method reduces the number of visemes and lip-shape images through confusable sets and enables real-time operation.
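To make the confusable-set construction concrete, the following minimal Python sketch estimates pairwise viseme similarity from lip-shape image histograms and greedily merges visually ambiguous visemes into confusable sets. The histogram-intersection measure, the similarity threshold, and all function names are illustrative assumptions; the paper itself uses a histogram distance combined with the Houtgast similarity approach, whose exact formulation is not reproduced here.

import numpy as np

def lip_histogram(image, bins=32):
    # Normalized grayscale intensity histogram of a lip-shape image.
    hist, _ = np.histogram(image, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def histogram_similarity(h1, h2):
    # Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint.
    return float(np.minimum(h1, h2).sum())

def group_confusable_visemes(histograms, threshold=0.8):
    # Greedily merge visemes whose lip histograms exceed the (assumed)
    # similarity threshold into one confusable set, shrinking the inventory.
    groups = []
    for viseme, hist in histograms.items():
        for group in groups:
            rep = histograms[group[0]]  # compare against the group's first member
            if histogram_similarity(hist, rep) >= threshold:
                group.append(viseme)
                break
        else:
            groups.append([viseme])
    return groups

Each resulting group can then be mapped to a single representative lip-shape image, which is how a confusable set reduces both the viseme count and the image inventory. The dynamic alpha blending in the visual processing stage can be sketched in the same spirit: consecutive lip-shape images are cross-faded with a time-varying alpha so that viseme transitions appear smooth. The linear alpha ramp and fixed step count below are assumptions; the paper varies the alpha settings rather than using one fixed schedule.

def blend_lip_frames(prev_img, next_img, steps=5):
    # Return intermediate frames that fade prev_img into next_img.
    frames = []
    for i in range(1, steps + 1):
        alpha = i / steps  # linear ramp toward 1.0 (assumed schedule)
        frame = (1.0 - alpha) * prev_img + alpha * next_img
        frames.append(frame.astype(prev_img.dtype))
    return frames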
Acknowledgments
This research was partially supported by National Cheng Kung University and the NSC Research Fund.
Cite this article
Shih, PY., Paul, A., Wang, JF. et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia. Multimed Tools Appl 73, 417–437 (2014). https://doi.org/10.1007/s11042-013-1609-3