Speech-driven talking face using embedded confusable system for real time mobile multimedia

Multimedia Tools and Applications

Abstract

This paper presents a real-time speech-driven talking face system that offers low computational complexity and smooth visual rendering. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table. Building on the observation that many visemes are visually ambiguous, the table is constructed by grouping phonemes with the Houtgast similarity approach, based on viseme similarities estimated with a histogram distance. The resulting mapping table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time speech-driven talking face system comprises: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature extraction for robust acoustic feature vectors; 2) recognition network processing, in which an HMM and an MCSVM are combined for phoneme recognition and viseme classification; the HMM handles sequential inputs well, while the MCSVM offers strong generalization, especially with limited samples, and uses the phoneme-viseme mapping table to decide which viseme class the observation sequence produced by the HMM belongs to; 3) visual processing, which arranges the lip-shape images of the visemes in time sequence and increases realism through dynamic alpha blending with different alpha-value settings. Experiments show that the proposed speech signal processing, on noisy speech compared with clean speech, yields gains of 1.1 % (16.7 % to 15.6 %) and 4.8 % (30.4 % to 35.2 %) in PER and WER, respectively. For viseme classification, the error rate drops from 19.22 % to 9.37 %. Finally, we simulated GSM communication between a mobile phone and a PC and rated visual quality and speech-driven feeling using mean opinion scores. Overall, the proposed method reduces the number of visemes and lip-shape images through confusable sets and enables real-time operation.
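As a concrete illustration of the confusable-set idea, the sketch below shows, under simplifying assumptions, how visually similar visemes can be merged and how the merged sets yield a smaller phoneme-viseme mapping table. The toy lip-shape histograms, the L1 histogram distance, the greedy grouping, and the 0.25 threshold are all illustrative stand-ins, not the paper's procedure; the paper estimates viseme similarity with a histogram distance and then groups phonemes with the Houtgast similarity approach.

```python
# Minimal sketch of confusable-set construction for a phoneme-viseme
# mapping table.  All data, the threshold, and the greedy grouping are
# illustrative assumptions standing in for the paper's Houtgast-based method.
import numpy as np

def histogram_distance(h1, h2):
    """Normalized L1 (city-block) distance between two lip-shape histograms."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return 0.5 * np.abs(h1 - h2).sum()      # 0 = identical, 1 = disjoint

def group_confusable_visemes(histograms, threshold=0.25):
    """Greedily merge visemes whose histogram distance to a group's
    representative falls below the threshold into one confusable set."""
    groups = []
    for name in histograms:
        for group in groups:
            rep = group[0]                    # first member acts as representative
            if histogram_distance(histograms[name], histograms[rep]) < threshold:
                group.append(name)
                break
        else:
            groups.append([name])             # start a new confusable set
    return groups

# Toy lip-shape histograms for four visemes (purely illustrative data).
histograms = {
    "p/b/m": np.array([8., 1., 1.]),
    "f/v":   np.array([7., 2., 1.]),
    "a":     np.array([1., 8., 1.]),
    "o/u":   np.array([1., 2., 7.]),
}
confusable_sets = group_confusable_visemes(histograms)
print(confusable_sets)   # [['p/b/m', 'f/v'], ['a'], ['o/u']]

# Phoneme -> confusable-set index: the reduced mapping table the classifier
# would use, so fewer viseme classes and lip-shape images are needed.
phoneme_to_viseme = {"p": 0, "b": 0, "m": 0, "f": 0, "v": 0,
                     "a": 1, "o": 2, "u": 2}
```

With the toy data above, the bilabial and labiodental visemes collapse into one confusable set, so a single lip-shape image can serve both; this is the mechanism by which confusable sets reduce the number of visemes, lip-shape images, and run-time computation.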


Acknowledgments

This research was partially supported by National Cheng Kung University and the NSC Research Fund.

Author information

Corresponding author

Correspondence to Anand Paul.

About this article

Cite this article

Shih, PY., Paul, A., Wang, JF. et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia. Multimed Tools Appl 73, 417–437 (2014). https://doi.org/10.1007/s11042-013-1609-3
