Abstract
This paper presents a real-time speech-driven talking face system that offers low computational complexity and visually smooth results. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table. Building on the observation that many visemes are visually ambiguous, the table is constructed by estimating viseme similarity with a histogram distance and then grouping phonemes using the Houtgast similarity approach; the resulting table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time speech-driven talking face system consists of three stages: 1) speech signal processing, which applies SNR-aware speech enhancement for noise reduction and ICA-based feature extraction for robust acoustic feature vectors; 2) recognition network processing, which combines a hidden Markov model (HMM) with a multi-class support vector machine (MCSVM) for phoneme recognition and viseme classification, since the HMM handles sequential inputs well while the MCSVM classifies with good generalization, especially from limited samples; here, the phoneme-viseme mapping table tells the MCSVM which viseme class the HMM's observation sequence belongs to; and 3) visual processing, which arranges the lip-shape images of the visemes in time sequence and adds realism through dynamic alpha blending with varying alpha settings. Experiments show that the adopted speech signal processing, applied to noisy speech and compared against clean speech, yields improvements of 1.1 % (16.7 % to 15.6 %) and 4.8 % (30.4 % to 35.2 %) in PER and WER, respectively. For viseme classification, the error rate decreases from 19.22 % to 9.37 %. Finally, we simulated GSM communication between a mobile phone and a PC and evaluated visual quality and the speech-driven impression with mean opinion scores. In sum, our method reduces the number of visemes and lip-shape images through confusable sets and enables real-time operation.
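To make the confusable-set construction concrete, the following minimal Python sketch estimates pairwise viseme similarity from lip-shape image histograms and greedily merges visually ambiguous visemes into confusable sets. The histogram-intersection measure, the similarity threshold, and all function names are illustrative assumptions; the paper itself uses a histogram distance combined with the Houtgast similarity approach, whose exact formulation is not reproduced here.

import numpy as np

def lip_histogram(image, bins=32):
    # Normalized grayscale intensity histogram of a lip-shape image.
    hist, _ = np.histogram(image, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def histogram_similarity(h1, h2):
    # Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint.
    return float(np.minimum(h1, h2).sum())

def group_confusable_visemes(histograms, threshold=0.8):
    # Greedily merge visemes whose lip histograms exceed the (assumed)
    # similarity threshold into one confusable set, shrinking the inventory.
    groups = []
    for viseme, hist in histograms.items():
        for group in groups:
            rep = histograms[group[0]]  # compare against the group's first member
            if histogram_similarity(hist, rep) >= threshold:
                group.append(viseme)
                break
        else:
            groups.append([viseme])
    return groups

Each resulting group can then be mapped to a single representative lip-shape image, which is how a confusable set reduces both the viseme count and the image inventory. The dynamic alpha blending in the visual processing stage can be sketched in the same spirit: consecutive lip-shape images are cross-faded with a time-varying alpha so that viseme transitions appear smooth. The linear alpha ramp and fixed step count below are assumptions; the paper varies the alpha settings rather than using one fixed schedule.

def blend_lip_frames(prev_img, next_img, steps=5):
    # Return intermediate frames that fade prev_img into next_img.
    frames = []
    for i in range(1, steps + 1):
        alpha = i / steps  # linear ramp toward 1.0 (assumed schedule)
        frame = (1.0 - alpha) * prev_img + alpha * next_img
        frames.append(frame.astype(prev_img.dtype))
    return frames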
Acknowledgments
This research was partially supported by National Cheng Kung University and the NSC Research Fund.
Cite this article
Shih, PY., Paul, A., Wang, JF. et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia. Multimed Tools Appl 73, 417–437 (2014). https://doi.org/10.1007/s11042-013-1609-3