Abstract
In this paper, we propose a novel correlation-based method for speech-video synchronization and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movements. First, a nonlinear time-varying model represents the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals, where each AM-FM component in this sum models a single speech formant frequency. Using a Taylor series expansion, the model is formulated to characterize the relation between the speech amplitude and the instantaneous frequency of each AM-FM component with respect to lip movements. Second, the envelope of the speech signal is estimated and correlated with signals generated from lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real data, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source, and (ii) estimate delays between the audio and video signals as small as 0.1 seconds when the speech signal is noisy and 0.04 seconds when the additive noise is less significant.
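The envelope-correlation step described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' exact pipeline: it assumes the lip-movement signal has already been resampled to the speech sampling rate, estimates the speech envelope as the magnitude of the analytic signal (Hilbert transform), and takes the location of the cross-correlation peak as the audio-video delay. The function name `estimate_delay` and its parameters are hypothetical.

```python
import numpy as np
from scipy.signal import hilbert

def estimate_delay(speech, lip_signal, fs):
    """Estimate the delay (in seconds) between a speech signal and a
    lip-movement signal via cross-correlation of the speech envelope.

    Illustrative sketch only: assumes `lip_signal` has already been
    resampled to the speech sampling rate `fs`.
    """
    # Speech envelope as the magnitude of the analytic signal.
    envelope = np.abs(hilbert(speech))
    # Remove the mean so the correlation is not dominated by DC offsets.
    env = envelope - envelope.mean()
    lip = lip_signal - lip_signal.mean()
    # Full cross-correlation; the peak location gives the relative lag.
    corr = np.correlate(env, lip, mode="full")
    lag = np.argmax(corr) - (len(lip) - 1)
    return lag / fs  # positive value: speech lags the lip signal
```

On synthetic data (a slow lip-opening sinusoid and a carrier whose amplitude follows it with a known delay), the cross-correlation peak recovers the delay to within a few samples, consistent with the sub-0.1-second resolution reported in the abstract.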
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
El-Sallam, A.A., Mian, A.S. (2009). Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation. In: Kamel, M., Campilho, A. (eds) Image Analysis and Recognition. ICIAR 2009. Lecture Notes in Computer Science, vol 5627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02611-9_40
DOI: https://doi.org/10.1007/978-3-642-02611-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02610-2
Online ISBN: 978-3-642-02611-9
eBook Packages: Computer Science, Computer Science (R0)