Abstract
In this paper, we propose a novel correlation-based method for speech-video synchronization and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movements. First, a nonlinear time-varying model represents the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals, where each AM-FM component in this sum models a single speech formant frequency. Using a Taylor series expansion, the model is formulated to characterize the relation between the speech amplitude and the instantaneous frequency of each AM-FM component with respect to lip movements. Second, the envelope of the speech signal is estimated and correlated with signals generated from lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real data, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source, and (ii) estimate delays between the audio and video signals as small as 0.1 seconds when the speech signal is noisy and 0.04 seconds when the additive noise is less significant.
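The envelope-correlation step described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' exact pipeline: it assumes the lip-movement signal has already been resampled to the speech sampling rate, estimates the speech envelope as the magnitude of the analytic signal (Hilbert transform), and takes the location of the cross-correlation peak as the audio-video delay. The function name `estimate_delay` and its parameters are hypothetical.

```python
import numpy as np
from scipy.signal import hilbert

def estimate_delay(speech, lip_signal, fs):
    """Estimate the delay (in seconds) between a speech signal and a
    lip-movement signal via cross-correlation of the speech envelope.

    Illustrative sketch only: assumes `lip_signal` has already been
    resampled to the speech sampling rate `fs`.
    """
    # Speech envelope as the magnitude of the analytic signal.
    envelope = np.abs(hilbert(speech))
    # Remove the mean so the correlation is not dominated by DC offsets.
    env = envelope - envelope.mean()
    lip = lip_signal - lip_signal.mean()
    # Full cross-correlation; the peak location gives the relative lag.
    corr = np.correlate(env, lip, mode="full")
    lag = np.argmax(corr) - (len(lip) - 1)
    return lag / fs  # positive value: speech lags the lip signal
```

On synthetic data (a slow lip-opening sinusoid and a carrier whose amplitude follows it with a known delay), the cross-correlation peak recovers the delay to within a few samples, consistent with the sub-0.1-second resolution reported in the abstract.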
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
El-Sallam, A.A., Mian, A.S. (2009). Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation. In: Kamel, M., Campilho, A. (eds) Image Analysis and Recognition. ICIAR 2009. Lecture Notes in Computer Science, vol 5627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02611-9_40
DOI: https://doi.org/10.1007/978-3-642-02611-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02610-2
Online ISBN: 978-3-642-02611-9
eBook Packages: Computer Science, Computer Science (R0)