Abstract
This paper addresses liveness detection, a test that ensures that biometric cues are acquired from a live person who is actually present at the time of capture. The liveness check is performed by measuring the degree of synchrony between the lips and the voice extracted from a video sequence. Three new methods for asynchrony detection based on co-inertia analysis (CoIA) and a fourth based on coupled hidden Markov models (CHMMs) are derived. Experimental comparisons are made with several methods previously used in the literature for asynchrony detection and speaker location. The reported results show that the proposed methods, based on both CoIA and CHMMs, are effective and outperform the compared methods at asynchrony detection.
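To make the idea of a CoIA-based synchrony measure more concrete, here is a minimal sketch. It is not one of the three methods derived in the paper, whose details the abstract does not give. It only illustrates the general principle of co-inertia analysis: find unit-norm projection axes for the audio and visual feature streams that maximise the covariance of the projected signals (obtained here from the SVD of the cross-covariance matrix), then score synchrony as the correlation of the projections. The function names, feature dimensions, and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def coia_first_axes(X, Y):
    """Return unit vectors (a, b) maximising cov(X @ a, Y @ b).

    X: (n_frames, dx) audio features, Y: (n_frames, dy) visual features.
    The solution is the leading singular pair of the cross-covariance matrix.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (len(X) - 1)          # (dx, dy) cross-covariance
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, 0], Vt[0], s[0]


def synchrony_score(X, Y, a, b):
    """Absolute Pearson correlation between the two projected streams."""
    x = (X - X.mean(axis=0)) @ a
    y = (Y - Y.mean(axis=0)) @ b
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 0.0 if denom == 0 else float(abs(x @ y) / denom)


def demo(seed=0):
    """Compare a synchronised pair with a time-shifted (asynchronous) one.

    The data are synthetic: a shared latent signal drives both streams.
    """
    rng = np.random.default_rng(seed)
    n = 400
    latent = rng.standard_normal(n)       # hypothetical shared articulation signal
    audio = np.outer(latent, [1.0, -0.5, 0.3]) + 0.3 * rng.standard_normal((n, 3))
    video = np.outer(latent, [0.8, 0.2]) + 0.3 * rng.standard_normal((n, 2))

    a, b, _ = coia_first_axes(audio, video)
    synced = synchrony_score(audio, video, a, b)
    # Shifting the visual stream by 50 frames simulates an asynchronous video.
    shifted = synchrony_score(audio, np.roll(video, 50, axis=0), a, b)
    return synced, shifted


if __name__ == "__main__":
    synced, shifted = demo()
    print(f"synchronised score: {synced:.2f}, shifted score: {shifted:.2f}")
```

In this toy setting the axes are estimated on the same sequence that is scored. A practical system would more plausibly estimate them on separate training material and threshold the score to accept or reject a sequence; that choice is left open here.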
Acknowledgments
This work has been partially supported by the Spanish Ministry of Education and Science (project PRESA TEC2005-07212), by the Xunta de Galicia (project PGIDIT05TIC32202PR) and by the European Union through the European Networks of Excellence BioSecure and K-Space.
Cite this article
Argones Rúa, E., Bredin, H., García Mateo, C. et al. Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models. Pattern Anal Applic 12, 271–284 (2009). https://doi.org/10.1007/s10044-008-0121-2
DOI: https://doi.org/10.1007/s10044-008-0121-2