Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models

Argones Rúa, Enrique; Bredin, Hervé; García Mateo, Carmen; Chollet, Gérard; González Jiménez, Daniel

doi:10.1007/s10044-008-0121-2

Theoretical Advances
Published: 06 May 2008

Volume 12, pages 271–284, (2009)

Pattern Analysis and Applications

Enrique Argones Rúa¹,
Hervé Bredin²,
Carmen García Mateo¹,
Gérard Chollet² &
Daniel González Jiménez¹

Abstract

This paper addresses the subject of liveness detection, which is a test that ensures that biometric cues are acquired from a live person who is actually present at the time of capture. The liveness check is performed by measuring the degree of synchrony between the lips and the voice extracted from a video sequence. Three new methods for asynchrony detection based on co-inertia analysis (CoIA) and a fourth based on coupled hidden Markov models (CHMMs) are derived. Experimental comparisons are made with several methods previously used in the literature for asynchrony detection and speaker location. The reported results demonstrate the effectiveness and superiority of the proposed new methods based on both CoIA and CHMMs as asynchrony detection methods.
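
The synchrony measurement mentioned in the abstract can be illustrated with a minimal sketch of co-inertia analysis. CoIA looks for one projection axis per modality such that the covariance between the projected audio and visual features is maximal, and these axes are the singular vectors of the cross-covariance matrix. The function name `coia_synchrony`, the synthetic features, and the choice of the absolute correlation of the first projected components as the score are assumptions made for illustration; this is not the scoring used by any of the methods proposed in the paper.

```python
import numpy as np


def coia_synchrony(audio, video):
    """Score audio-visual synchrony with a basic co-inertia analysis.

    audio: array of shape (T, d_audio), one acoustic feature vector per frame.
    video: array of shape (T, d_video), one lip feature vector per frame.
    Returns the absolute Pearson correlation between the two streams after
    projecting each onto its first co-inertia axis (an illustrative measure,
    assumed here, not taken from the paper).
    """
    audio = np.asarray(audio, dtype=float)
    video = np.asarray(video, dtype=float)
    if audio.shape[0] != video.shape[0]:
        raise ValueError("audio and video need the same number of frames")

    # Center each stream so that products of columns estimate covariances.
    a = audio - audio.mean(axis=0)
    v = video - video.mean(axis=0)

    # Cross-covariance matrix between audio and video features.
    cross_cov = a.T @ v / (a.shape[0] - 1)

    # CoIA axes are the singular vectors of the cross-covariance matrix:
    # the first pair maximizes the covariance between the projections.
    left, _, right_t = np.linalg.svd(cross_cov, full_matrices=False)
    audio_proj = a @ left[:, 0]
    video_proj = v @ right_t[0]

    return float(abs(np.corrcoef(audio_proj, video_proj)[0, 1]))


# Synthetic demo (hypothetical data): both streams follow one latent signal.
rng = np.random.default_rng(0)
speech = rng.standard_normal(500)
audio = np.outer(speech, [1.0, 0.5, -0.3, 0.8]) + 0.1 * rng.standard_normal((500, 4))
video = np.outer(speech, [0.7, -0.4, 0.2]) + 0.1 * rng.standard_normal((500, 3))

in_sync = coia_synchrony(audio, video)
# Shifting the video by 25 frames simulates an asynchronous recording.
out_of_sync = coia_synchrony(audio, np.roll(video, 25, axis=0))
print(in_sync, out_of_sync)
```

In this sketch the axes are estimated on the same sequence that is scored. A practical liveness test would more plausibly compare the score against a threshold tuned on held-out data, which is outside the scope of this example.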

Acknowledgments

This work has been partially supported by the Spanish Ministry of Education and Science (project PRESA TEC2005-07212), by the Xunta de Galicia (project PGIDIT05TIC32202PR), and by the European Union through the European Networks of Excellence BioSecure and K-Space.

Author information

Authors and Affiliations

SPG, STC Department, University of Vigo, 36200, Vigo, Spain
Enrique Argones Rúa, Carmen García Mateo & Daniel González Jiménez
Dépt. TSI, CNRS-LTCI, GET-ENST, Paris, France
Hervé Bredin & Gérard Chollet

Corresponding author

Correspondence to Enrique Argones Rúa.

About this article

Cite this article

Argones Rúa, E., Bredin, H., García Mateo, C. et al. Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden markov models. Pattern Anal Applic 12, 271–284 (2009). https://doi.org/10.1007/s10044-008-0121-2

Received: 08 February 2007
Accepted: 02 April 2008
Published: 06 May 2008
Issue Date: September 2009
DOI: https://doi.org/10.1007/s10044-008-0121-2
