Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

This paper addresses liveness detection, a test that ensures that biometric cues are acquired from a live person who is actually present at the time of capture. The liveness check measures the degree of synchrony between the lip movements and the voice extracted from a video sequence. Three new asynchrony detection methods based on co-inertia analysis (CoIA) and a fourth based on coupled hidden Markov models (CHMMs) are derived. They are compared experimentally with several methods previously used in the literature for asynchrony detection and speaker location. The reported results show that the proposed CoIA- and CHMM-based methods are effective asynchrony detectors and outperform these earlier methods.
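To make the idea of a CoIA-based synchrony measure concrete, the following is a minimal sketch, not the paper's method. It relies only on the standard property that co-inertia analysis finds projections of two feature sets with maximal covariance, which are given by the leading singular vectors of their cross-covariance matrix. The function names, feature dimensions, synthetic data, and the choice of the first-axis correlation as the score are all assumptions made for illustration.

```python
import numpy as np


def coia_synchrony(audio, visual):
    """Correlation of the first pair of co-inertia projections.

    audio:  (T, Da) array of per-frame acoustic features (hypothetical)
    visual: (T, Dv) array of per-frame lip features (hypothetical)
    """
    X = audio - audio.mean(axis=0)
    Y = visual - visual.mean(axis=0)
    # Cross-covariance between the two modalities.
    C = X.T @ Y / (len(X) - 1)
    # The leading singular vectors maximise cov(X a, Y b) for unit a, b.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    xa, yb = X @ U[:, 0], Y @ Vt[0]
    denom = np.linalg.norm(xa) * np.linalg.norm(yb)
    return float(xa @ yb / denom) if denom > 0 else 0.0


def demo_scores(seed=0, frames=500, shift=50):
    """Score a synthetic synchronous pair and a time-shifted copy of it."""
    rng = np.random.default_rng(seed)
    # A shared latent signal stands in for the underlying speech activity.
    z = rng.standard_normal(frames)
    audio = np.outer(z, [1.0, -0.5, 0.8, 0.2])
    audio += 0.3 * rng.standard_normal((frames, 4))
    visual = np.outer(z, [0.7, -1.0, 0.4])
    visual += 0.3 * rng.standard_normal((frames, 3))
    synchronous = coia_synchrony(audio, visual)
    # Shifting the visual stream simulates a replayed or dubbed recording.
    asynchronous = coia_synchrony(audio, np.roll(visual, shift, axis=0))
    return synchronous, asynchronous


if __name__ == "__main__":
    sync_score, async_score = demo_scores()
    print(f"synchronous: {sync_score:.3f}  asynchronous: {async_score:.3f}")
```

In this toy setting the synchronous pair scores markedly higher than the shifted one, so thresholding the score would separate the two cases. A practical detector would also need real acoustic and lip features and a decision threshold tuned on held-out data, neither of which is shown here.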

Acknowledgments

This work has been partially supported by the Spanish Ministry of Education and Science (project PRESA TEC2005-07212), by the Xunta de Galicia (project PGIDIT05TIC32202PR) and by the European Union through the European Networks of Excellence BioSecure and K-Space.

Author information

Corresponding author

Correspondence to Enrique Argones Rúa.

About this article

Cite this article

Argones Rúa, E., Bredin, H., García Mateo, C. et al. Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models. Pattern Anal Applic 12, 271–284 (2009). https://doi.org/10.1007/s10044-008-0121-2

  • DOI: https://doi.org/10.1007/s10044-008-0121-2
