Audiovisual synchrony assessment for replay attack detection in talking face biometrics

Multimedia Tools and Applications

Abstract

Audiovisual speech synchrony detection is an important liveness check for talking face verification systems: it helps ensure that the audio and video biometric samples were actually acquired from the same source. In prior work, the visual speech features have mainly described facial appearance or mouth shape in a frame-wise manner, thus ignoring lip motion between consecutive frames. Because visual speech dynamics are also important, we take spatiotemporal information into account and propose the use of space-time auto-correlation of gradients (STACOG) for measuring audiovisual synchrony. To evaluate the effectiveness of the proposed approach, a set of challenging and realistic attack scenarios is designed by augmenting the publicly available BANCA and XM2VTS datasets with synthetic replay attacks. Our experimental analysis shows that the STACOG features outperform state-of-the-art alternatives, e.g. discrete cosine transform based features, in measuring audiovisual synchrony.
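The idea of scoring synchrony from motion between frames, rather than from single-frame appearance, can be illustrated with a minimal sketch. It is not the STACOG descriptor and not the authors' pipeline: the motion feature is a plain mean absolute frame difference, the score is a Pearson correlation, and the data are synthetic. The names `motion_energy` and `synchrony_score`, the 8x8 "mouth region", and all constants are assumptions made for illustration only.

```python
import numpy as np

def motion_energy(frames):
    """Mean absolute difference between consecutive frames.

    A crude stand-in for a spatiotemporal descriptor (NOT STACOG):
    it returns one value per frame transition.
    """
    frames = np.asarray(frames, dtype=float)
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def synchrony_score(audio_feat, visual_feat):
    """Pearson correlation between two equally long 1-D sequences."""
    a = np.asarray(audio_feat, dtype=float)
    v = np.asarray(visual_feat, dtype=float)
    a = a - a.mean()
    v = v - v.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(v)
    return float(a @ v / denom) if denom > 0 else 0.0

# Synthetic toy data (hypothetical): a mouth "opening" signal drives
# the brightness of an 8x8 mouth region in each video frame.
rng = np.random.default_rng(0)
opening = rng.random(200)
frames = opening[:, None, None] * np.ones((8, 8))
visual = motion_energy(frames)

# Genuine sample: audio activity follows the same lip dynamics.
audio_genuine = np.abs(np.diff(opening)) + 0.01 * rng.standard_normal(199)

# Replay-style attack: audio taken from an unrelated recording.
other_opening = rng.random(200)
audio_attack = np.abs(np.diff(other_opening))

genuine = synchrony_score(audio_genuine, visual)
attack = synchrony_score(audio_attack, visual)
print(f"genuine score: {genuine:.3f}, attack score: {attack:.3f}")
```

In this toy setting the genuine pair scores close to 1 while the mismatched pair scores near 0, so a threshold on the score separates them. Real recordings would need mouth localisation, audio features sampled at the video frame rate, and a learned correlation model in place of a single Pearson coefficient.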

Notes

  1. http://www.reallusion.com/crazytalk/

Acknowledgments

E. Boutellaa acknowledges the financial support of the Algerian MESRS and CDTA under grant number 060/PNE/ENS/FINLANDE/2014-2015. The support of the Academy of Finland and the Infotech Oulu Doctoral Program is also acknowledged.

Author information

Corresponding author

Correspondence to Elhocine Boutellaa.

About this article

Cite this article

Boutellaa, E., Boulkenafet, Z., Komulainen, J. et al. Audiovisual synchrony assessment for replay attack detection in talking face biometrics. Multimed Tools Appl 75, 5329–5343 (2016). https://doi.org/10.1007/s11042-015-2848-2

  • DOI: https://doi.org/10.1007/s11042-015-2848-2
