VALID: A New Practical Audio-Visual Database, and Comparative Results

Fox, Niall A.; O’Mullane, Brian A.; Reilly, Richard B.

doi:10.1007/11527923_81

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3546)

Included in the following conference series:

International Conference on Audio- and Video-Based Biometric Person Authentication

Abstract

The performance of audio, face, and multi-modal person recognition systems deployed in uncontrolled scenarios is typically lower than that of systems developed in highly controlled environments. To facilitate the development of robust audio, face, and multi-modal person recognition systems, the new large and realistic multi-modal (audio-visual) VALID database was acquired in a noisy “real world” office scenario with no control over illumination or acoustic noise. In this paper we describe the acquisition and content of the VALID database, which consists of five recording sessions of 106 subjects over a period of one month. Speaker identification experiments using visual speech features extracted from the mouth region are reported. The performance on the uncontrolled VALID database is compared with that on the controlled XM2VTS database. The best VALID and XM2VTS based accuracies are 63.21% and 97.17%, respectively. This highlights the degrading effect of an uncontrolled illumination environment and the importance of this database for deploying real-world applications. The VALID database is available to the academic community through http://ee.ucd.ie/validdb/.
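
To make the gap between the two reported accuracies concrete, the short sketch below converts them into identification error rates and compares them. Only the two accuracy figures come from the abstract; the function and constant names are illustrative assumptions, not part of the authors' experimental code.

```python
# Best speaker identification accuracies reported in the abstract (percent).
ACCURACY_VALID = 63.21   # uncontrolled office recordings (VALID)
ACCURACY_XM2VTS = 97.17  # controlled recordings (XM2VTS)


def error_rate(accuracy_percent: float) -> float:
    """Return the identification error rate in percent (hypothetical helper)."""
    return 100.0 - accuracy_percent


def error_ratio(acc_uncontrolled: float, acc_controlled: float) -> float:
    """Ratio of the uncontrolled error rate to the controlled error rate."""
    return error_rate(acc_uncontrolled) / error_rate(acc_controlled)


if __name__ == "__main__":
    print(f"VALID error rate:  {error_rate(ACCURACY_VALID):.2f}%")
    print(f"XM2VTS error rate: {error_rate(ACCURACY_XM2VTS):.2f}%")
    print(f"Error ratio:       {error_ratio(ACCURACY_VALID, ACCURACY_XM2VTS):.1f}x")
```

Read this way, the error rate rises from 2.83% on XM2VTS to 36.79% on VALID, roughly a thirteenfold increase, which is the degradation the abstract attributes to uncontrolled illumination.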

Author information

Authors and Affiliations

Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
Niall A. Fox, Brian A. O’Mullane & Richard B. Reilly

Editor information

Editors and Affiliations

The Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3890, USA
Takeo Kanade
Withington Hospital, Nightingale Centre, Manchester, UK
Anil Jain
IBM Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10598, USA
Nalini K. Ratha

About this paper

Cite this paper

Fox, N.A., O’Mullane, B.A., Reilly, R.B. (2005). VALID: A New Practical Audio-Visual Database, and Comparative Results. In: Kanade, T., Jain, A., Ratha, N.K. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2005. Lecture Notes in Computer Science, vol 3546. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527923_81

DOI: https://doi.org/10.1007/11527923_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27887-0
Online ISBN: 978-3-540-31638-1
eBook Packages: Computer Science, Computer Science (R0)
