Abstract
This paper proposes a new method for bimodal information fusion in audio-visual speech recognition, in which cross-modal association is considered at two levels. First, the acoustic and visual data streams are combined at the feature level using canonical correlation analysis, which addresses audio-visual synchronization and exploits the cross-modal correlation. Second, the information streams are integrated at the decision level, so that the fusion adapts to the noise condition of the given speech datum. Experimental results demonstrate that the proposed method yields noise-robust recognition performance without a priori knowledge of the noise conditions of the speech data.
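The two levels described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fit_cca` and `fuse_decisions`, the regularization term `reg`, and the weighted-sum form of the decision rule are assumptions made for illustration, and the abstract does not say how the fusion weight is chosen for each utterance.

```python
import numpy as np


def fit_cca(X, Y, n_components, reg=1e-6):
    """Canonical correlation analysis between two feature streams.

    X: (n_frames, p) acoustic features, Y: (n_frames, q) visual features.
    Returns projections Wx (p, k), Wy (q, k) and the k canonical correlations.
    `reg` is a small ridge term (an assumption) that keeps the covariances
    invertible.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten each stream with its Cholesky factor: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # T = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)

    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return Wx, Wy, s[:n_components]


def fuse_decisions(audio_loglik, visual_loglik, audio_weight):
    """Decision-level fusion as a weighted sum of per-class log-likelihoods.

    audio_weight lies in [0, 1]; a lower value trusts the visual stream more,
    as one would want for noisy audio. Returns the index of the winning class.
    """
    audio_loglik = np.asarray(audio_loglik, dtype=float)
    visual_loglik = np.asarray(visual_loglik, dtype=float)
    scores = audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik
    return int(np.argmax(scores))
```

In this sketch, the canonical variates `(X - X.mean(axis=0)) @ Wx` and `(Y - Y.mean(axis=0)) @ Wy` would serve as the correlated feature-level representation, while `fuse_decisions` would be applied to the class scores of the recognizers, with `audio_weight` lowered as the acoustic noise level rises.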
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, JS., Ebrahimi, T. (2009). Two-Level Bimodal Association for Audio-Visual Speech Recognition. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2009. Lecture Notes in Computer Science, vol 5807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04697-1_13
DOI: https://doi.org/10.1007/978-3-642-04697-1_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04696-4
Online ISBN: 978-3-642-04697-1
eBook Packages: Computer Science, Computer Science (R0)