Abstract
Under acoustically distorted conditions, any available video information is especially helpful for increasing recognition robustness. However, an optimal strategy for integrating audio and video information is difficult to find, since both streams may independently suffer from time-varying degrees of distortion. In this chapter, we show how missing-feature techniques for coupled HMMs can help fuse information from both uncertain sources. We also focus on estimating the reliability of the video feature stream, which is obtained from a linear discriminant analysis (LDA) applied to a set of shape- and appearance-based features. The approach has yielded significant performance improvements under strongly distorted conditions while, in conjunction with stream weight tuning, remaining lower-bounded in performance by the better of the two single-stream recognizers under all tested conditions.
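The stream weight tuning mentioned above can be illustrated by the exponential weighting rule commonly used for multi-stream HMMs, where per-state log-likelihoods of the audio and video streams are combined with a weight that reflects the relative reliability of each stream. The sketch below is illustrative only — the function name, the convex-weight parameterization, and the example likelihood values are assumptions, not the chapter's actual implementation:

```python
import numpy as np

def fused_log_likelihood(log_p_audio, log_p_video, lambda_video):
    """Combine per-state log-likelihoods of the audio and video streams
    with an exponential stream weight (a common multi-stream HMM fusion
    rule); lambda_video in [0, 1] encodes video-stream reliability."""
    lambda_audio = 1.0 - lambda_video
    return lambda_audio * log_p_audio + lambda_video * log_p_video

# Hypothetical per-state log-likelihoods for three HMM states:
log_p_a = np.array([-12.0, -3.5, -8.0])   # audio stream
log_p_v = np.array([-5.0, -6.0, -4.0])    # video stream

# Under heavy acoustic noise the video stream is weighted up;
# under clean audio it is weighted down.
noisy = fused_log_likelihood(log_p_a, log_p_v, lambda_video=0.7)
clean = fused_log_likelihood(log_p_a, log_p_v, lambda_video=0.2)

# With these example values, the most likely state flips depending
# on the stream weight, which is why tuning it matters.
print(int(np.argmax(noisy)), int(np.argmax(clean)))
```

Note that the choice of weight is exactly where stream-reliability estimates (e.g. from the LDA-based video features) enter: a time-varying weight lets the recognizer fall back toward the less distorted stream frame by frame.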
© 2011 Springer-Verlag Berlin Heidelberg
Cite this chapter
Vorwerk, A., Zeiler, S., Kolossa, D., Astudillo, R.F., Lerch, D. (2011). Use of Missing and Unreliable Data for Audiovisual Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering