Abstract
Under acoustically distorted conditions, any available video information is especially helpful for increasing recognition robustness. However, an optimal strategy for integrating audio and video information is difficult to find, since both streams may independently suffer from time-varying degrees of distortion. In this chapter, we show how missing-feature techniques for coupled HMMs can help fuse information from both uncertain sources. We also focus on estimating the reliability of the video feature stream, which is obtained from a linear discriminant analysis (LDA) applied to a set of shape- and appearance-based features. The approach has yielded significant performance improvements under strongly distorted conditions while, in conjunction with stream weight tuning, remaining lower-bounded in performance by the better of the two single-stream recognizers under all tested conditions.
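The stream weight tuning mentioned above can be illustrated by the exponential weighting rule commonly used for multi-stream HMMs, where per-state log-likelihoods of the audio and video streams are combined with a weight that reflects the relative reliability of each stream. The sketch below is illustrative only — the function name, the convex-weight parameterization, and the example likelihood values are assumptions, not the chapter's actual implementation:

```python
import numpy as np

def fused_log_likelihood(log_p_audio, log_p_video, lambda_video):
    """Combine per-state log-likelihoods of the audio and video streams
    with an exponential stream weight (a common multi-stream HMM fusion
    rule); lambda_video in [0, 1] encodes video-stream reliability."""
    lambda_audio = 1.0 - lambda_video
    return lambda_audio * log_p_audio + lambda_video * log_p_video

# Hypothetical per-state log-likelihoods for three HMM states:
log_p_a = np.array([-12.0, -3.5, -8.0])   # audio stream
log_p_v = np.array([-5.0, -6.0, -4.0])    # video stream

# Under heavy acoustic noise the video stream is weighted up;
# under clean audio it is weighted down.
noisy = fused_log_likelihood(log_p_a, log_p_v, lambda_video=0.7)
clean = fused_log_likelihood(log_p_a, log_p_v, lambda_video=0.2)

# With these example values, the most likely state flips depending
# on the stream weight, which is why tuning it matters.
print(int(np.argmax(noisy)), int(np.argmax(clean)))
```

Note that the choice of weight is exactly where stream-reliability estimates (e.g. from the LDA-based video features) enter: a time-varying weight lets the recognizer fall back toward the less distorted stream frame by frame.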
© 2011 Springer-Verlag Berlin Heidelberg
Cite this chapter
Vorwerk, A., Zeiler, S., Kolossa, D., Astudillo, R.F., Lerch, D. (2011). Use of Missing and Unreliable Data for Audiovisual Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21316-8
Online ISBN: 978-3-642-21317-5
eBook Packages: Engineering