Use of Missing and Unreliable Data for Audiovisual Speech Recognition

Abstract

Under acoustically distorted conditions, any available video information is especially helpful for increasing recognition robustness. However, an optimal strategy for integrating audio and video information is difficult to find, since both streams may independently suffer from time-varying degrees of distortion. In this chapter, we show how missing-feature techniques for coupled HMMs can help us fuse information from both uncertain information sources. We also focus on estimating the reliability of the video feature stream, which is obtained from a linear discriminant analysis (LDA) applied to a set of shape- and appearance-based features. The approach has resulted in significant performance improvements under strongly distorted conditions while, in conjunction with stream weight tuning, remaining lower-bounded in performance by the better of the two single-stream recognizers under all tested conditions.
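To make the fusion strategy concrete, here is a minimal Python sketch of the two ingredients named above: per-dimension missing-feature marginalization and exponent-based stream weighting of an audio and a video stream in a single state score. This is an illustrative sketch, not the chapter's implementation; the Gaussian parameters, the stream weights w_audio/w_video, and the reliability masks are hypothetical stand-ins for quantities that would come from the trained coupled HMM and the reliability estimators.

import numpy as np

def masked_log_likelihood(x, mu, sigma, reliable):
    # Diagonal-Gaussian log-likelihood in which dimensions flagged as
    # unreliable are marginalized out (missing-feature decoding): their
    # per-dimension densities integrate to one and drop from the sum.
    logp = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return logp[reliable].sum()

def av_state_score(x_audio, x_video, state, w_audio=0.7, w_video=0.3):
    # Stream-weighted score for one coupled-HMM state: each stream's
    # likelihood is raised to its stream weight, i.e. scaled in the log domain.
    ll_a = masked_log_likelihood(x_audio, state["mu_a"], state["sig_a"], state["rel_a"])
    ll_v = masked_log_likelihood(x_video, state["mu_v"], state["sig_v"], state["rel_v"])
    return w_audio * ll_a + w_video * ll_v

# Toy usage with made-up dimensions (13 audio features, 4 LDA video features):
rng = np.random.default_rng(0)
state = {
    "mu_a": np.zeros(13), "sig_a": np.ones(13),
    "mu_v": np.zeros(4),  "sig_v": np.ones(4),
    "rel_a": rng.random(13) > 0.3,                 # e.g. from an SNR-based mask
    "rel_v": np.array([True, True, False, True]),  # e.g. from video reliability estimation
}
print(av_state_score(rng.normal(size=13), rng.normal(size=4), state))

Because unreliable dimensions contribute nothing to either stream's score, tuning the stream weights lets the decoder fall back gracefully on whichever stream is currently the more trustworthy one.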

Author information

Correspondence to Alexander Vorwerk.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Vorwerk, A., Zeiler, S., Kolossa, D., Astudillo, R.F., Lerch, D. (2011). Use of Missing and Unreliable Data for Audiovisual Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_13

  • DOI: https://doi.org/10.1007/978-3-642-21317-5_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21316-8

  • Online ISBN: 978-3-642-21317-5

  • eBook Packages: Engineering (R0)
