DOI: 10.1145/2141622.2141646
research-article

Audio visual speech recognition in noisy visual environments

Published: 25 May 2011

Abstract

Speech recognition is a natural means of interaction between a human and a smart assistive environment. For this interaction to be effective, such a system should attain a high recognition rate even under adverse conditions. Audio-visual speech recognition (AVSR) can help in such environments, especially in the presence of audio noise. However, the impact of visual noise on its performance has not been studied sufficiently in the literature. In this paper, we examine the effects of visual noise on AVSR, reporting experiments on the relatively simple task of connected digit recognition under moderate acoustic noise and a variety of types of visual noise. The latter can be caused by either faulty sensors or video signal transmission problems that can occur in smart assistive environments. Our AVSR system exhibits higher accuracy than an audio-only recognizer and robust performance in most of the noisy video conditions considered.
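As a rough illustration of the kind of experiment the abstract describes, the sketch below corrupts a grayscale mouth-region frame with two common kinds of visual noise and then extracts low-frequency 2D DCT coefficients as visual features. It is not the authors' pipeline. The abstract does not name the noise types or the feature extraction steps; the DCT appears only among the author tags. The noise models (additive Gaussian, salt-and-pepper), the function names, and the `keep` parameter are all assumptions made for this example.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row scaling makes the matrix orthonormal
    return m


def dct2(block: np.ndarray) -> np.ndarray:
    """2D DCT-II of a 2D array, applied along rows and then columns."""
    rows, cols = block.shape
    return dct_matrix(rows) @ block @ dct_matrix(cols).T


def add_gaussian_noise(frame, sigma, rng):
    """Assumed noise model 1: additive Gaussian noise, clipped to [0, 255]."""
    noisy = frame + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0.0, 255.0)


def add_salt_and_pepper(frame, fraction, rng):
    """Assumed noise model 2: about `fraction` of pixels forced to 0 or 255."""
    noisy = frame.copy()
    mask = rng.random(frame.shape)
    noisy[mask < fraction / 2] = 0.0
    noisy[mask > 1 - fraction / 2] = 255.0
    return noisy


def low_freq_features(frame, keep=4):
    """Keep the top-left keep x keep DCT coefficients as a feature vector."""
    return dct2(frame)[:keep, :keep].ravel()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 16 x 16 mouth region with a horizontal intensity ramp.
    roi = np.tile(np.linspace(50.0, 200.0, 16), (16, 1))
    clean = low_freq_features(roi)
    for name, noisy in [
        ("gaussian", add_gaussian_noise(roi, 20.0, rng)),
        ("salt-and-pepper", add_salt_and_pepper(roi, 0.1, rng)),
    ]:
        drift = np.linalg.norm(low_freq_features(noisy) - clean)
        print(f"{name}: feature drift = {drift:.2f}")
```

Comparing the feature vectors of clean and corrupted frames gives a quick, informal sense of how much a given noise type disturbs the visual stream before any recognizer is involved.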

Published In

PETRA '11: Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments
May 2011
401 pages
ISBN:9781450307727
DOI:10.1145/2141622

Sponsors

  • NSF: National Science Foundation
  • Foundation of the Hellenic World
  • ICS-FORTH: Institute of Computer Science, Foundation for Research and Technology - Hellas
  • U of Tex at Arlington: University of Texas at Arlington
  • UCG: University of Central Greece
  • Didaskaleio Konstantinos Karatheodoris, University of the Aegean
  • Fulbright, Greece: Fulbright Foundation, Greece
  • Ionian: Ionian University, Greece

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automatic speech recognition
  2. computer vision
  3. discrete cosine transform
  4. hidden Markov models
  5. multi-modality

Qualifiers

  • Research-article

Conference

PETRA '11

Cited By

  • (2022) Improved Lite Audio-Visual Speech Enhancement. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, pp. 1345-1359. DOI: 10.1109/TASLP.2022.3153265. Online publication date: 24-Feb-2022
  • (2019) Improved Features and Dynamic Stream Weight Adaption for Robust Audio-Visual Speech Recognition Framework. Digital Signal Processing. DOI: 10.1016/j.dsp.2019.02.016. Online publication date: Mar-2019
  • (2015) TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech. IEEE Transactions on Multimedia, 17:5, pp. 603-615. DOI: 10.1109/TMM.2015.2407694. Online publication date: May-2015
  • (2012) Audio-visual speech recognition using depth information from the Kinect in noisy video conditions. Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1-4. DOI: 10.1145/2413097.2413100. Online publication date: 6-Jun-2012
  • (2012) Audio-visual vibraphone transcription in real time. 2012 IEEE 14th International Workshop on Multimedia Signal Processing (MMSP), pp. 215-220. DOI: 10.1109/MMSP.2012.6343443. Online publication date: Sep-2012
