
Multi-modal Speech Processing Methods: An Overview and Future Research Directions Using a MATLAB Based Audio-Visual Toolbox

  • Conference paper
Multimodal Signals: Cognitive and Algorithmic Issues

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5398)


Abstract

This paper presents an overview of the main multi-modal speech enhancement methods reported to date. In particular, a new MATLAB-based toolbox developed by Barbosa et al. (2007) for processing audio-visual data is reviewed and its performance potential is evaluated. It is shown that the toolbox is not a complete and comprehensive speech processing solution, but rather serves as a standardised yet versatile base on which further research can build. To demonstrate this versatility, preliminary examples that apply these computational procedures to an audio-visual corpus are presented. Finally, future research directions in multi-modal speech processing are outlined. These include work the authors aim to carry out with the aid of this new audio-visual MATLAB toolbox, such as customising the toolbox and processing noisy speech in real-world environments.

References

  1. Haykin, S., Chen, Z.: The Cocktail Party Problem. Neural Computation 17(9), 1875–1902 (2005)

  2. Sumby, W.H., Pollack, I.: Visual Contribution to Speech Intelligibility in Noise. J. Acoust. Soc. Am. 26(2), 212–215 (1954)

  3. Schwartz, J.L., Berthommier, F., Savariaux, C.: Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception. In: ICSLP 2002, pp. 1937–1940 (2002)

  4. Barker, J., Shao, X.: Audio-Visual Speech Fragment Decoding. In: AVSP 2007, paper L5-2 (accepted, 2007)

  5. Almajai, I., Milner, B.: Maximising Audio-Visual Speech Correlation. In: AVSP 2007, paper P16 (accepted, 2007)

  6. Barbosa, A.V., Yehia, H.C., Vatikiotis-Bateson, E.: MATLAB toolbox for audiovisual speech processing. In: AVSP 2007, paper P38 (accepted, 2007)

  7. Rivet, B., Girin, L., Jutten, C.: Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures. IEEE Trans. on Audio, Speech, and Lang. Processing 15(1), 96–108 (2007)

  8. Almajai, I., Milner, B., Darch, J., Vaseghi, S.: Visually-Derived Wiener Filters for Speech Enhancement. In: ICASSP 2007, vol. 4, pp. IV-585–IV-588 (2007)

  9. Scanlon, P., Reilly, R.: Feature analysis for automatic speechreading. In: 2001 IEEE Fourth Workshop on Multimedia Signal Processing, pp. 625–630 (2001)

  10. Hazen, T.J., Saenko, K., La, C.H., Glass, J.R.: A Segment Based Audio-Visual Speech Recognizer: Data Collection, Development, and Initial Experiments. In: ICMI 2004: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 235–242 (2004)

  11. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent Advances in the Automatic Recognition of Audiovisual Speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)

  12. Goecke, R.: Current Trends In Joint Audio-Video Signal Processing: A Review. In: Proceedings of the Eighth Int. Symposium on Signal Processing and Its Applications, pp. 70–73 (2005)

  13. Potamianos, G., Neti, C., Deligne, S.: Joint Audio-Visual Speech Processing for Recognition and Enhancement. In: AVSP 2003, pp. 95–104 (2003)

  14. Sanderson, C.: Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag (2008)

  15. Lee, B., Hasegawa-Johnson, M., Goudeseune, C., Kamdar, S., Borys, S., Liu, M., Huang, T.: AVICAR: audio-visual speech corpus in a car environment. In: Interspeech 2004, pp. 2489–2492 (2004)

  16. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)


Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abel, A., Hussain, A. (2009). Multi-modal Speech Processing Methods: An Overview and Future Research Directions Using a MATLAB Based Audio-Visual Toolbox. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science, vol. 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_12

  • DOI: https://doi.org/10.1007/978-3-642-00525-1_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00524-4

  • Online ISBN: 978-3-642-00525-1

  • eBook Packages: Computer Science; Computer Science (R0)
