Abstract
Recent research has focused on fusing audio and visual features for reliable speech recognition in noisy environments. In this paper, we propose a neural-network-based model for robust speech recognition that integrates audio, visual, and contextual information. The Bimodal Neural Network (BMNN) is a four-layer multi-layer perceptron that combines audio and visual features of speech to compensate for the loss of audio information caused by noise. To further improve recognition accuracy in noisy environments, we also propose a post-processing step based on contextual information, namely the sequential patterns of words spoken by a user. Our experimental results show that the model outperforms every single-modality model. In particular, with the contextual information we obtain over 90% recognition accuracy even in noisy environments, a significant improvement over the state of the art in speech recognition.
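The two ideas in the abstract, bimodal fusion in a four-layer perceptron and context-based post-processing, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give layer sizes, the fusion point, or the form of the contextual model, so everything below is an assumption: "four layers" is read as input, two hidden layers, and output; the streams are fused by concatenating feature vectors at the input; and the sequential word patterns are approximated by a hypothetical word-to-word transition table. Weights are random and untrained, and all names and dimensions are illustrative.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, layer, activation):
    """Apply one fully connected layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bimodal_forward(audio, visual, layers):
    """Assumed topology: fused input -> hidden 1 -> hidden 2 -> word posteriors."""
    h = audio + visual                 # fusion by concatenation (assumption)
    h = dense(h, layers[0], sigmoid)   # first hidden layer
    h = dense(h, layers[1], sigmoid)   # second hidden layer
    logits = dense(h, layers[2], lambda z: z)
    return softmax(logits)


def rescore_with_context(posteriors, prev_word, transition):
    """Weight each word posterior by P(word | prev_word), then renormalize."""
    weighted = [p * transition[prev_word][w] for w, p in enumerate(posteriors)]
    total = sum(weighted)
    return [v / total for v in weighted]


# Illustrative dimensions, not taken from the paper.
N_AUDIO, N_VISUAL, N_HIDDEN, N_WORDS = 6, 4, 5, 3
rng = random.Random(0)
layers = [
    init_layer(N_AUDIO + N_VISUAL, N_HIDDEN, rng),
    init_layer(N_HIDDEN, N_HIDDEN, rng),
    init_layer(N_HIDDEN, N_WORDS, rng),
]

audio = [rng.random() for _ in range(N_AUDIO)]    # stand-in for acoustic features
visual = [rng.random() for _ in range(N_VISUAL)]  # stand-in for lip-region features

# Hypothetical transition table: row = previous word, column = next word.
transition = [
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.2, 0.2],
]

posteriors = bimodal_forward(audio, visual, layers)
rescored = rescore_with_context(posteriors, prev_word=0, transition=transition)
```

When the network output is uninformative (as it might be under heavy noise), the rescoring step falls back on the context: a uniform posterior is mapped to the corresponding row of the transition table. That is the intuition behind using word-sequence patterns to recover accuracy, under the assumptions stated above.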
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, M.W., Ryu, J.W., Kim, E.J. (2006). Speech Recognition with Multi-modal Features Based on Neural Networks. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893257_55
DOI: https://doi.org/10.1007/11893257_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46481-5
Online ISBN: 978-3-540-46482-2
eBook Packages: Computer Science, Computer Science (R0)