Abstract
Audio-visual speech recognition solves the multimodal lip-reading task using both audio and visual information, and is an important way to improve speech recognition performance in noisy conditions. Deep learning methods have achieved promising results here, but they rely on complex network architectures and are computationally intensive. Recently, Spiking Neural Networks (SNNs) have attracted attention because they are event-driven and enable low-power computing. SNNs can capture richer motion information and have been successful in tasks such as gesture recognition, but they have not yet been widely applied to lip-reading. Liquid State Machines (LSMs) are a class of SNNs valued for their low training cost and are well suited to spatiotemporal event-stream problems; multimodal lip-reading based on Dynamic Vision Sensors (DVS) is exactly such a problem. We therefore propose a soft fusion framework built on an LSM that fuses visual and audio information to achieve more reliable lip recognition. On the well-known public LRW dataset, our fusion network achieves a recognition accuracy of 86.8%, an improvement of 5% to 6% over single-modality recognition. In addition, when extra noise is added to the raw data, the fusion model significantly outperforms the audio-only model, demonstrating its robustness.
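To make the pipeline concrete, below is a minimal NumPy sketch of the general idea the abstract describes: each modality's spike stream is run through a fixed (untrained) leaky integrate-and-fire reservoir, and a sigmoid gate softly re-weights the concatenated reservoir states before a linear readout. All layer sizes, time constants, weight scales, and the specific gating form are illustrative assumptions for exposition, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsm_reservoir(spikes, n_neurons=128, tau=20.0, v_th=1.0, dt=1.0, seed=0):
    """Run a fixed (untrained) LIF reservoir over a binary spike train.

    spikes: (T, n_in) array of 0/1 input spikes.
    Returns the reservoir state: low-pass-filtered spike counts, (n_neurons,).
    """
    r = np.random.default_rng(seed)
    T, n_in = spikes.shape
    w_in = r.normal(0, 0.5, (n_in, n_neurons))        # fixed input weights
    w_rec = r.normal(0, 0.1, (n_neurons, n_neurons))  # fixed recurrent weights
    v = np.zeros(n_neurons)       # membrane potentials
    out = np.zeros(n_neurons)     # previous-step reservoir spikes
    trace = np.zeros(n_neurons)   # filtered spike trace used as readout state
    decay = np.exp(-dt / tau)
    for t in range(T):
        v = decay * v + spikes[t] @ w_in + out @ w_rec
        out = (v >= v_th).astype(float)
        v[out > 0] = 0.0          # reset neurons that fired
        trace = decay * trace + out
    return trace

# Toy spike streams standing in for DVS (visual) and cochlea-style (audio) events.
T = 100
vis_spikes = (rng.random((T, 64)) < 0.05).astype(float)
aud_spikes = (rng.random((T, 32)) < 0.10).astype(float)

h_v = lsm_reservoir(vis_spikes, seed=1)   # visual reservoir state
h_a = lsm_reservoir(aud_spikes, seed=2)   # audio reservoir state

# Soft fusion: a sigmoid gate re-weights each modality's features before a
# linear readout. The weights are random here; in practice the gate and the
# readout would be trained (e.g., with cross-entropy on the LRW word labels).
z = np.concatenate([h_v, h_a])
w_gate = rng.normal(0, 0.1, (z.size, z.size))
gate = 1.0 / (1.0 + np.exp(-(w_gate @ z)))  # per-feature soft mask in (0, 1)
fused = gate * z

n_classes = 500                             # LRW contains 500 word classes
w_out = rng.normal(0, 0.1, (n_classes, fused.size))
logits = w_out @ fused
print("predicted word id:", int(np.argmax(logits)))
```

Even in this toy version, the appeal of the LSM formulation is visible: the reservoir weights stay fixed, so only the gate and readout would need training, which is what keeps the training cost low relative to end-to-end deep networks.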