Abstract
Audio-visual speech recognition solves the multimodal lip-reading task using both audio and visual information, and is an important way to improve speech recognition performance in noisy conditions. Deep learning methods have achieved promising results here, but they rely on complex network architectures and are computationally intensive. Recently, Spiking Neural Networks (SNNs) have attracted attention because they are event-driven and enable low-power computing. SNNs can capture richer motion information and have been successful in tasks such as gesture recognition, but they have not yet been widely applied to lip-reading. Liquid State Machines (LSMs) are a class of SNNs valued for their low training cost and are well suited to spatiotemporal event-stream problems; multimodal lip-reading based on Dynamic Vision Sensors (DVS) is exactly such a problem. We therefore propose a soft fusion framework built on an LSM that fuses visual and audio information to achieve more reliable lip recognition. On the well-known public LRW dataset, our fusion network achieves a recognition accuracy of 86.8%, an improvement of 5% to 6% over single-modality recognition. In addition, when extra noise is added to the raw data, the fusion model significantly outperforms the audio-only model, demonstrating its robustness.
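To make the pipeline concrete, below is a minimal NumPy sketch of the general idea the abstract describes: each modality's spike stream is run through a fixed (untrained) leaky integrate-and-fire reservoir, and a sigmoid gate softly re-weights the concatenated reservoir states before a linear readout. All layer sizes, time constants, weight scales, and the specific gating form are illustrative assumptions for exposition, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsm_reservoir(spikes, n_neurons=128, tau=20.0, v_th=1.0, dt=1.0, seed=0):
    """Run a fixed (untrained) LIF reservoir over a binary spike train.

    spikes: (T, n_in) array of 0/1 input spikes.
    Returns the reservoir state: low-pass-filtered spike counts, (n_neurons,).
    """
    r = np.random.default_rng(seed)
    T, n_in = spikes.shape
    w_in = r.normal(0, 0.5, (n_in, n_neurons))        # fixed input weights
    w_rec = r.normal(0, 0.1, (n_neurons, n_neurons))  # fixed recurrent weights
    v = np.zeros(n_neurons)       # membrane potentials
    out = np.zeros(n_neurons)     # previous-step reservoir spikes
    trace = np.zeros(n_neurons)   # filtered spike trace used as readout state
    decay = np.exp(-dt / tau)
    for t in range(T):
        v = decay * v + spikes[t] @ w_in + out @ w_rec
        out = (v >= v_th).astype(float)
        v[out > 0] = 0.0          # reset neurons that fired
        trace = decay * trace + out
    return trace

# Toy spike streams standing in for DVS (visual) and cochlea-style (audio) events.
T = 100
vis_spikes = (rng.random((T, 64)) < 0.05).astype(float)
aud_spikes = (rng.random((T, 32)) < 0.10).astype(float)

h_v = lsm_reservoir(vis_spikes, seed=1)   # visual reservoir state
h_a = lsm_reservoir(aud_spikes, seed=2)   # audio reservoir state

# Soft fusion: a sigmoid gate re-weights each modality's features before a
# linear readout. The weights are random here; in practice the gate and the
# readout would be trained (e.g., with cross-entropy on the LRW word labels).
z = np.concatenate([h_v, h_a])
w_gate = rng.normal(0, 0.1, (z.size, z.size))
gate = 1.0 / (1.0 + np.exp(-(w_gate @ z)))  # per-feature soft mask in (0, 1)
fused = gate * z

n_classes = 500                             # LRW contains 500 word classes
w_out = rng.normal(0, 0.1, (n_classes, fused.size))
logits = w_out @ fused
print("predicted word id:", int(np.argmax(logits)))
```

Even in this toy version, the appeal of the LSM formulation is visible: the reservoir weights stay fixed, so only the gate and readout would need training, which is what keeps the training cost low relative to end-to-end deep networks.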