
Multimodal Learning of Audio-Visual Speech Recognition with Liquid State Machine

  • Conference paper
  • Included in the conference proceedings: Neural Information Processing (ICONIP 2022)
  • Part of the book series: Communications in Computer and Information Science (CCIS, volume 1793)

Abstract

Audio-visual speech recognition addresses the multimodal lip-reading task by combining audio and visual information, and is an important way to improve speech recognition performance in noisy conditions. Deep learning methods have achieved promising results on this task, but their network architectures are complex and computationally intensive. Recently, Spiking Neural Networks (SNNs) have attracted attention because they are event-driven and enable low-power computing. SNNs can capture rich motion information and have been successful in tasks such as gesture recognition, but they have not been widely applied to lip reading. Among SNNs, Liquid State Machines (LSMs) stand out for their low training cost and are well suited to spatiotemporal sequence problems over event streams; multimodal lip reading based on Dynamic Vision Sensors (DVS) is exactly such a problem. We therefore propose a soft fusion framework built on an LSM. The framework fuses visual and audio information to achieve more reliable lip recognition. On the well-known public LRW dataset, our fusion network achieves a recognition accuracy of 86.8%; compared with single-modality recognition, the fusion method improves accuracy by 5% to 6%. In addition, when we add extra noise to the raw data, the fusion model significantly outperforms the audio-only model, demonstrating the robustness of our approach.
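Since the abstract only names the ingredients (an LSM reservoir per modality and a soft fusion of the resulting features), the following minimal Python/NumPy sketch shows how such a pipeline could fit together. It is an illustrative assumption throughout: the reservoir size, LIF parameters, connection probabilities, and sigmoid gating rule are hypothetical placeholders, not the architecture reported in the paper.

    # Illustrative sketch only: reservoir size, LIF dynamics, connection
    # probability, and the gating rule are hypothetical placeholders, not
    # the authors' published configuration.
    import numpy as np

    rng = np.random.default_rng(0)

    def lsm_reservoir(spikes, n_res=512, tau=20.0, v_th=1.0, p_conn=0.1):
        """Run a binary spike train (T x n_in) through a fixed random LIF
        reservoir; return a low-pass spike trace (n_res,) as the liquid state."""
        _, n_in = spikes.shape
        w_in = rng.normal(0.0, 0.5, (n_in, n_res)) * (rng.random((n_in, n_res)) < p_conn)
        w_rec = rng.normal(0.0, 0.1, (n_res, n_res)) * (rng.random((n_res, n_res)) < p_conn)
        v = np.zeros(n_res)      # membrane potentials
        s = np.zeros(n_res)      # spikes emitted at the previous step
        trace = np.zeros(n_res)  # exponentially filtered spike counts (readout state)
        decay = np.exp(-1.0 / tau)
        for x_t in spikes:
            v = v * decay + x_t @ w_in + s @ w_rec  # leaky integration of inputs
            s = (v >= v_th).astype(float)           # threshold crossing -> spike
            v[s > 0] = 0.0                          # reset fired neurons
            trace = 0.95 * trace + s
        return trace

    def soft_fusion(h_audio, h_visual):
        """Soft fusion: a per-dimension gate weights each modality before the
        fused vector is passed to a (separately trained) linear readout."""
        g = 1.0 / (1.0 + np.exp(-(h_audio - h_visual)))  # toy gating score in (0, 1)
        return np.concatenate([g * h_audio, (1.0 - g) * h_visual])

    # Toy usage with random spikes standing in for DVS events and encoded audio.
    vis_spikes = (rng.random((100, 128)) < 0.05).astype(float)
    aud_spikes = (rng.random((100, 64)) < 0.05).astype(float)
    fused = soft_fusion(lsm_reservoir(aud_spikes), lsm_reservoir(vis_spikes))
    print(fused.shape)  # (1024,) -> input to the trained classifier/readout

In practice the readout over the fused vector would be trained (e.g., a linear classifier over the 500 LRW word classes) while the reservoir weights stay fixed; that fixed-reservoir property is what gives LSMs their low training cost.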



Author information

Corresponding author

Correspondence to Lei Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yu, X., Wang, L., Chen, C., Tie, J., Guo, S. (2023). Multimodal Learning of Audio-Visual Speech Recognition with Liquid State Machine. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_46


  • DOI: https://doi.org/10.1007/978-981-99-1645-0_46

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1644-3

  • Online ISBN: 978-981-99-1645-0

  • eBook Packages: Computer Science, Computer Science (R0)
