
Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation

  • Conference paper
  • Speech and Computer (SPECOM 2021)

Abstract

This work presents a scalable solution for speaker-dependent visual command recognition in the vehicle cabin. The goal is to recognize a limited number of the most frequent driver requests based on lip movements. Unlike previous works that focused on automated lip-reading in controlled laboratory environments, we tackle this problem in real driving conditions using the recorded RUSAVIC dataset. By limiting the scope of the task to speaker-dependent recognition and a vocabulary of 50 phrases, the models that we train surpass the performance of previous work and can be used in real-life speech recognition applications. To achieve this, we constructed an end-to-end methodology that requires only 10 repetitions of each phrase to reach a recognition accuracy of up to 54% based purely on video information. Our key contributions are: (1) we introduce a novel approach to visual speech data preprocessing and labeling, designed to handle real-life driver data recorded in the vehicle cabin; (2) we investigate to what extent lip-reading is complementary for improving visual command recognition, depending on the set of recognizable commands; (3) we adapt for our task, train, and compare three state-of-the-art CNN architectures, namely MobileNetV2, DenseNet121, and NASNetMobile, to evaluate the performance of the developed system. The proposed system achieved a word recognition rate (WRR) of 55% in the scenario with the vehicle parked at a crossroad and 54% in driving scenarios.
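The abstract does not include implementation details, so the following is only a minimal sketch of the kind of per-frame CNN classifier it describes: a pretrained backbone (here MobileNetV2; DenseNet121 or NASNetMobile can be swapped in via the corresponding tf.keras.applications constructors) applied to a sequence of cropped lip-region frames, pooled over time, with a 50-way softmax over the command vocabulary. The clip length, crop size, temporal pooling, and training settings below are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a speaker-dependent visual command classifier,
# assuming lip-ROI crops have already been extracted from the video.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_COMMANDS = 50      # vocabulary size stated in the abstract
FRAMES = 30            # assumed number of frames sampled per phrase
LIP_H, LIP_W = 96, 96  # assumed lip-ROI crop size

# Per-frame backbone: MobileNetV2 without its ImageNet classification head;
# global average pooling turns each frame into a fixed-length feature vector.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(LIP_H, LIP_W, 3), include_top=False,
    weights="imagenet", pooling="avg")

inputs = layers.Input(shape=(FRAMES, LIP_H, LIP_W, 3))
# Apply the same CNN to every frame, then average the features over time.
frame_feats = layers.TimeDistributed(backbone)(inputs)  # (batch, FRAMES, D)
clip_feats = layers.GlobalAveragePooling1D()(frame_feats)
outputs = layers.Dense(NUM_COMMANDS, activation="softmax")(clip_feats)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(lip_clips, command_ids, ...)  # 10 repetitions per phrase per speaker
```

Swapping in DenseNet121 is a one-line change; note that NASNetMobile's ImageNet weights require a 224 × 224 input, so the crop size assumed above would need adjusting (or weights=None).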



Acknowledgments

This research is financially supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).

Author information


Corresponding author

Correspondence to Denis Ivanko.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Ivanko, D., Ryumin, D., Axyonov, A., Kashevnik, A. (2021). Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
