Abstract
This work presents a scalable solution to speaker-dependent visual command recognition in a vehicle cabin. The goal is to recognize a limited number of a driver's most frequent requests based on their lip movements. Unlike previous work that focused on automated lip reading in controlled laboratory environments, we tackle this problem under real driving conditions using the recorded RUSAVIC dataset. By limiting the scope of the task to speaker-dependent recognition and a vocabulary of 50 phrases, the models we train surpass the performance of previous work and can be used in real-life speech recognition applications. To achieve this, we construct an end-to-end methodology that requires only 10 repetitions of each phrase to reach a reasonable recognition accuracy of up to 54% based purely on video information. Our key contributions are: (1) we introduce a novel approach to visual speech data preprocessing and labeling, designed to handle real-life driver data recorded in a vehicle cabin; (2) we investigate to what extent lip reading can contribute to visual command recognition, depending on the set of recognizable commands; (3) we adapt to our task, train, and compare three state-of-the-art CNN architectures, namely MobileNetV2, DenseNet121, and NASNetMobile, to evaluate the performance of the developed system. The proposed system achieves a word recognition rate (WRR) of 55% when the vehicle is parked at a crossroad and 54% in driving scenarios.
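The abstract only names the preprocessing step, so the following is a minimal sketch of one plausible lip-region extraction stage, assuming the MediaPipe Face Mesh landmark model; the landmark index list, crop margin, 88x88 output size, and the file name driver_phrase.avi are illustrative assumptions, not the authors' exact pipeline.

```python
import cv2
import mediapipe as mp
import numpy as np

# Outer-lip landmark indices of the 468-point MediaPipe face mesh.
LIP_LANDMARKS = [61, 146, 91, 181, 84, 17, 314, 405, 321, 375, 291,
                 185, 40, 39, 37, 0, 267, 269, 270, 409]

def extract_lip_roi(frame_bgr, face_mesh, margin=0.15, size=88):
    """Detect facial landmarks and crop a fixed-size region around the lips."""
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None  # no face detected in this frame
    lm = results.multi_face_landmarks[0].landmark
    xs = np.array([lm[i].x * w for i in LIP_LANDMARKS])
    ys = np.array([lm[i].y * h for i in LIP_LANDMARKS])
    # Expand the lip bounding box by a relative margin to keep some context.
    dx = (xs.max() - xs.min()) * margin
    dy = (ys.max() - ys.min()) * margin
    x0, x1 = int(max(xs.min() - dx, 0)), int(min(xs.max() + dx, w))
    y0, y1 = int(max(ys.min() - dy, 0)), int(min(ys.max() + dy, h))
    return cv2.resize(frame_bgr[y0:y1, x0:x1], (size, size))

# Usage: collect lip crops for every frame of one recorded phrase.
with mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                     max_num_faces=1) as face_mesh:
    cap = cv2.VideoCapture("driver_phrase.avi")  # hypothetical file name
    rois = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = extract_lip_roi(frame, face_mesh)
        if roi is not None:
            rois.append(roi)
    cap.release()
```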
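Likewise, here is a minimal sketch of how the three backbones could be instantiated and compared under an identical 50-way classification head, assuming the tf.keras.applications implementations; the 96x96 input size, average pooling, and optimizer are illustrative choices, and the temporal aggregation of per-frame predictions into phrase-level decisions is not shown.

```python
import tensorflow as tf

NUM_PHRASES = 50  # vocabulary size used in the paper

BACKBONES = {
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "DenseNet121": tf.keras.applications.DenseNet121,
    "NASNetMobile": tf.keras.applications.NASNetMobile,
}

def build_classifier(name, input_shape=(96, 96, 3)):
    """Attach a shared 50-way softmax head to the chosen backbone."""
    base = BACKBONES[name](include_top=False, weights=None,
                           input_shape=input_shape, pooling="avg")
    probs = tf.keras.layers.Dense(NUM_PHRASES, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, probs, name=name)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Compare the three candidate architectures under an identical head.
for name in BACKBONES:
    print(name, f"{build_classifier(name).count_params():,} parameters")
```

Per the abstract, each such model would be trained per speaker on the 10 recorded repetitions of every phrase; the parameter counts printed above also suggest why mobile-oriented backbones are attractive for in-cabin deployment.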
Acknowledgments
This research is financially supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Ivanko, D., Ryumin, D., Axyonov, A., Kashevnik, A. (2021). Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_27
DOI: https://doi.org/10.1007/978-3-030-87802-3_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3