Abstract
This work presents a scalable solution to speaker-dependent visual command recognition in a vehicle cabin. The goal is to recognize a limited number of a driver's most frequent requests based on their lip movements. Unlike previous work that focused on automated lip reading in controlled laboratory environments, we tackle this problem under real driving conditions using the recorded RUSAVIC dataset. By limiting the scope of the task to speaker-dependent recognition and a vocabulary of 50 phrases, the models we train surpass the performance of previous work and can be used in real-life speech recognition applications. To achieve this, we construct an end-to-end methodology that requires only 10 repetitions of each phrase to reach a reasonable recognition accuracy of up to 54% based purely on video information. Our key contributions are: (1) we introduce a novel approach to visual speech data preprocessing and labeling, designed to handle real-life driver data recorded in a vehicle cabin; (2) we investigate to what extent lip reading can contribute to visual command recognition, depending on the set of recognizable commands; (3) we adapt to our task, train, and compare three state-of-the-art CNN architectures, namely MobileNetV2, DenseNet121, and NASNetMobile, to evaluate the performance of the developed system. The proposed system achieves a word recognition rate (WRR) of 55% when the vehicle is parked at a crossroad and 54% in driving scenarios.
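The abstract only names the preprocessing step, so the following is a minimal sketch of one plausible lip-region extraction stage, assuming the MediaPipe Face Mesh landmark model; the landmark index list, crop margin, 88x88 output size, and the file name driver_phrase.avi are illustrative assumptions, not the authors' exact pipeline.

```python
import cv2
import mediapipe as mp
import numpy as np

# Outer-lip landmark indices of the 468-point MediaPipe face mesh.
LIP_LANDMARKS = [61, 146, 91, 181, 84, 17, 314, 405, 321, 375, 291,
                 185, 40, 39, 37, 0, 267, 269, 270, 409]

def extract_lip_roi(frame_bgr, face_mesh, margin=0.15, size=88):
    """Detect facial landmarks and crop a fixed-size region around the lips."""
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None  # no face detected in this frame
    lm = results.multi_face_landmarks[0].landmark
    xs = np.array([lm[i].x * w for i in LIP_LANDMARKS])
    ys = np.array([lm[i].y * h for i in LIP_LANDMARKS])
    # Expand the lip bounding box by a relative margin to keep some context.
    dx = (xs.max() - xs.min()) * margin
    dy = (ys.max() - ys.min()) * margin
    x0, x1 = int(max(xs.min() - dx, 0)), int(min(xs.max() + dx, w))
    y0, y1 = int(max(ys.min() - dy, 0)), int(min(ys.max() + dy, h))
    return cv2.resize(frame_bgr[y0:y1, x0:x1], (size, size))

# Usage: collect lip crops for every frame of one recorded phrase.
with mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                     max_num_faces=1) as face_mesh:
    cap = cv2.VideoCapture("driver_phrase.avi")  # hypothetical file name
    rois = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = extract_lip_roi(frame, face_mesh)
        if roi is not None:
            rois.append(roi)
    cap.release()
```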
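Likewise, here is a minimal sketch of how the three backbones could be instantiated and compared under an identical 50-way classification head, assuming the tf.keras.applications implementations; the 96x96 input size, average pooling, and optimizer are illustrative choices, and the temporal aggregation of per-frame predictions into phrase-level decisions is not shown.

```python
import tensorflow as tf

NUM_PHRASES = 50  # vocabulary size used in the paper

BACKBONES = {
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "DenseNet121": tf.keras.applications.DenseNet121,
    "NASNetMobile": tf.keras.applications.NASNetMobile,
}

def build_classifier(name, input_shape=(96, 96, 3)):
    """Attach a shared 50-way softmax head to the chosen backbone."""
    base = BACKBONES[name](include_top=False, weights=None,
                           input_shape=input_shape, pooling="avg")
    probs = tf.keras.layers.Dense(NUM_PHRASES, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, probs, name=name)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Compare the three candidate architectures under an identical head.
for name in BACKBONES:
    print(name, f"{build_classifier(name).count_params():,} parameters")
```

Per the abstract, each such model would be trained per speaker on the 10 recorded repetitions of every phrase; the parameter counts printed above also suggest why mobile-oriented backbones are attractive for in-cabin deployment.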
Acknowledgments
This research is financially supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Ivanko, D., Ryumin, D., Axyonov, A., Kashevnik, A. (2021). Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_27
DOI: https://doi.org/10.1007/978-3-030-87802-3_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3