Abstract
Speaking mode (i.e., talking or non-talking) detection is a significant research problem in the areas of HCI and computer vision. Detecting the speaking mode of a speaker is quite challenging owing to coarse-resolution images, varied interaction styles, and various kinds of noise. This paper proposes a vision-based technique to identify a human's speaking mode, in terms of talking and non-talking states, using residual neural networks. Visual lip motion is a prominent cue and plays a pivotal role in detecting a human's speaking mode. Thus, we adopt a vision-based technique rather than a voice-based one, which is prone to noise and interruption. Evaluation on two datasets shows better performance (\(99.56\%\) accuracy) in mouth state detection than previous approaches. Moreover, analysis of 36 min of video data from 15 participants reveals that the proposed technique achieved an accuracy of \(98.88\%\) in detecting speaking mode.
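The key building block of the residual networks the paper relies on is the skip connection, which adds a layer's input back to its transformed output so that the identity mapping is trivially representable. The following is a minimal NumPy toy of one fully-connected residual block; it is a sketch only, and the layer sizes, weights, and the use of dense (rather than convolutional) layers are hypothetical, not the paper's actual architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, b1, w2, b2):
    """Toy residual block: y = relu(x + F(x)), where F is a small
    two-layer transformation. Illustrative only -- the paper's real
    model uses convolutional residual blocks on mouth-region images."""
    h = relu(x @ w1 + b1)          # inner transformation F(x)
    return relu(x + h @ w2 + b2)   # skip connection adds the input back

# With all-zero weights the block reduces to relu(x): the identity
# mapping is easy to learn, which is the core ResNet property.
d = 4
x = np.array([1.0, -2.0, 3.0, 0.5])
w1 = np.zeros((d, d)); b1 = np.zeros(d)
w2 = np.zeros((d, d)); b2 = np.zeros(d)
y = residual_block(x, w1, b1, w2, b2)
```

In a full model, a stack of such blocks would feed a final two-way (talking vs. non-talking) classifier head.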
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Afroze, S., Hoque, M.M. (2021). Towards Lip Motion Based Speaking Mode Detection Using Residual Neural Networks. In: Abraham, A., et al. Proceedings of the 12th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2020). SoCPaR 2020. Advances in Intelligent Systems and Computing, vol 1383. Springer, Cham. https://doi.org/10.1007/978-3-030-73689-7_17
DOI: https://doi.org/10.1007/978-3-030-73689-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73688-0
Online ISBN: 978-3-030-73689-7
eBook Packages: Intelligent Technologies and Robotics (R0)