Abstract
Speaking mode (i.e., talking or non-talking) detection is a significant research problem in the areas of HCI and computer vision. Detecting the speaking mode of a speaker is quite challenging owing to coarse-resolution images, varied interaction styles, and various kinds of noise. This paper proposes a vision-based technique to identify a human's speaking mode, in terms of talking and non-talking states, using residual neural networks. Visual lip motion is a prominent cue and plays a pivotal role in detecting a human's speaking mode. Thus, we adopt a vision-based technique rather than a voice-based one, which is prone to noise and interruption. Evaluation on two datasets shows better performance (\(99.56\%\) accuracy) in mouth state detection than previous approaches. Moreover, analysis of 36 min of video data from 15 participants reveals that the proposed technique achieved an accuracy of \(98.88\%\) in detecting speaking mode.
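The key building block of the residual networks the paper relies on is the skip connection, which adds a layer's input back to its transformed output so that the identity mapping is trivially representable. The following is a minimal NumPy toy of one fully-connected residual block; it is a sketch only, and the layer sizes, weights, and the use of dense (rather than convolutional) layers are hypothetical, not the paper's actual architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, b1, w2, b2):
    """Toy residual block: y = relu(x + F(x)), where F is a small
    two-layer transformation. Illustrative only -- the paper's real
    model uses convolutional residual blocks on mouth-region images."""
    h = relu(x @ w1 + b1)          # inner transformation F(x)
    return relu(x + h @ w2 + b2)   # skip connection adds the input back

# With all-zero weights the block reduces to relu(x): the identity
# mapping is easy to learn, which is the core ResNet property.
d = 4
x = np.array([1.0, -2.0, 3.0, 0.5])
w1 = np.zeros((d, d)); b1 = np.zeros(d)
w2 = np.zeros((d, d)); b2 = np.zeros(d)
y = residual_block(x, w1, b1, w2, b2)
```

In a full model, a stack of such blocks would feed a final two-way (talking vs. non-talking) classifier head.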
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Afroze, S., Hoque, M.M. (2021). Towards Lip Motion Based Speaking Mode Detection Using Residual Neural Networks. In: Abraham, A., et al. Proceedings of the 12th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2020). SoCPaR 2020. Advances in Intelligent Systems and Computing, vol 1383. Springer, Cham. https://doi.org/10.1007/978-3-030-73689-7_17
DOI: https://doi.org/10.1007/978-3-030-73689-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73688-0
Online ISBN: 978-3-030-73689-7
eBook Packages: Intelligent Technologies and Robotics (R0)