Abstract
In Iranian Sign Language (ISL), alongside the movement of the fingers and arms, the dynamic movement of the lips is also essential for performing and recognizing a sign completely and correctly. As a follow-up to our previous studies on empowering the RASA social robot to interact with individuals with hearing impairments via sign language, we propose two automated lip-reading systems based on deep neural network (DNN) architectures, a CNN-LSTM and a 3D-CNN, deployed on the robotic system to recognize words from the OuluVS2 database. In the first network, a CNN extracts static visual features and an LSTM models the temporal dynamics; in the second, a 3D-CNN extracts the appropriate visual and temporal features directly from the videos. Accuracy rates of 89.44% and 86.39% were obtained for the presented CNN-LSTM and 3D-CNN networks, respectively, which is fairly promising for our automated lip-reading robotic system. Although the proposed low-complexity networks do not achieve the highest accuracy reported for this database in the literature, 1) they outperform some of the more complex and even pre-trained networks in the literature, 2) they train very quickly, and 3) they are quite appropriate and acceptable for the robotic system during sign-language-based Human-Robot Interaction (HRI).
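Both recognition networks are straightforward to prototype. The Keras sketch below is a minimal, illustrative reconstruction of the two architectures described above, not the authors' exact configuration: the clip length, mouth-region resolution, layer widths, and ten-word class count are assumptions made for the example.

from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 29, 64, 64, 1  # assumed clip length and mouth-ROI size
NUM_CLASSES = 10                     # assumed number of word classes

# Model 1: CNN-LSTM. A small CNN is applied to every frame
# (TimeDistributed) to extract static features; an LSTM then models
# the temporal dynamics across the frame sequence.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])
cnn_lstm = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    layers.TimeDistributed(frame_cnn),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Model 2: 3D-CNN. 3D convolutions extract joint spatial-temporal
# features directly from the video clip.
cnn_3d = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    layers.Conv3D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

for model in (cnn_lstm, cnn_3d):
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

The essential design difference is where temporal information enters: the CNN-LSTM factorizes the task into per-frame spatial feature extraction followed by sequence modeling, while the 3D-CNN convolves over space and time jointly.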
Acknowledgement
This research was funded by Sharif University of Technology (Grant No. G980517). The complementary and continuous support of the Social & Cognitive Robotics Laboratory through the Dr. Ali Akbar Siassi Memorial Grant is also greatly appreciated.
Cite this paper
Gholipour, A., Taheri, A., Mohammadzade, H.: Automated Lip-Reading Robotic System Based on Convolutional Neural Network and Long Short-Term Memory. In: Li, H., et al. (eds.) ICSR 2021. LNCS, vol. 13086. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-90525-5_7