Abstract
In Iranian Sign Language (ISL), alongside the movement of the fingers and arms, the dynamic movement of the lips is also essential for performing and recognizing a sign completely and correctly. As a follow-up to our previous studies on empowering the RASA social robot to interact with individuals with hearing impairments via sign language, we propose two automated lip-reading systems based on deep neural network (DNN) architectures, a CNN-LSTM and a 3D-CNN, deployed on the robotic system to recognize words from the OuluVS2 database. In the first network, a CNN extracts static visual features and an LSTM models the temporal dynamics; in the second, a 3D-CNN extracts the appropriate visual and temporal features directly from the videos. Accuracy rates of 89.44% and 86.39% were obtained for the presented CNN-LSTM and 3D-CNN networks, respectively, which is fairly promising for our automated lip-reading robotic system. Although the proposed low-complexity networks do not achieve the highest accuracy reported for this database in the literature, 1) they outperform some of the more complex and even pre-trained networks in the literature, 2) they train very quickly, and 3) they are quite appropriate and acceptable for the robotic system during sign-language-based Human-Robot Interaction (HRI).
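Both recognition networks are straightforward to prototype. The Keras sketch below is a minimal, illustrative reconstruction of the two architectures described above, not the authors' exact configuration: the clip length, mouth-region resolution, layer widths, and ten-word class count are assumptions made for the example.

from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 29, 64, 64, 1  # assumed clip length and mouth-ROI size
NUM_CLASSES = 10                     # assumed number of word classes

# Model 1: CNN-LSTM. A small CNN is applied to every frame
# (TimeDistributed) to extract static features; an LSTM then models
# the temporal dynamics across the frame sequence.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])
cnn_lstm = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    layers.TimeDistributed(frame_cnn),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Model 2: 3D-CNN. 3D convolutions extract joint spatial-temporal
# features directly from the video clip.
cnn_3d = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    layers.Conv3D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

for model in (cnn_lstm, cnn_3d):
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

The essential design difference is where temporal information enters: the CNN-LSTM factorizes the task into per-frame spatial feature extraction followed by sequence modeling, while the 3D-CNN convolves over space and time jointly.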
Acknowledgement
This research was funded by Sharif University of Technology (Grant No. G980517). The complementary and continuous support of the Social & Cognitive Robotics Laboratory through the Dr. Ali Akbar Siassi Memorial Grant is also greatly appreciated.
Cite this paper
Gholipour, A., Taheri, A., Mohammadzade, H.: Automated Lip-Reading Robotic System Based on Convolutional Neural Network and Long Short-Term Memory. In: Li, H., et al. (eds.) ICSR 2021. LNCS, vol. 13086. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-90525-5_7