Automated Lip-Reading Robotic System Based on Convolutional Neural Network and Long Short-Term Memory

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13086)

Abstract

In Iranian Sign Language (ISL), alongside the movement of the fingers and arms, the dynamic movement of the lips is also essential to perform and recognize a sign completely and correctly. Following up on our previous studies on empowering the RASA social robot to interact with individuals with hearing impairments via sign language, we propose two automated lip-reading systems based on deep neural network (DNN) architectures, a CNN-LSTM and a 3D-CNN, deployed on the robotic system to recognize words from the OuluVS2 database. In the first network, a CNN extracts static features and an LSTM models the temporal dynamics; in the second, a 3D-CNN extracts the appropriate visual and temporal features directly from the videos. Accuracy rates of 89.44% and 86.39% were obtained for the presented CNN-LSTM and 3D-CNN networks, respectively, which is fairly promising for our automated lip-reading robotic system. Although the proposed non-complex networks do not achieve the highest reported accuracy on this database, (1) they outperform some of the more complex and even pre-trained networks in the literature, (2) they train very quickly, and (3) they are quite appropriate and acceptable for the robotic system during sign-language-based Human-Robot Interaction (HRI).
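The paper's networks are not reproduced on this page; purely as an illustrative sketch, the PyTorch snippet below shows how a word-level CNN-LSTM lip reader of the kind described above might be assembled: a per-frame CNN extracts static appearance features from mouth crops, and an LSTM models their temporal dynamics before a final word classifier. Every concrete choice here (grayscale 64x64 crops, 20-frame clips, channel widths, 10 word classes, the name CnnLstmLipReader) is an assumption for illustration, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): a CNN-LSTM word classifier for
# mouth-region video clips, in the spirit of the first network described
# in the abstract. Frame size, clip length, channel widths, and the
# number of word classes are illustrative assumptions.
import torch
import torch.nn as nn

class CnnLstmLipReader(nn.Module):
    def __init__(self, num_classes=10, lstm_hidden=256):
        super().__init__()
        # Per-frame CNN: static appearance features from each
        # grayscale mouth crop (assumed 64x64 here).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32x32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 16x16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # -> 8x8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
        )
        # LSTM: models the temporal dynamics of the per-frame features.
        self.lstm = nn.LSTM(input_size=256, hidden_size=lstm_hidden,
                            batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, time, 1, 64, 64)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, *clips.shape[2:]))  # (b*t, 256)
        feats = feats.reshape(b, t, -1)                           # (b, t, 256)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])  # classify from the last time step

# Smoke test on a random batch of 4 clips, 20 frames each.
logits = CnnLstmLipReader()(torch.randn(4, 20, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```

The second network described in the abstract would instead operate on the whole clip at once, e.g. by replacing the per-frame nn.Conv2d stack and the LSTM with nn.Conv3d layers that convolve jointly over space and time before the classifier.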


References

1. McGurk, H., Macdonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746–748 (1976)

2. Erber, N.P.: Auditory visual perception of speech. J. Speech Hear. Disord. 40(4), 481–492 (1975)

3. Chiţu, A., Rothkrantz, L.J.: Automatic visual speech recognition. In: Speech Enhancement, Modeling and Recognition – Algorithms and Applications, pp. 95–120 (2012)

4. Antonakos, E., Roussos, A., Zafeiriou, S.: A survey on mouth modeling and analysis for sign language recognition. In: 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG 2015) (2015)

5. Howell, D., Cox, S., Theobald, B.: Visual units and confusion modelling for automatic lip-reading. Image Vis. Comput. 51, 1–12 (2016)

6. Hassanat, A.: Visual passwords using automatic lip reading. Int. J. Sci. Basic Appl. Res. (IJSBAR) 13, 218–231 (2014)

7. Biswas, A., Sahu, P.K., Chandra, M.: Multiple cameras audio visual speech recognition using active appearance model visual features in car environment. Int. J. Speech Technol. 19(1), 159–171 (2016). https://doi.org/10.1007/s10772-016-9332-x

8. Basiri, S., Taheri, A., Meghdari, A., Alemi, M.: Design and implementation of a robotic architecture for adaptive teaching: a case study on Iranian Sign Language. J. Intell. Rob. Syst. 102(2), 1–19 (2021). https://doi.org/10.1007/s10846-021-01413-2

9. Hosseini, S.R., Taheri, A., Meghdari, A., Alemi, M.: Teaching Persian Sign Language to a social robot via the learning from demonstrations approach. In: Salichs, M.A., et al. (eds.) ICSR 2019. LNCS (LNAI), vol. 11876, pp. 655–665. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-35888-4_61

10. Hosseini, S.R., Taheri, A., Meghdari, A., Alemi, M.: Let there be intelligence! A novel cognitive architecture for teaching assistant social robots. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 275–285 (2018)

11. Matthews, I., Cootes, T.F., Bangham, J.A., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)

12. Zhao, G., Barnard, M., Pietikäinen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)

13. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011) (2011)

14. Li, Y., Takashima, Y., Takiguchi, T., Ariki, Y.: Lip reading using a dynamic feature of lip images and convolutional neural networks. In: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS 2016) - Proceedings (2016)

15. Petridis, S., Li, Z., Pantic, M.: End-to-end visual speech recognition with LSTMs. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (2017)

16. Fernandez-Lopez, A., Sukno, F.M.: Lip-reading with limited-data network. In: European Signal Processing Conference (2019)

17. Anina, I., Zhou, Z., Zhao, G., Pietikäinen, M.: OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis. In: 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG 2015) (2015)

18. Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. Adv. Neural Inf. Process. Syst. (1997)

19. Petridis, S., Wang, Y., Ma, P., Li, Z., Pantic, M.: End-to-end visual speech recognition for small-scale datasets. Pattern Recogn. Lett. 131, 421–427 (2020). http://www.ee.oulu.fi/research/imag/OuluVS2/ACCVW.html

20. Saitoh, T., Zhou, Z., Zhao, G., Pietikäinen, M.: Concatenated frame image based CNN for visual speech recognition. In: Chen, C.S., Lu, J., Ma, K.K. (eds.) ACCV 2016. LNCS, vol. 10117, pp. 277–289. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54427-4_21

21. Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Chen, C.S., Lu, J., Ma, K. (eds.) ACCV 2016. LNCS, vol. 10117, pp. 251–263. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54427-4_19

22. Petridis, S., Wang, Y., Ma, P., Li, Z., Pantic, M.: End-to-end visual speech recognition for small-scale datasets. Pattern Recogn. Lett. 131, 421–427 (2020)

23. Basiri, S., Taheri, A., Meghdari, A.F., Boroushaki, M., Alemi, M.: Dynamic Iranian Sign Language recognition using an optimized deep neural network: an implementation via a robotic-based architecture. Int. J. Soc. Robot. (2021)

24. Hosseini, S.R., Taheri, A., Alemi, M., Meghdari, A.: One-shot learning from demonstration approach toward a reciprocal sign language-based HRI. Int. J. Soc. Robot. (2021)


Acknowledgement

This research was funded by Sharif University of Technology (Grant No. G980517). The complementary and continuous support of the Social & Cognitive Robotics Laboratory by the Dr. Ali Akbar Siassi Memorial Grant is also greatly appreciated.

Author information

Corresponding author: Correspondence to Alireza Taheri.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Gholipour, A., Taheri, A., Mohammadzade, H. (2021). Automated Lip-Reading Robotic System Based on Convolutional Neural Network and Long Short-Term Memory. In: Li, H., et al. (eds.) Social Robotics. ICSR 2021. Lecture Notes in Computer Science (LNAI), vol. 13086. Springer, Cham. https://doi.org/10.1007/978-3-030-90525-5_7


  • DOI: https://doi.org/10.1007/978-3-030-90525-5_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90524-8

  • Online ISBN: 978-3-030-90525-5

  • eBook Packages: Computer Science, Computer Science (R0)
