
Deep motion templates and extreme learning machine for sign language recognition

  • Original Article
  • Published in The Visual Computer

Abstract

Sign language is a visual language used by people with hearing and speech impairments to communicate through fingerspelling and body gestures. This paper proposes a framework for automatic sign language recognition that does not require hand segmentation. The proposed method first generates three types of motion templates: the motion history image, the dynamic image and our proposed RGB motion image. These motion templates are used to fine-tune three ConvNets pretrained on the ImageNet dataset. Fine-tuning avoids learning all the parameters from scratch, leading to faster network convergence even with a small number of training samples. To combine the outputs of the three ConvNets, we propose a fusion technique based on the kernel-based extreme learning machine (KELM). The features extracted from the last fully connected layer of each trained ConvNet are used to train three KELMs, and the final class label is predicted by averaging their scores. The proposed approach is validated on several publicly available sign language and human action recognition datasets, on which it achieves state-of-the-art results. Finally, an Indian sign language dataset is also collected using a thermal camera. The experimental results show that our ConvNet-based deep features, together with the proposed KELM-based fusion, are robust for any type of human motion recognition.
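To make the pipeline concrete, the following Python code is a minimal sketch, not the authors' implementation, of two of the building blocks named above: the motion history image template (Bobick and Davis) and kernel extreme learning machine classifiers (Huang et al.) fused by score averaging. All names (compute_mhi, KELM, fuse_predict) and hyperparameter values (tau, delta, C, gamma) are illustrative assumptions; it presumes grayscale input frames and precomputed fully-connected-layer feature matrices.

```python
# Minimal sketch (not the authors' code) of two building blocks from the
# abstract: a motion history image (MHI) template and kernel ELM
# classifiers fused by score averaging. All names and hyperparameters
# (tau, delta, C, gamma) are illustrative assumptions.
import numpy as np

def compute_mhi(frames, tau=255, delta=30):
    """MHI: recent motion appears bright, older motion decays linearly.

    `frames` is assumed to be a sequence of grayscale uint8 arrays.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        moving = np.abs(cur - prev) > delta          # binary motion mask
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
        prev = cur
    return mhi.astype(np.uint8)

class KELM:
    """Kernel extreme learning machine with an RBF kernel."""

    def __init__(self, C=100.0, gamma=1e-3):
        self.C, self.gamma = C, gamma

    def _kernel(self, A, B):
        # Pairwise squared Euclidean distances -> RBF kernel matrix.
        d2 = ((A**2).sum(1)[:, None] + (B**2).sum(1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-self.gamma * d2)

    def fit(self, X, y):
        # Closed-form output weights: alpha = (I/C + K)^-1 T,
        # with T the one-hot target matrix for integer labels y.
        self.X = X
        T = np.eye(int(y.max()) + 1)[y]
        K = self._kernel(X, X)
        self.alpha = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def scores(self, X):
        return self._kernel(X, self.X) @ self.alpha

def fuse_predict(kelms, feature_sets):
    """Average the class scores of several KELMs and take the argmax."""
    avg = np.mean([k.scores(F) for k, F in zip(kelms, feature_sets)],
                  axis=0)
    return avg.argmax(axis=1)
```

In this sketch, one KELM would be fitted per motion template (MHI, dynamic image, RGB motion image) on features from the corresponding fine-tuned ConvNet, and fuse_predict would average the three score matrices before taking the argmax, mirroring the fusion described above.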


Notes

  1. This dataset can be downloaded from our website: https://www.iitr.ac.in/mvlab/research.html.


Funding

This project was funded by the SMILE Project, IIT Roorkee.

Author information


Corresponding author

Correspondence to Javed Imran.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Imran, J., Raman, B. Deep motion templates and extreme learning machine for sign language recognition. Vis Comput 36, 1233–1246 (2020). https://doi.org/10.1007/s00371-019-01725-3
