
Graph convolutional networks and LSTM for first-person multimodal hand action recognition

  • Original Paper
  • Published in Machine Vision and Applications

Abstract

Graph convolutional networks (GCNs) have been successfully applied to skeleton-based human action recognition. Both human skeletons and hand skeletons are composed of open-loop kinematic chains, in which each chain consists of rigid links (corresponding to bones) and revolute pairs (corresponding to joints). Despite this similarity, no prior skeleton-based hand action recognition method has represented hand skeletons using GCNs. We first evaluate the effectiveness of traditional spatial–temporal GCNs for skeleton-based hand action recognition. We then improve the traditional spatial–temporal GCNs by incorporating third-order node information, namely the geometric relationships between neighboring connected bones in a hand skeleton; these relationships are described by a Lie group and comprise relative translations and rotations. Finally, we study first-person multimodal hand action recognition in which hand skeletons, RGB images, and depth maps are jointly used as visual input. We propose to fuse the multimodal features with customized long short-term memory (LSTM) units rather than simply concatenating them into a single feature vector. Extensive ablation studies demonstrate both the improvement from the third-order node information and the advantages of our multimodal fusion strategy. Our method markedly outperforms recent baselines on a public first-person hand action recognition dataset.
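To make the third-order node information concrete, the sketch below computes the relative rotation and translation between two bones that share a joint, which is the kind of Lie-group (SE(3)-style) relationship the abstract describes. This is a minimal illustration, not the paper's exact parameterization: the function name, the choice of axis-angle (Rodrigues) rotation between bone direction vectors, and the translation convention are all assumptions for demonstration.

```python
import numpy as np

def relative_bone_transform(j_a, j_b, j_c):
    """Geometric relation between two connected bones a->b and b->c.

    Illustrative sketch only: returns the rotation matrix R aligning the
    first bone's direction with the second's (Rodrigues' formula) and the
    relative translation t between the bone vectors. The paper's actual
    Lie-group features may be parameterized differently.
    """
    u = j_b - j_a                      # first bone vector
    v = j_c - j_b                      # neighboring bone vector
    u_n = u / np.linalg.norm(u)
    v_n = v / np.linalg.norm(v)
    axis = np.cross(u_n, v_n)
    s = np.linalg.norm(axis)           # sin of the angle between bones
    c = float(np.dot(u_n, v_n))        # cos of the angle between bones
    if s < 1e-8:
        # Parallel bones: identity rotation (the anti-parallel case,
        # c == -1, would need a dedicated 180-degree rotation).
        R = np.eye(3)
    else:
        k = axis / s                   # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + s * K + (1.0 - c) * (K @ K)  # Rodrigues' formula
    t = v - u                          # relative translation between bones
    return R, t
```

Applying such a transform to every pair of adjacent bones in the hand skeleton yields per-frame relative rotation/translation features that can be fed into the spatial–temporal graph convolution alongside the raw joint coordinates.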



Acknowledgements

Support from the China Postdoctoral Science Foundation (Grant No. 2019M661098) and the National Natural Science Foundation of China (Grant No. 61671103) is gratefully acknowledged.

Author information

Corresponding author: Hongyu Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, R., Wang, H. Graph convolutional networks and LSTM for first-person multimodal hand action recognition. Machine Vision and Applications 33, 84 (2022). https://doi.org/10.1007/s00138-022-01328-4
