Abstract
Graph convolutional networks (GCNs) have been successfully applied to skeleton-based human action recognition. Both human skeletons and hand skeletons are composed of open-loop kinematic chains, each consisting of rigid links (corresponding to bones) and revolute pairs (corresponding to joints). Despite this similarity, no skeleton-based hand action recognition method has represented hand skeletons using GCNs. We first evaluate the effectiveness of conventional spatial–temporal GCNs for skeleton-based hand action recognition. We then improve these GCNs by incorporating third-order node information, i.e., the geometric relationships between neighboring connected bones in a hand skeleton, described in a Lie group as relative translations and rotations. Finally, we study first-person multimodal hand action recognition in which hand skeletons, RGB images, and depth maps are jointly used as visual input. We fuse the multimodal features with customized long short-term memory (LSTM) units rather than simply concatenating them into a single feature vector. Extensive ablation studies demonstrate the improvements gained from the third-order node information and the advantages of our multimodal fusion strategy. Our method markedly outperforms recent baselines on a public first-person hand action recognition dataset.
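As an illustration of the kind of geometric relationship the abstract refers to, the relative rotation and translation between two connected bones can be computed directly from joint positions. The following is a simplified NumPy sketch of such an SE(3)-style bone-pair descriptor; it is our own illustration under an assumed parent-to-child joint ordering, not the authors' exact formulation, and the names `bone_direction` and `bone_pair_descriptor` are hypothetical:

```python
import numpy as np

def bone_direction(parent, child):
    """Unit vector along a bone, from the parent joint to the child joint."""
    v = child - parent
    return v / (np.linalg.norm(v) + 1e-12)

def relative_rotation(u, v):
    """Rotation matrix taking unit vector u onto unit vector v (Rodrigues' formula)."""
    c = float(np.clip(np.dot(u, v), -1.0, 1.0))   # cos of the angle between bones
    axis = np.cross(u, v)
    s = np.linalg.norm(axis)                      # sin of the angle between bones
    if s < 1e-9:
        if c > 0:                                 # parallel bones: no rotation needed
            return np.eye(3)
        # antiparallel bones: rotate by pi about any axis perpendicular to u
        n = np.cross(u, [1.0, 0.0, 0.0])
        if np.linalg.norm(n) < 1e-9:
            n = np.cross(u, [0.0, 1.0, 0.0])
        n /= np.linalg.norm(n)
        return 2.0 * np.outer(n, n) - np.eye(3)
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])            # skew-symmetric matrix of the axis
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def bone_pair_descriptor(j0, j1, j2):
    """Geometric relationship between two connected bones (j0->j1 and j1->j2):
    the relative rotation between their directions plus the relative translation
    of the distal bone, i.e. an element of the rigid-motion group SE(3)."""
    R = relative_rotation(bone_direction(j0, j1), bone_direction(j1, j2))
    t = j2 - j1
    return R, t
```

Descriptors of this kind, computed for every pair of adjacent bones along each finger chain, are what a spatial–temporal GCN can consume as additional third-order node features alongside the raw joint coordinates.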
Acknowledgements
Support from the China Postdoctoral Science Foundation (Grant No. 2019M661098) and the National Natural Science Foundation of China (Grant No. 61671103) is gratefully acknowledged.
About this article
Cite this article
Li, R., Wang, H. Graph convolutional networks and LSTM for first-person multimodal hand action recognition. Machine Vision and Applications 33, 84 (2022). https://doi.org/10.1007/s00138-022-01328-4