MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Skeleton-based hand gesture recognition (SHGR) is a challenging task due to the complex articulated topology of the hand. Previous works often learn hand characteristics from a single observation viewpoint, disregarding the rich contextual information available from multiple viewpoints. To resolve this issue, we propose a novel multi-view hierarchical aggregation network for SHGR. First, two-dimensional non-uniform spatial sampling, a novel strategy for forming the extrinsic parameter distributions of virtual cameras, is presented to enumerate viewpoints from which to observe hand skeletons. We then apply coordinate transformations to generate multi-view hand skeletons and employ a multi-branch convolutional neural network to extract multi-view features. Finally, we introduce a novel hierarchical aggregation network, comprising a hierarchical attention architecture and global context modeling, to fuse the multi-view features for final classification. Experiments on three benchmark datasets demonstrate that our method is competitive with state-of-the-art methods.
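The multi-view generation step described in the abstract — sampling virtual-camera viewpoints and rotating the skeleton into each view — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the Gaussian azimuth/elevation sampling, the view count, and all function names here are assumptions standing in for the paper's two-dimensional non-uniform spatial sampling.

```python
import numpy as np

def sample_viewpoints(n_views, rng=None):
    """Sample (azimuth, elevation) pairs for virtual cameras.
    Non-uniform sampling is illustrated here with Gaussians centered on
    the frontal view; the paper's actual distribution may differ."""
    rng = np.random.default_rng(rng)
    azim = rng.normal(0.0, np.pi / 4, n_views)  # azimuth angles (rad)
    elev = rng.normal(0.0, np.pi / 6, n_views)  # elevation angles (rad)
    return azim, elev

def rotation_matrix(azim, elev):
    """Extrinsic rotation: azimuth about the y-axis, then elevation about x."""
    ca, sa = np.cos(azim), np.sin(azim)
    ce, se = np.cos(elev), np.sin(elev)
    Ry = np.array([[ca, 0.0, sa], [0.0, 1.0, 0.0], [-sa, 0.0, ca]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ce, -se], [0.0, se, ce]])
    return Rx @ Ry

def multi_view_skeletons(skeleton, n_views=8, rng=0):
    """skeleton: (T, J, 3) array of J joint coordinates over T frames.
    Returns (n_views, T, J, 3): the same sequence observed from each
    sampled viewpoint, ready for a multi-branch network (one branch per view)."""
    azim, elev = sample_viewpoints(n_views, rng)
    views = [skeleton @ rotation_matrix(a, e).T for a, e in zip(azim, elev)]
    return np.stack(views)
```

Each rotated copy is then fed to its own CNN branch; because the transformation is a pure rotation, joint-to-joint distances are preserved and only the observation viewpoint changes.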



Acknowledgements

This work was supported by the Key Research and Development Program of Zhejiang Province under Grant 2022C01064.

Author information


Contributions

Shaochen Li: Conceptualization, Methodology, Writing - Original Draft. Zhenyu Liu: Methodology. Guifang Duan: Methodology, Writing - Review & Editing. Jianrong Tan: Writing - Review & Editing.

Corresponding author

Correspondence to Guifang Duan.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, S., Liu, Z., Duan, G. et al. MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition. SIViP 17, 2521–2529 (2023). https://doi.org/10.1007/s11760-022-02469-9
