Abstract
Skeleton-based hand gesture recognition (SHGR) is a challenging task due to the complex articulated topology of the hand. Previous works often learn hand characteristics from a single observation viewpoint, disregarding the contextual information hidden in multiple viewpoints. To resolve this issue, we propose a novel multi-view hierarchical aggregation network for SHGR. First, two-dimensional non-uniform spatial sampling, a novel strategy that forms extrinsic parameter distributions of virtual cameras, is presented to enumerate viewpoints from which to observe hand skeletons. We then apply coordinate transformations to generate multi-view hand skeletons and employ a multi-branch convolutional neural network to extract multi-view features. Furthermore, we design a novel hierarchical aggregation network, comprising a hierarchical attention architecture and global context modeling, to fuse the multi-view features for final classification. Experiments on three benchmark datasets demonstrate that our method is competitive with state-of-the-art approaches.
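To make the viewpoint-generation step concrete, the sketch below shows how a 3D hand-skeleton sequence could be re-observed from several virtual cameras: viewpoints are drawn non-uniformly (here, from a Gaussian around the frontal view), each viewpoint defines an extrinsic rotation, and the joint coordinates are transformed into that camera's frame. This is a minimal illustration under assumed conventions; the azimuth/elevation parameterization, the Gaussian sampling, the 22-joint hand model, and the helper names rotation_matrix and multi_view_skeletons are all hypothetical and not the paper's exact implementation.

```python
import numpy as np

def rotation_matrix(azimuth, elevation):
    """Extrinsic rotation of a virtual camera, built from an azimuth
    (rotation about the y-axis) and an elevation (rotation about the
    x-axis), both in radians."""
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    rot_y = np.array([[ ca, 0.0,  sa],
                      [0.0, 1.0, 0.0],
                      [-sa, 0.0,  ca]])
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,  ce, -se],
                      [0.0,  se,  ce]])
    return rot_x @ rot_y

def multi_view_skeletons(joints, viewpoints):
    """joints: (T, J, 3) array of 3D hand joints over T frames.
    viewpoints: iterable of (azimuth, elevation) pairs, one per virtual camera.
    Returns a (V, T, J, 3) array with one transformed sequence per viewpoint."""
    views = [joints @ rotation_matrix(az, el).T for az, el in viewpoints]
    return np.stack(views)

# Hypothetical non-uniform sampling: draw viewpoints from a Gaussian
# centred on the frontal view instead of spacing them evenly.
rng = np.random.default_rng(0)
viewpoints = np.stack([rng.normal(0.0, np.pi / 6, size=4),    # azimuths
                       rng.normal(0.0, np.pi / 12, size=4)],  # elevations
                      axis=1)
skeleton = rng.standard_normal((32, 22, 3))   # e.g. 32 frames, 22 hand joints
print(multi_view_skeletons(skeleton, viewpoints).shape)  # -> (4, 32, 22, 3)
```

Each transformed sequence would then feed one branch of the multi-branch CNN before the hierarchical aggregation stage fuses the per-view features.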
Acknowledgements
This work was supported by the Key Research and Development Program of Zhejiang Province under Grant 2022C01064.
Author information
Contributions
Shaochen Li: Conceptualization, Methodology, Writing—Original Draft. Zhenyu Liu: Methodology. Guifang Duan: Methodology, Writing—Review & Editing. Jianrong Tan: Writing—Review & Editing.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, S., Liu, Z., Duan, G. et al. MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition. SIViP 17, 2521–2529 (2023). https://doi.org/10.1007/s11760-022-02469-9