Abstract
Hand gesture segmentation is an important research topic in computer vision. Despite ongoing efforts, accurate gesture segmentation remains challenging owing to factors such as varied gesture morphology and complex backgrounds. To address these challenges, we propose a novel hand gesture segmentation approach that combines the strengths of Convolutional Neural Networks (CNNs) for local feature extraction and Transformer networks for global feature integration. Specifically, we design two feature fusion modules. The first employs an attention mechanism to learn how to fuse the features extracted by the CNN and the Transformer. The second uses group convolution together with activation functions to implement a gating mechanism, strengthening the response of crucial features while suppressing interference from weaker ones. Our method achieves mIoU scores of 93.53%, 97.25%, and 90.39% on the OUHANDS, HGR1, and EgoHands hand gesture datasets, respectively, outperforming state-of-the-art methods.
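To make the two fusion ideas in the abstract concrete, the following is a minimal PyTorch sketch: an attention-based module that learns per-branch weights for combining CNN and Transformer feature maps, and a gating module built from group convolution and a sigmoid activation. The module names, channel sizes, and exact formulations are illustrative assumptions, not the authors' implementation.

```python
# Sketch of attention-based fusion and convolutional gating, assuming
# hypothetical module names and channel layouts (not the paper's code).
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Fuse CNN (local) and Transformer (global) feature maps with learned
    per-channel attention weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2 * channels, kernel_size=1),
        )

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # Concatenate both branches, derive channel-attention weights, and
        # split them back into one weight map per branch.
        x = torch.cat([f_cnn, f_trans], dim=1)
        w = torch.sigmoid(self.fc(self.pool(x)))
        w_cnn, w_trans = torch.chunk(w, 2, dim=1)
        return w_cnn * f_cnn + w_trans * f_trans


class GatedFusion(nn.Module):
    """Gate the fused features with group convolution + sigmoid so that strong
    responses are emphasized and weak ones are suppressed."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)


if __name__ == "__main__":
    f_cnn = torch.randn(1, 64, 32, 32)    # local features from the CNN branch
    f_trans = torch.randn(1, 64, 32, 32)  # global features from the Transformer branch
    fused = AttentionFusion(64)(f_cnn, f_trans)
    out = GatedFusion(64)(fused)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```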







Data Availability
No datasets were generated or analysed during the current study.
Acknowledgements
We sincerely thank the editor and the reviewers for reviewing this manuscript. This work is partially supported by the Central Government Guided Local Funds for Science and Technology Development (No. 216Z0301G), the National Natural Science Foundation of China (No. 61379065), the Hebei Natural Science Foundation (No. F2023203012), the Science Research Project of Hebei Education Department (No. QN2024010), and the Innovation Capability Improvement Plan Project of Hebei Province (No. 22567626H).
Funding
Central Government Guided Local Funds for Science and Technology Development (216Z0301G), National Natural Science Foundation of China (61379065), Hebei Natural Science Foundation (F2023203012), Science Research Project of Hebei Education Department (QN2024010), Innovation Capability Improvement Plan Project of Hebei Province (22567626H).
Author information
Contributions
SW wrote the main manuscript text and the code. NY prepared the data and figures. ML edited the manuscript. QT validated the manuscript. SZ directed this study. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Yang, N., Liu, M. et al. Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks. SIViP 18, 5579–5588 (2024). https://doi.org/10.1007/s11760-024-03255-5