Research article · DOI: 10.1145/3595916.3626411

Efficient Hand Gesture Recognition using Multi-Task Multi-Modal Learning and Self-Distillation

Published: 01 January 2024

Abstract

In this paper, we propose a lightweight model for hand gesture recognition using an RGB camera. The proposed model recognizes first-person hand gestures from a single camera and achieves near-real-time performance on both high-end and low-end computing devices. The framework combines multi-task multi-modal learning with self-distillation to address the challenges of hand gesture recognition. We integrate an additional modality (depth) and a future-prediction mechanism to strengthen the model's ability to learn spatio-temporal information, and we employ self-distillation to compress the model, balancing accuracy against computational cost. Compared with the state-of-the-art method, our model improves accuracy by 0.88% and 3.52% on the EgoGesture and NVGesture datasets, respectively. In terms of computational efficiency, our model takes only 161 ms on average to recognize a gesture on a device with a low-end GPU (NVIDIA Jetson TX2), which is acceptable for interaction in XR applications.
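The self-distillation objective mentioned in the abstract is not spelled out on this page. As an illustrative sketch only (plain NumPy, not the authors' implementation; the temperature and weighting values below are assumptions), a standard soft-target distillation loss in the style of Hinton et al.'s knowledge distillation, as typically used for self-distillation between a deep "teacher" branch and a shallower "student" branch of the same network, looks like:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits, labels,
                           temperature=4.0, alpha=0.5):
    """Blend soft-target distillation with hard-label cross-entropy.

    The temperature**2 factor keeps the soft-target term's gradient
    magnitude comparable across temperatures.
    """
    # Soft cross-entropy against the softened teacher distribution.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    soft_loss = -(p_teacher * log_p_student).sum(axis=-1).mean() * temperature**2
    # Ordinary cross-entropy against the ground-truth labels.
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy example: one 3-class gesture prediction supervised by a deeper branch.
student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[2.5, 0.3, -1.2]])
loss = self_distillation_loss(student, teacher, labels=[0])
```

In self-distillation the "teacher" is not a separate network: shallower classifier heads of the same model are trained to match the deepest head, after which the deep head (or the whole network) can be truncated or compressed for deployment.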

Supplementary Material

MP4 File (Qualitative_Evaluation_and_Demonstration.mp4)
Qualitative Evaluation and Demonstration


    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN:9798400702051
    DOI:10.1145/3595916


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Convolutional neural network
    2. Gesture recognition
    3. Knowledge distillation
    4. Multi-modalities
    5. Multi-task learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited


    Conference

    MMAsia '23
    Sponsor:
    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%

