Research article · DOI: 10.1145/3595916.3626411

Efficient Hand Gesture Recognition using Multi-Task Multi-Modal Learning and Self-Distillation

Published: 01 January 2024

Abstract

In this paper, we propose a lightweight model for hand gesture recognition using an RGB camera. The proposed model recognizes first-person hand gestures from a single camera and achieves near-real-time performance on both high-end and low-end computing devices. The framework combines multi-task multi-modal learning with self-distillation to address the challenges of hand gesture recognition. We integrate an additional modality (depth) and a future-prediction mechanism to strengthen the model's ability to learn spatio-temporal information, and we employ self-distillation to compress the model, balancing accuracy against computational cost. Compared with the state-of-the-art method, our model improves accuracy by 0.88% and 3.52% on the EgoGesture and NVGesture datasets, respectively. In terms of computational efficiency, our model takes only 161 ms on average to recognize a gesture on a device with a low-end GPU (NVIDIA Jetson TX2), which is acceptable for interaction in XR applications.
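The self-distillation objective mentioned in the abstract is not spelled out on this page. As an illustrative sketch only (plain NumPy, not the authors' implementation; the temperature and weighting values below are assumptions), a standard soft-target distillation loss in the style of Hinton et al.'s knowledge distillation, as typically used for self-distillation between a deep "teacher" branch and a shallower "student" branch of the same network, looks like:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits, labels,
                           temperature=4.0, alpha=0.5):
    """Blend soft-target distillation with hard-label cross-entropy.

    The temperature**2 factor keeps the soft-target term's gradient
    magnitude comparable across temperatures.
    """
    # Soft cross-entropy against the softened teacher distribution.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    soft_loss = -(p_teacher * log_p_student).sum(axis=-1).mean() * temperature**2
    # Ordinary cross-entropy against the ground-truth labels.
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy example: one 3-class gesture prediction supervised by a deeper branch.
student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[2.5, 0.3, -1.2]])
loss = self_distillation_loss(student, teacher, labels=[0])
```

In self-distillation the "teacher" is not a separate network: shallower classifier heads of the same model are trained to match the deepest head, after which the deep head (or the whole network) can be truncated or compressed for deployment.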

Supplementary Material

MP4 File (Qualitative_Evaluation_and_Demonstration.mp4)
Qualitative Evaluation and Demonstration


    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN:9798400702051
    DOI:10.1145/3595916


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Convolutional neural network
    2. Gesture recognition
    3. Knowledge distillation
    4. Multi-modalities
    5. Multi-task learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited


    Conference

    MMAsia '23
    Sponsor:
    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%

