Abstract
As the devices around us become more intelligent, new ways of interacting with them are sought to improve user convenience and comfort. Although gesture-controlled systems have existed for some time, they typically rely on specialized imaging hardware, demand excessive computing resources, or lack the accuracy needed to be a viable alternative. This work proposes a reliable method for recognizing gestures. The model correctly classifies keyboard-typing hand gestures from activity captured by an ordinary camera. Two baseline models are first developed: one classifies the video data directly, and the other classifies time-series sequences of skeleton keypoints extracted from the video. The two models use different classification strategies and are built on lightweight architectures. These baselines are then integrated into a single multimodal model with two inputs, video and time series, to improve accuracy, and the baselines' performance is compared against it. Because the multimodal classifier is built from the initial models, it naturally inherits the strengths of both baseline architectures and achieves a test accuracy of 100%, compared with 85% for the video baseline and 75% for the time-series baseline.
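The two-branch fusion described in the abstract can be sketched as follows. This is a minimal illustrative model, not the authors' published architecture: the layer sizes, keypoint count, and fusion-by-concatenation design are assumptions chosen to show the pattern of a lightweight 3D CNN over video frames combined with an LSTM over skeleton sequences.

```python
import torch
import torch.nn as nn

class MultiModalGestureNet(nn.Module):
    """Illustrative two-branch classifier: a lightweight 3D CNN for video
    clips and an LSTM for skeleton keypoint sequences, fused by
    concatenating the two feature vectors before a linear head."""

    def __init__(self, num_classes=10, num_keypoints=21):
        super().__init__()
        # Video branch; input shape (batch, channels, frames, height, width)
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),  # -> (batch, 32)
        )
        # Skeleton branch; input shape (batch, time, 2 * num_keypoints)
        # with (x, y) coordinates per keypoint at each time step
        self.skeleton_branch = nn.LSTM(
            input_size=2 * num_keypoints, hidden_size=32, batch_first=True
        )
        # Fusion head over the concatenated branch features
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, video, skeleton):
        v = self.video_branch(video)
        _, (h, _) = self.skeleton_branch(skeleton)
        s = h[-1]  # last-layer final hidden state -> (batch, 32)
        return self.head(torch.cat([v, s], dim=1))

model = MultiModalGestureNet(num_classes=10)
video = torch.randn(2, 3, 8, 32, 32)  # 2 clips, 8 RGB frames of 32x32
skeleton = torch.randn(2, 8, 42)      # 2 sequences, 8 steps, 21 (x, y) points
logits = model(video, skeleton)
print(logits.shape)  # torch.Size([2, 10])
```

Late fusion by concatenation is only one way to combine the modalities; the point of the sketch is that each baseline branch survives intact inside the joint model, which is why the combined classifier can inherit the strengths of both.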
Acknowledgments
This work was supported in part by Liaoning Province Applied Basic Research Program: Human-machine Fusion Intelligent Modeling and Collaborative Optimization Driven by Data and Knowledge under Grant 2023JH2/101300184.
© 2024 IFIP International Federation for Information Processing
Cite this paper
Fulsunder, S., Umar, S., Taherkhani, A., Liu, C., Yang, S. (2024). Hand Gesture Recognition Using a Multi-modal Deep Neural Network. In: Shi, Z., Torresen, J., Yang, S. (eds) Intelligent Information Processing XII. IIP 2024. IFIP Advances in Information and Communication Technology, vol 704. Springer, Cham. https://doi.org/10.1007/978-3-031-57919-6_14
Print ISBN: 978-3-031-57918-9
Online ISBN: 978-3-031-57919-6