
Beyond Vision: A Semantic Reasoning Enhanced Model for Gesture Recognition with Improved Spatiotemporal Capacity

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13536)

Abstract

Gesture recognition is an important and practical problem owing to its great application potential. Although recent works have made great progress in this field, three non-negligible problems remain: 1) existing works lack efficient temporal modeling ability; 2) existing works lack effective spatial attention capacity; 3) most works focus only on visual information, without considering the semantic relationships between different classes. To tackle the first problem, we propose a Long and Short-term Temporal Shift Module (LS-TSM). It extends the original TSM by expanding the step size of the shift operation, so that long-term and short-term temporal information are modeled simultaneously. For the second problem, we aim to focus on the spatial areas where the change of the hand mainly occurs. We therefore propose a Spatial Attention Module (SAM), which uses the RGB difference between frames to compute a spatial attention mask that assigns different weights to different spatial positions. For the last problem, we propose a Label Relation Module (LRM) that takes full advantage of the relationships among classes based on the semantic information of their labels. With the proposed modules, our work achieves state-of-the-art performance on two commonly used gesture datasets, the EgoGesture and NVGesture datasets. Extensive experiments demonstrate the effectiveness of the proposed modules.
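To make the LS-TSM idea concrete: the original TSM shifts a fraction of the feature channels by one frame along the temporal axis, and LS-TSM additionally shifts some channels with a larger step. The following is a minimal PyTorch sketch of such a two-step shift; the function name, channel proportions, and step sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a long/short-term temporal shift over a
# (N, T, C, H, W) feature tensor. The channel split and the step sizes
# are assumptions for demonstration, not the paper's reported settings.
import torch

def ls_temporal_shift(x: torch.Tensor, short_step: int = 1,
                      long_step: int = 2, fold_div: int = 8) -> torch.Tensor:
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    # Short-term shift (as in the original TSM): move by one frame.
    out[:, :-short_step, :fold] = x[:, short_step:, :fold]                  # shift left
    out[:, short_step:, fold:2 * fold] = x[:, :-short_step, fold:2 * fold]  # shift right
    # Long-term shift: a larger step reaches temporally distant frames.
    out[:, :-long_step, 2 * fold:3 * fold] = x[:, long_step:, 2 * fold:3 * fold]
    out[:, long_step:, 3 * fold:4 * fold] = x[:, :-long_step, 3 * fold:4 * fold]
    # The remaining channels pass through unshifted.
    out[:, :, 4 * fold:] = x[:, :, 4 * fold:]
    return out
```

As with the original TSM, such a shift would typically be inserted before the convolutions of each residual block of a 2D-CNN backbone, so temporal mixing comes almost for free.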
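Similarly, the SAM described above derives an attention mask from the RGB difference between adjacent frames. The sketch below shows one plausible realization; the single-convolution mask predictor and the residual re-weighting are assumptions made for illustration, not the paper's exact architecture.

```python
# Illustrative spatial attention from inter-frame RGB differences.
# The mask predictor (one 7x7 convolution) and the residual re-weighting
# are assumptions; the paper's module may differ.
import torch
import torch.nn as nn

class SpatialAttentionSketch(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, T, 3, H, W) RGB clip.
        diff = frames[:, 1:] - frames[:, :-1]        # inter-frame RGB difference
        mag = diff.abs().mean(dim=2, keepdim=True)   # (N, T-1, 1, H, W) motion magnitude
        n, t, c, h, w = mag.shape
        mask = torch.sigmoid(self.conv(mag.reshape(n * t, c, h, w)))
        mask = mask.reshape(n, t, 1, h, w)           # per-position attention in [0, 1]
        # Re-weight frames residually (the first frame, which has no
        # preceding difference, is dropped here for simplicity).
        return frames[:, 1:] * (1.0 + mask)
```

The residual form `x * (1 + mask)` keeps the original signal intact and lets the mask only amplify regions where the hand moves, which tends to be more stable to train than multiplying by the mask alone.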
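Finally, the LRM relies on the semantic relationships among class labels. The abstract does not spell out the exact mechanism, so the following sketch shows just one way label-name embeddings (e.g. from a pretrained text encoder) could inform training: converting hard labels into semantically aware soft targets. The function name, temperature, and soft-target formulation are all assumptions.

```python
# Illustrative use of label semantics: soften one-hot targets using the
# cosine similarity of class-name embeddings. The formulation and the
# temperature are assumptions, not the paper's actual LRM.
import torch
import torch.nn.functional as F

def semantic_soft_targets(label_emb: torch.Tensor, targets: torch.Tensor,
                          temperature: float = 10.0) -> torch.Tensor:
    # label_emb: (num_classes, D) embeddings of the class names,
    # e.g. from a pretrained text encoder; targets: (N,) class indices.
    emb = F.normalize(label_emb, dim=1)
    sim = emb @ emb.t()                          # (C, C) cosine similarities
    soft = F.softmax(sim * temperature, dim=1)   # each row: a class and its semantic neighbours
    return soft[targets]                         # (N, C) soft targets, e.g. for a KL-divergence loss
```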

This work was partly supported by the National Natural Science Foundation of China (61906155, U19B2037), the Young Talent Fund of Association for Science and Technology in Shaanxi, China (20220117), and the National Key R&D Program of China (2020AAA0106900).



Author information

Corresponding author

Correspondence to Congqi Cao.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Y., Cao, C., Zhang, Y. (2022). Beyond Vision: A Semantic Reasoning Enhanced Model for Gesture Recognition with Improved Spatiotemporal Capacity. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_33


  • DOI: https://doi.org/10.1007/978-3-031-18913-5_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science, Computer Science (R0)
