Abstract
Visual and inertial data are both important modalities for human action recognition and have a wide range of applications in virtual reality, human-computer interaction, action perception, and other fields. Much recent work has achieved significant results by combining visual and inertial sensor data with deep learning methods. Integrating multimodal information in this way makes a recognition system more robust and adaptable to different environments and action scenarios. However, these approaches still suffer from difficulties in data fusion and from high computational demands. In this article, an attention-based method for continuous human action recognition using visual and inertial sensors is proposed. Specifically, a deep visual-inertial attention network (VIANet) architecture is designed that integrates spatial, channel, and temporal attention into a visual 3D CNN, integrates a temporal attention mechanism into an inertial 2D CNN, and fuses the two streams at the decision level. Experiments on the public C-MHAD dataset show that the proposed VIANet outperforms previous baselines in multimodal human action recognition.
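The abstract does not detail the architecture, so the following minimal PyTorch sketch illustrates one plausible reading of the described two-stream design. All module definitions, layer widths, attention formulations, input shapes (a 16-frame RGB clip and a 6-axis accelerometer/gyroscope window), and the equal-weight score averaging are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-stream attention design described above,
# NOT the authors' released code; all shapes and layer sizes are assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate over channels of a 3D feature map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):                                  # x: (B, C, T, H, W)
        w = self.fc(x.mean(dim=(2, 3, 4)))                 # per-channel weight (B, C)
        return x * w.view(x.size(0), -1, 1, 1, 1)

class SpatialAttention(nn.Module):
    """Sigmoid gate over spatial positions from pooled channel statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))
    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class TemporalAttention3d(nn.Module):
    """Softmax weights over the temporal axis of a 3D feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)
    def forward(self, x):
        w = torch.softmax(self.score(x).mean(dim=(3, 4)), dim=-1)  # (B, 1, T)
        return x * w.unsqueeze(-1).unsqueeze(-1)

class VisualStream(nn.Module):
    """Small 3D CNN with channel, spatial, and temporal attention."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.ca, self.sa, self.ta = ChannelAttention(64), SpatialAttention(), TemporalAttention3d(64)
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, num_classes))
    def forward(self, clip):                               # clip: (B, 3, T, H, W)
        return self.head(self.ta(self.sa(self.ca(self.backbone(clip)))))

class InertialStream(nn.Module):
    """Small 2D CNN over an (axes x time) signal image with temporal attention."""
    def __init__(self, num_classes, axes=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.score = nn.Conv2d(64, 1, kernel_size=1)       # temporal attention scores
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
    def forward(self, sig):                                # sig: (B, 1, axes, T)
        f = self.backbone(sig)
        w = torch.softmax(self.score(f).mean(dim=2, keepdim=True), dim=-1)
        return self.head(f * w)                            # weight time steps, then classify

class VIANet(nn.Module):
    """Decision-level fusion: average the per-stream class score vectors."""
    def __init__(self, num_classes):
        super().__init__()
        self.visual, self.inertial = VisualStream(num_classes), InertialStream(num_classes)
    def forward(self, clip, sig):
        p_v = torch.softmax(self.visual(clip), dim=-1)
        p_i = torch.softmax(self.inertial(sig), dim=-1)
        return 0.5 * (p_v + p_i)                           # assumed equal-weight averaging

net = VIANet(num_classes=5)
fused = net(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 1, 6, 128))
print(fused.shape)                                         # torch.Size([2, 5])
```

A decision-level scheme of this kind keeps the two streams fully independent up to the final scores, so either sensor branch can be retrained, replaced, or dropped at inference time without touching the other.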
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Hua, L., Huang, Y., Liu, C., Zhu, T. (2024). Utilizing Attention for Continuous Human Action Recognition Based on Multimodal Fusion of Visual and Inertial. In: Shi, Z., Torresen, J., Yang, S. (eds) Intelligent Information Processing XII. IIP 2024. IFIP Advances in Information and Communication Technology, vol 704. Springer, Cham. https://doi.org/10.1007/978-3-031-57919-6_5
DOI: https://doi.org/10.1007/978-3-031-57919-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57918-9
Online ISBN: 978-3-031-57919-6
eBook Packages: Computer Science, Computer Science (R0)