DOI: 10.1145/3686397.3686410

Keypoints-based multimodal network for robust human action recognition

Published: 25 November 2024

Abstract

Skeleton-based action recognition has garnered widespread attention. However, owing to the inherent limitations of skeleton sequences, existing methods often confuse actions with high inter-class similarity and struggle to achieve viewpoint invariance. Multimodal action recognition addresses these issues by exploiting the complementary information across modalities, significantly improving on unimodal models; effectively integrating these modalities, however, remains an open problem. In this work, we first propose a keypoints-based multimodal data fusion method that constructs images capturing the crucial spatiotemporal characteristics of actions and their variations. Building on this, we introduce the keypoints-based multimodal fusion network (KBMN), which learns action features jointly from skeleton, RGB, and depth data. Extensive experiments on two large-scale datasets demonstrate that KBMN performs robustly in both unimodal and multimodal action recognition. Used as an auxiliary model, KBMN also improves the recognition accuracy of various skeleton-based baselines.
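The abstract does not detail how the keypoints-based images are constructed. As an illustrative sketch only (the function name, patch size, grid layout, and temporal averaging are assumptions, not the authors' method), keypoint-guided fusion of skeleton and appearance data can be pictured as cropping RGB or depth patches around each skeleton joint and tiling them into a single CNN-friendly image:

```python
import numpy as np

def joints_to_patch_image(frames, joints, patch=8, grid=(5, 5)):
    """Hypothetical sketch: summarize a clip by the appearance around its
    keypoints. Crops a small patch around each 2-D joint in each frame,
    tiles the patches into one canvas per frame, then averages over time
    so spatial detail near joints and its temporal change land in one image.

    frames: (T, H, W, C) uint8 video clip (RGB or single-channel depth)
    joints: (T, V, 2) pixel coordinates (x, y) of V joints per frame
    """
    T, H, W, C = frames.shape
    V = joints.shape[1]
    half = patch // 2
    rows, cols = grid
    assert rows * cols >= V, "grid must have a cell for every joint"

    canvas = np.zeros((T, rows * patch, cols * patch, C), dtype=frames.dtype)
    for t in range(T):
        for v in range(V):
            # clamp the joint so the patch stays inside the frame
            x = np.clip(int(joints[t, v, 0]), half, W - half)
            y = np.clip(int(joints[t, v, 1]), half, H - half)
            crop = frames[t, y - half:y + half, x - half:x + half]
            r, c = divmod(v, cols)
            canvas[t, r * patch:(r + 1) * patch,
                      c * patch:(c + 1) * patch] = crop
    # fold the time axis by averaging -- one simple choice among many
    return canvas.mean(axis=0).astype(frames.dtype)
```

The resulting image can then be fed to any 2-D backbone; richer variants might stack per-frame canvases as channels instead of averaging, or weight patches by joint confidence.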



    Published In

    ICISDM '24: Proceedings of the 2024 8th International Conference on Information System and Data Mining
    June 2024
    157 pages
    ISBN:9798400717345
    DOI:10.1145/3686397

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Human action recognition
    2. Keypoints-based data fusion
    3. Multimodal fusion network

    Qualifiers

    • Research-article

    Conference

    ICISDM 2024
