
Multi-action Prediction Using an Iterative Masking Approach with Class Activation Mapping

  • Conference paper
Technologies and Applications of Artificial Intelligence (TAAI 2023)

Abstract

While prediction techniques for multiple objects in images have become increasingly sophisticated, predicting multiple actions in videos remains challenging. Because most video training datasets label only a single action per clip, a three-dimensional convolutional neural network (3D CNN) trained on them is limited to predicting a single action. To overcome this limitation, we propose an iterative method that combines a 3D CNN model with class activation mapping (CAM) to achieve multi-object and multi-action prediction in videos. In each iteration, the action class with the highest score is output first. The selected CAM method is then applied to detect the primary action region. After this region is masked in the input video, the masked video is fed back into the CNN model to predict actions occurring in other regions. In the experimental section, we used a video dataset of a single mouse with a single action label to train a 3D CNN model and tested prediction performance on a separate set of composite videos of multiple mice performing the same or different actions. The results demonstrate that the proposed method combined with Grad-CAM can correctly predict the individual actions of multiple mice in the videos. Moreover, we analyzed a few human action videos to illustrate the feasibility of this approach.
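The abstract describes a predict-localize-mask loop: classify the clip, localize the top-scoring action with a CAM method, mask that region out, and re-classify. The following is a minimal sketch of that loop in PyTorch, not the authors' implementation; the function names, the choice of target convolutional layer, the fixed iteration count, and the masking threshold are all illustrative assumptions.

```python
# Hypothetical sketch of the iterative masking loop from the abstract.
# All names (model, layer, thresholds) are illustrative assumptions.
import torch
import torch.nn.functional as F


def compute_grad_cam(model, clip, target_class, layer):
    """Minimal 3D Grad-CAM: weight the target layer's feature maps by the
    spatio-temporally averaged gradients of the class score, then ReLU,
    normalize, and upsample to the input resolution."""
    store = {}
    h1 = layer.register_forward_hook(
        lambda _m, _i, out: store.update(act=out))
    h2 = layer.register_full_backward_hook(
        lambda _m, _gi, go: store.update(grad=go[0]))
    try:
        scores = model(clip)                # forward pass records activations
        model.zero_grad()
        scores[0, target_class].backward()  # backward pass records gradients
    finally:
        h1.remove()
        h2.remove()

    act = store["act"][0]                 # (K, T', H', W')
    grad = store["grad"][0]               # (K, T', H', W')
    weights = grad.mean(dim=(1, 2, 3))    # one importance weight per map
    cam = torch.relu((weights[:, None, None, None] * act).sum(dim=0))
    cam = cam / (cam.max() + 1e-8)        # scale to [0, 1]
    return F.interpolate(cam[None, None], size=clip.shape[2:],
                         mode="trilinear", align_corners=False)[0, 0]


def iterative_multi_action_prediction(model, layer, clip,
                                      num_iters=3, cam_threshold=0.5):
    """Predict several actions in one clip by repeatedly masking the
    region responsible for the top-scoring action and re-classifying.

    clip: tensor of shape (1, C, T, H, W) holding a single video clip.
    """
    predictions = []
    masked = clip.clone()
    for _ in range(num_iters):
        with torch.no_grad():
            scores = model(masked)        # (1, num_classes)
        top_class = int(scores.argmax(dim=1))
        predictions.append(top_class)

        # Localize the primary action region for the predicted class.
        cam = compute_grad_cam(model, masked, top_class, layer)

        # Zero out high-activation voxels so the next pass sees only the
        # remaining, still-unexplained motion.
        keep = (cam < cam_threshold).float()  # (T, H, W)
        masked = masked * keep[None, None]    # broadcast over batch/channels
    return predictions
```

A fixed `num_iters` is a simplification; a natural alternative stopping rule, which the abstract leaves unspecified, would be to break out of the loop once the top class score falls below a confidence threshold.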



Acknowledgment

The work was partially supported by the National Science and Technology Council (MOST 111-2221-E-001-018-MY2), Taiwan, and the Institute of Information Science, Academia Sinica, Taiwan.

Author information

Correspondence to Arthur Chun-Chieh Shih.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wu, CY., Tsay, YW., Shih, A.CC. (2024). Multi-action Prediction Using an Iterative Masking Approach with Class Activation Mapping. In: Lee, CY., Lin, CL., Chang, HT. (eds) Technologies and Applications of Artificial Intelligence. TAAI 2023. Communications in Computer and Information Science, vol 2074. Springer, Singapore. https://doi.org/10.1007/978-981-97-1711-8_22


  • DOI: https://doi.org/10.1007/978-981-97-1711-8_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-1710-1

  • Online ISBN: 978-981-97-1711-8

  • eBook Packages: Computer Science, Computer Science (R0)
