Abstract
With the development of deep convolutional neural networks, 2D CNNs are widely used in action detection. Although a 2D CNN extracts rich features from video frames, these features also contain redundant information. To address this problem, we propose a Residual Channel-Spatial Attention (RCSA) module that guides the network on what (object patterns) and where (spatially) to focus. Meanwhile, to effectively exploit the rich spatial and semantic features extracted by different layers of deep networks, we combine RCSA with a deep aggregation network to form the Deep Attention Aggregation Network. Experimental results on two datasets, J-HMDB and UCF-101, show that the proposed network achieves state-of-the-art performance on action detection.
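The full paper specifies RCSA precisely; as a rough NumPy sketch only, assuming a CBAM-style design (channel gating via a squeeze-excitation bottleneck MLP, spatial gating via channel-pooled maps) wrapped in a residual connection, the idea of "what and where to focus" might look like the following. All names (`rcsa`, `w1`, `w2`, `k`) and the 1x1 spatial combination (in place of a larger convolution) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Gate each channel ("what" to focus on). x: (C, H, W)."""
    s = x.mean(axis=(1, 2))                       # squeeze: global average pool -> (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))     # excitation: bottleneck MLP -> per-channel gate
    return x * a[:, None, None]

def spatial_attention(x, k):
    """Gate each spatial position ("where" to focus)."""
    m = np.stack([x.mean(axis=0), x.max(axis=0)])  # channel-pooled maps: (2, H, W)
    g = sigmoid((k[:, None, None] * m).sum(axis=0))  # weighted 1x1 combination -> (H, W) gate
    return x * g[None, :, :]

def rcsa(x, w1, w2, k):
    """Residual channel-spatial attention: identity path plus attended features."""
    return x + spatial_attention(channel_attention(x, w1, w2), k)

# Toy usage on a random feature map
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))          # C=4 channels, 5x5 spatial grid
w1 = rng.standard_normal((2, 4))            # bottleneck down-projection
w2 = rng.standard_normal((4, 2))            # up-projection back to C channels
k = rng.standard_normal(2)                  # spatial combination weights
y = rcsa(x, w1, w2, k)                      # same shape as x
```

The residual connection lets the module default to passing features through unchanged, so the attention branches only need to learn a correction, which is what makes stacking such modules inside a deep aggregation network stable.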
Y. He and X. Liu—Contributing author(s).
Acknowledgments
This work is supported by the National Key R&D Program of China under Grant 2020YFB1708500.
Ethics declarations
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
He, Y., Gan, M.G., Liu, X. (2022). What and Where to See: Deep Attention Aggregation Network for Action Detection. In: Liu, H., et al. Intelligent Robotics and Applications. ICIRA 2022. Lecture Notes in Computer Science, vol. 13455. Springer, Cham. https://doi.org/10.1007/978-3-031-13844-7_18
DOI: https://doi.org/10.1007/978-3-031-13844-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13843-0
Online ISBN: 978-3-031-13844-7
eBook Packages: Computer Science; Computer Science (R0)