Abstract
With the development of deep convolutional neural networks, 2D CNNs are widely used in action detection. Although a 2D CNN extracts rich features from video frames, these features also contain redundant information. To address this problem, we propose a Residual Channel-Spatial Attention (RCSA) module that guides the network on what (object patterns) and where (spatially) to focus. Meanwhile, to effectively exploit the rich spatial and semantic features extracted by different layers of deep networks, we combine RCSA with a deep aggregation network to form the Deep Attention Aggregation Network. Experimental results on two datasets, J-HMDB and UCF-101, show that the proposed network achieves state-of-the-art performance on action detection.
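The full paper specifies RCSA precisely; as a rough NumPy sketch only, assuming a CBAM-style design (channel gating via a squeeze-excitation bottleneck MLP, spatial gating via channel-pooled maps) wrapped in a residual connection, the idea of "what and where to focus" might look like the following. All names (`rcsa`, `w1`, `w2`, `k`) and the 1x1 spatial combination (in place of a larger convolution) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Gate each channel ("what" to focus on). x: (C, H, W)."""
    s = x.mean(axis=(1, 2))                       # squeeze: global average pool -> (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))     # excitation: bottleneck MLP -> per-channel gate
    return x * a[:, None, None]

def spatial_attention(x, k):
    """Gate each spatial position ("where" to focus)."""
    m = np.stack([x.mean(axis=0), x.max(axis=0)])  # channel-pooled maps: (2, H, W)
    g = sigmoid((k[:, None, None] * m).sum(axis=0))  # weighted 1x1 combination -> (H, W) gate
    return x * g[None, :, :]

def rcsa(x, w1, w2, k):
    """Residual channel-spatial attention: identity path plus attended features."""
    return x + spatial_attention(channel_attention(x, w1, w2), k)

# Toy usage on a random feature map
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))          # C=4 channels, 5x5 spatial grid
w1 = rng.standard_normal((2, 4))            # bottleneck down-projection
w2 = rng.standard_normal((4, 2))            # up-projection back to C channels
k = rng.standard_normal(2)                  # spatial combination weights
y = rcsa(x, w1, w2, k)                      # same shape as x
```

The residual connection lets the module default to passing features through unchanged, so the attention branches only need to learn a correction, which is what makes stacking such modules inside a deep aggregation network stable.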
Y. He and X. Liu—Contributing author(s).
Acknowledgments
This work is supported by the National Key R&D Program of China under Grant 2020YFB1708500.
Ethics declarations
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
He, Y., Gan, M.G., Liu, X. (2022). What and Where to See: Deep Attention Aggregation Network for Action Detection. In: Liu, H., et al. Intelligent Robotics and Applications. ICIRA 2022. Lecture Notes in Computer Science, vol. 13455. Springer, Cham. https://doi.org/10.1007/978-3-031-13844-7_18
DOI: https://doi.org/10.1007/978-3-031-13844-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13843-0
Online ISBN: 978-3-031-13844-7
eBook Packages: Computer Science; Computer Science (R0)