DOI: 10.1145/3573942.3574082
Research article

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition

Published: 16 May 2023

Abstract

For action recognition, two-stream networks that combine RGB and optical-flow streams have been widely used and achieve high recognition accuracy. However, optical-flow computation is time-consuming and requires a large amount of storage, which makes recognition inefficient. To alleviate this problem, we propose an Adaptive Multi-Scale Residual (AMSR) module and a Long Short-Term Motion Squeeze (LSMS) module, both inserted into a 2D convolutional neural network to improve action-recognition accuracy while balancing accuracy and speed. The AMSR module adaptively fuses multi-scale feature maps to exploit both the semantic information in deep feature maps and the detailed information in shallow feature maps. The LSMS module is a learnable, lightweight motion-feature extractor that learns long-term motion features across both adjacent and non-adjacent frames, replacing traditional optical flow and improving recognition accuracy. Experimental results on the UCF-101 and HMDB-51 datasets demonstrate that the proposed method achieves performance competitive with state-of-the-art methods at only a small increase in parameters and computational cost.
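The abstract is the only technical description available here; the paper's actual AMSR and LSMS designs are not reproduced on this page. As a rough illustration only, the NumPy sketch below shows the general pattern the abstract describes: resizing feature maps from different network depths to a common resolution, fusing them with softmax-normalized weights (learnable in the paper, fixed here) plus a residual connection, and deriving a cheap motion cue from temporal feature differences. Every function name, shape, and the nearest-neighbor upsampling are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def upsample_nearest(fm, target_hw):
    # fm: (C, H, W); nearest-neighbor resize by integer repetition
    C, H, W = fm.shape
    th, tw = target_hw
    return fm.repeat(th // H, axis=1).repeat(tw // W, axis=2)

def adaptive_multiscale_fuse(feature_maps, logits):
    """Fuse feature maps from different depths into one map.

    feature_maps: list of (C, H_i, W_i) arrays, finest first;
    logits: per-scale scores (learnable in the paper, given here).
    """
    target_hw = feature_maps[0].shape[1:]            # finest resolution
    weights = softmax(np.asarray(logits, dtype=float))
    resized = [upsample_nearest(fm, target_hw) for fm in feature_maps]
    fused = sum(w * fm for w, fm in zip(weights, resized))
    return feature_maps[0] + fused                   # residual connection

def temporal_diff_motion(feats):
    # feats: (T, C, H, W); a lightweight motion cue from
    # differences between consecutive frames' features
    return feats[1:] - feats[:-1]

# toy example: two scales (4x4 shallow, 2x2 deep), 3 channels
shallow = np.ones((3, 4, 4))
deep = 2 * np.ones((3, 2, 2))
out = adaptive_multiscale_fuse([shallow, deep], logits=[0.0, 0.0])
print(out.shape)  # (3, 4, 4)
```

In this toy case the equal logits give each scale weight 0.5, so the fused map is 1.5 everywhere and the residual sum is 2.5; the fused output keeps the finest spatial resolution, consistent with the abstract's goal of combining deep semantics with shallow detail.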



Published In

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
September 2022
1221 pages
ISBN: 9781450396899
DOI: 10.1145/3573942

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Action recognition
    2. Convolutional neural network
    3. Motion feature extractor
    4. Multi-scale residual

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022

