A Dual-Task Deep Neural Network for Scene and Action Recognition Based on 3D SENet and 3D SEResNet

Authors:

Yuelei Xiao

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition

Pages 664 - 671

https://doi.org/10.1145/3573942.3574077

Published: 16 May 2023

Abstract

In action recognition, scene information can act as noise and interfere with the feature extraction stage. To address this problem, a dual-task deep neural network model for scene and action recognition is proposed. The model first uses a convolutional layer and a max pooling layer as shared layers to extract low-dimensional features, then uses 3D SEResNet for action recognition and 3D SENet for scene recognition, and finally outputs the result of each task separately. In addition, because existing public datasets do not associate actions with scenes, we build our own scene and action dataset (SAAD) for recognition. Experimental results show that our method outperforms the other methods compared on the SAAD dataset.
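The data flow described in the abstract (shared layers, then one branch per task) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 1x1x1 channel-mixing stem stands in for the shared convolutional layer, each branch is reduced to a single squeeze-and-excitation step, and all names, layer sizes, and class counts (`init_params`, `n_actions=5`, `n_scenes=3`, reduction ratio `r=4`) are assumptions chosen only for illustration.

```python
import numpy as np


def se_block_3d(x, w1, w2):
    """Squeeze-and-excitation on a (C, T, H, W) feature map."""
    # Squeeze: global average pool over time and space, one value per channel.
    z = x.mean(axis=(1, 2, 3))                    # (C,)
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid gives gates in (0, 1).
    h = np.maximum(w1 @ z, 0.0)                   # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))           # (C,)
    # Scale: reweight each channel by its gate.
    return x * s[:, None, None, None]


def se_residual_3d(x, w1, w2):
    """SE step plus identity shortcut (the branch convolutions are omitted here)."""
    return x + se_block_3d(x, w1, w2)


def shared_stem(video, w_stem):
    """Shared layers: 1x1x1 conv stand-in followed by 2x2 spatial max pooling."""
    feat = np.einsum("oc,cthw->othw", w_stem, video)
    c, t, h, w = feat.shape                       # H and W are assumed even
    return feat.reshape(c, t, h // 2, 2, w // 2, 2).max(axis=(3, 5))


def head(feat, w_cls):
    """Global average pooling followed by a linear classifier (logits)."""
    return w_cls @ feat.mean(axis=(1, 2, 3))


def dual_task_forward(video, p):
    """Run the shared stem once, then the action and scene branches."""
    shared = shared_stem(video, p["stem"])
    action = head(se_residual_3d(shared, p["a1"], p["a2"]), p["action_cls"])
    scene = head(se_block_3d(shared, p["s1"], p["s2"]), p["scene_cls"])
    return action, scene


def init_params(rng, c_in=3, c=8, r=4, n_actions=5, n_scenes=3):
    """Hypothetical sizes; the paper's actual configuration is not given here."""
    def g(*shape):
        return rng.standard_normal(shape) * 0.1
    return {
        "stem": g(c, c_in),
        "a1": g(c // r, c), "a2": g(c, c // r),
        "s1": g(c // r, c), "s2": g(c, c // r),
        "action_cls": g(n_actions, c),
        "scene_cls": g(n_scenes, c),
    }


rng = np.random.default_rng(0)
params = init_params(rng)
video = rng.standard_normal((3, 4, 8, 8))         # (channels, frames, height, width)
action_logits, scene_logits = dual_task_forward(video, params)
```

The point of the sketch is that the shared features are computed once and consumed by both branches, so the scene branch can absorb scene cues while the action branch keeps a residual path. A real model would use learned 3D convolutions in the stem and in each branch.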

Cited By

Li H, Zhang X, Li B, Liu Y, Xiao P. (2024). GDR-Net: Gene Content Prediction Network Based on Distribution Regression. Proceedings of the 2024 9th International Conference on Biomedical Imaging, Signal Processing, 117-122. Online publication date: 18-Oct-2024.
https://dl.acm.org/doi/10.1145/3707172.3707190

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
      2. Computer vision tasks
        Scene understanding
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Index terms have been assigned to the content through auto-classification.

Published In

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition

September 2022

1221 pages

ISBN: 9781450396899

DOI: 10.1145/3573942

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

AIPR 2022

AIPR 2022: 2022 5th International Conference on Artificial Intelligence and Pattern Recognition

September 23 - 25, 2022

Xiamen, China
