DOI: 10.1145/3573942.3574077
research-article

A Dual-Task Deep Neural Network for Scene and Action Recognition Based on 3D SENet and 3D SEResNet

Published: 16 May 2023

Abstract

Scene information can act as noise and interfere with the feature extraction stage of action recognition. To address this problem, a dual-task deep neural network model for scene and action recognition is proposed. The model first uses a convolutional layer and a max pooling layer as shared layers to extract low-dimensional features, then uses 3D SEResNet for action recognition and 3D SENet for scene recognition, and finally outputs the result of each task. In addition, because existing public datasets are not associated with scene information, we built a scene and action dataset (SAAD) for recognition. Experimental results show that our method outperforms other methods on the SAAD dataset.
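The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, so the channel count, reduction ratio, random weights, and all function names below are hypothetical. Convolutions are omitted, and the shared feature map is stored as a list of channels, each flattened over time, height, and width. The sketch only shows how squeeze-and-excitation gating recalibrates the channels, and how a residual (SEResNet-style) branch differs from a plain (SENet-style) branch when both read the same shared features.

```python
import math
import random


def squeeze(features):
    """Global average pooling: one scalar per channel over all T*H*W voxels."""
    return [sum(channel) / len(channel) for channel in features]


def dense(vec, weights):
    """Fully connected layer without bias; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def excite(z, w_reduce, w_expand):
    """Bottleneck MLP: reduce -> ReLU -> expand -> sigmoid gate in (0, 1)."""
    hidden = [max(0.0, h) for h in dense(z, w_reduce)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w_expand)]


def se_block(features, w_reduce, w_expand, residual=False):
    """Scale each channel by its gate; optionally add the identity shortcut."""
    gates = excite(squeeze(features), w_reduce, w_expand)
    scaled = [[g * x for x in channel] for g, channel in zip(gates, features)]
    if residual:
        # SEResNet-style: gated features plus the unmodified input.
        scaled = [[s + x for s, x in zip(sc, channel)]
                  for sc, channel in zip(scaled, features)]
    return scaled


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def dual_task_forward(shared, action_params, scene_params):
    """Both branches consume the same shared features and return separately."""
    action_feat = se_block(shared, *action_params, residual=True)
    scene_feat = se_block(shared, *scene_params, residual=False)
    return action_feat, scene_feat


rng = random.Random(0)
channels, voxels, reduction = 4, 8, 2  # assumed sizes; voxels = T*H*W flattened

# Stand-in for the output of the shared convolution + max pooling layers.
shared = [[rng.random() for _ in range(voxels)] for _ in range(channels)]

action_params = (random_matrix(channels // reduction, channels, rng),
                 random_matrix(channels, channels // reduction, rng))
scene_params = (random_matrix(channels // reduction, channels, rng),
                random_matrix(channels, channels // reduction, rng))

action_feat, scene_feat = dual_task_forward(shared, action_params, scene_params)
print(len(action_feat), len(action_feat[0]))
```

In a full model each branch would stack several 3D convolutional blocks and end in its own classifier, so the two tasks share only the early layers and keep separate later features.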


Published In

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
September 2022
1221 pages
ISBN: 9781450396899
DOI: 10.1145/3573942
Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

AIPR 2022
