Abstract
Human action recognition is the task of labeling video frames with action labels. It is a challenging research topic because video backgrounds are usually cluttered, which degrades the performance of traditional human action recognition methods. In this paper, we propose a novel spatiotemporal saliency-based multi-stream ResNets (STS) architecture, which combines three streams (i.e., a spatial stream, a temporal stream and a spatiotemporal saliency stream) for human action recognition. Further, we propose a novel spatiotemporal saliency-based multi-stream ResNets with attention-aware long short-term memory (STS-ALSTM) network. The proposed STS-ALSTM model combines deep convolutional neural network (CNN) feature extractors with three attention-aware LSTMs to capture the long-term temporal dependencies among consecutive video frames, optical flow frames and spatiotemporal saliency frames. Experimental results on the UCF-101 and HMDB-51 datasets demonstrate that the proposed STS method and STS-ALSTM model achieve competitive performance compared with state-of-the-art methods.
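As a rough illustration of the design described above, the sketch below shows how three per-stream CNN feature extractors can be paired with attention-aware LSTMs and fused by score averaging. It is a minimal PyTorch sketch under stated assumptions: ResNet-18 backbones, a 256-unit LSTM hidden state, soft attention pooling over LSTM hidden states, three-channel inputs for all streams (real optical-flow inputs are typically stacked two-channel fields), and late fusion by averaging logits. None of these specifics are taken from the paper; they only indicate the general structure.

```python
# Minimal sketch of a three-stream network with attention-aware LSTMs.
# Backbone choice, layer sizes and the score-averaging fusion are illustrative
# assumptions, not the authors' exact STS-ALSTM configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class AttentionLSTM(nn.Module):
    """LSTM over per-frame CNN features, followed by soft attention pooling."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)        # one attention score per time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        h, _ = self.lstm(feats)                     # (batch, time, hidden_dim)
        alpha = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        context = (alpha * h).sum(dim=1)            # attention-weighted summary
        return self.classifier(context)             # (batch, num_classes) logits


class ThreeStreamAttnLSTM(nn.Module):
    """Spatial, temporal and spatiotemporal-saliency streams fused by score averaging."""

    def __init__(self, num_classes=101):
        super().__init__()

        def backbone():
            m = models.resnet18(weights=None)       # per-stream CNN feature extractor
            m.fc = nn.Identity()                    # expose the 512-d pooled features
            return m

        self.streams = nn.ModuleDict(
            {"rgb": backbone(), "flow": backbone(), "saliency": backbone()})
        self.heads = nn.ModuleDict(
            {k: AttentionLSTM(512, 256, num_classes) for k in self.streams})

    def forward(self, clips):                       # clips[k]: (batch, time, 3, H, W)
        scores = []
        for k, cnn in self.streams.items():
            b, t, c, h, w = clips[k].shape
            feats = cnn(clips[k].reshape(b * t, c, h, w)).reshape(b, t, -1)
            scores.append(self.heads[k](feats))
        return torch.stack(scores).mean(dim=0)      # late fusion of the three streams


if __name__ == "__main__":
    model = ThreeStreamAttnLSTM(num_classes=101)
    dummy = {k: torch.randn(2, 8, 3, 112, 112) for k in ("rgb", "flow", "saliency")}
    print(model(dummy).shape)                       # torch.Size([2, 101])
```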
Acknowledgements
This study is supported by the National Natural Science Foundation of China (Grant Nos. 61562013, 61906050), the Natural Science Foundation of Guangxi Province (CN) (2017GXNSFDA198025), the Study Abroad Program for Graduate Student of Guilin University of Electronic Technology (GDYX2018006), the National Natural Science Foundation of China (Grant 61602407), the Natural Science Foundation of Zhejiang Province (Grant LY18F020008), the China Scholarship Council (CSC) and the New Zealand China Doctoral Research Scholarships Program.
Cite this article
Liu, Z., Li, Z., Wang, R. et al. Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition. Neural Comput & Applic 32, 14593–14602 (2020). https://doi.org/10.1007/s00521-020-05144-7