Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition

  • S.I.: Deep Learning Approaches for Real-Time Image Super-Resolution (DLRSR)
  • Published in Neural Computing and Applications

Abstract

Human action recognition is the task of labeling video frames with action labels. It remains a challenging research topic because video backgrounds are often cluttered, which degrades the performance of traditional action recognition methods. In this paper, we propose a novel spatiotemporal saliency-based multi-stream ResNet (STS), which combines three streams (a spatial stream, a temporal stream and a spatiotemporal saliency stream) for human action recognition. We further propose a spatiotemporal saliency-based multi-stream ResNet with attention-aware long short-term memory (STS-ALSTM) network. The proposed STS-ALSTM model combines deep convolutional neural network (CNN) feature extractors with three attention-aware LSTMs to capture the long-term temporal dependencies between consecutive video frames, optical flow frames, or spatiotemporal saliency frames. Experimental results on the UCF-101 and HMDB-51 datasets demonstrate that the proposed STS method and STS-ALSTM model achieve competitive performance compared with state-of-the-art methods.
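
The abstract describes the STS-ALSTM design only at a high level: per-stream CNN feature extractors feed attention-aware LSTMs, and the three streams (RGB, optical flow, spatiotemporal saliency) are fused into a final prediction. Below is a minimal PyTorch sketch of that idea; the ResNet-18 backbone, hidden sizes, soft-attention form, and score-averaging fusion are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AttentionLSTMStream(nn.Module):
    """One stream: a ResNet feature extractor, an LSTM over per-frame
    features, and soft attention that pools the LSTM outputs over time."""

    def __init__(self, in_channels=3, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumption: paper's exact backbone may differ
        if in_channels != 3:  # e.g. a stack of optical-flow fields
            backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                       stride=2, padding=3, bias=False)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)  # one attention score per time step
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):  # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        h, _ = self.lstm(feats.view(b, t, -1))           # (B, T, hidden_dim)
        alpha = torch.softmax(self.attn(h), dim=1)       # weights sum to 1 over T
        return self.fc((alpha * h).sum(dim=1))           # attention-pooled logits


class ThreeStreamSTSALSTM(nn.Module):
    """Spatial (RGB), temporal (optical flow) and spatiotemporal-saliency
    streams; class scores are fused here by simple averaging (late fusion)."""

    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = AttentionLSTMStream(3, num_classes=num_classes)
        self.temporal = AttentionLSTMStream(2 * flow_stack, num_classes=num_classes)
        self.saliency = AttentionLSTMStream(3, num_classes=num_classes)

    def forward(self, rgb, flow, sal):
        return (self.spatial(rgb) + self.temporal(flow) + self.saliency(sal)) / 3
```

For UCF-101, num_classes would be 101; the temporal stream here assumes 10 stacked flow fields (horizontal and vertical components), following common two-stream practice rather than a setting confirmed by the paper.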

Acknowledgements

This study was supported by the National Natural Science Foundation of China (Grant Nos. 61562013, 61906050), the Natural Science Foundation of Guangxi Province (CN) (2017GXNSFDA198025), the Study Abroad Program for Graduate Students of Guilin University of Electronic Technology (GDYX2018006), the National Natural Science Foundation of China (Grant 61602407), the Natural Science Foundation of Zhejiang Province (Grant LY18F020008), the China Scholarship Council (CSC) and the New Zealand China Doctoral Research Scholarships Program.

Author information

Corresponding author

Correspondence to Ming Zong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, Z., Li, Z., Wang, R. et al. Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition. Neural Comput & Applic 32, 14593–14602 (2020). https://doi.org/10.1007/s00521-020-05144-7
