Spatiotemporal Saliency Based Multi-stream Networks for Action Recognition

Liu, Zhenbing; Li, Zeya; Zong, Ming; Ji, Wanting; Wang, Ruili; Tian, Yan

doi:10.1007/978-981-15-3651-9_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1180)

Included in the following conference series:

Asian Conference on Pattern Recognition

Abstract

Human action recognition is a challenging research topic because videos often contain cluttered backgrounds, which impair recognition performance. In this paper, we propose a novel spatiotemporal saliency based multi-stream ResNet for human action recognition, which combines three streams: a spatial stream with RGB frames as input, a temporal stream with optical flow frames as input, and a spatiotemporal saliency stream with spatiotemporal saliency maps as input. The spatiotemporal saliency stream captures spatiotemporal object foreground information from saliency maps generated by a geodesic distance based video segmentation method. Such an architecture can reduce background interference in videos and provide spatiotemporal object foreground information for human action recognition. Experimental results on the UCF101 and HMDB51 datasets demonstrate that the complementary spatiotemporal information further improves action recognition performance, and that the proposed method achieves competitive performance compared with state-of-the-art methods.
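The abstract does not say how the three streams are combined. As a minimal sketch, assuming late fusion by weighted averaging of per-stream softmax scores (a common choice in two-stream networks, not confirmed as the authors' method), the combination could look like the following. The function names, the stream keys, and the weights are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights=None):
    """Late fusion: weighted average of each stream's softmax scores.

    stream_logits maps a stream name (e.g. "spatial", "temporal",
    "saliency") to that stream's class logits for one video.
    weights maps the same names to non-negative fusion weights;
    equal weights are assumed when none are given.
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    weight_sum = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        if len(probs) != num_classes:
            raise ValueError("all streams must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += (weights[name] / weight_sum) * p
    return fused


def predict(stream_logits, weights=None):
    """Index of the class with the highest fused score."""
    fused = fuse_streams(stream_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)
```

Under this assumption each stream is trained and scored independently, so a saliency stream that suppresses background clutter can shift the fused decision without changing the other two networks.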

Acknowledgment

This study is supported by the National Natural Science Foundation of China (Grant No. 61562013), the Natural Science Foundation of Guangxi Province (CN) (2017GXNSFDA198025), the Study Abroad Program for Graduate Student of Guilin University of Electronic Technology (GDYX2018006), the Marsden Fund of New Zealand, the National Natural Science Foundation of China (Grant 61602407), Natural Science Foundation of Zhejiang Province (Grant LY18F020008), the China Scholarship Council (CSC) and the New Zealand China Doctoral Research Scholarships Program.

Author information

Authors and Affiliations

School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
Zhenbing Liu, Zeya Li, Ming Zong, Wanting Ji, Ruili Wang & Yan Tian
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
Zhenbing Liu, Zeya Li, Ming Zong, Wanting Ji, Ruili Wang & Yan Tian
School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
Zhenbing Liu, Zeya Li, Ming Zong, Wanting Ji, Ruili Wang & Yan Tian

Corresponding author

Correspondence to Ming Zong.

Editor information

Editors and Affiliations

University of Waikato, Hamilton, New Zealand
Michael Cree
National Ilan University, Yilan, Taiwan
Fay Huang
State University of New York at Buffalo, Buffalo, NY, USA
Junsong Yuan
Auckland University of Technology, Auckland, New Zealand
Wei Qi Yan

About this paper

Cite this paper

Liu, Z., Li, Z., Zong, M., Ji, W., Wang, R., Tian, Y. (2020). Spatiotemporal Saliency Based Multi-stream Networks for Action Recognition. In: Cree, M., Huang, F., Yuan, J., Yan, W. (eds) Pattern Recognition. ACPR 2019. Communications in Computer and Information Science, vol 1180. Springer, Singapore. https://doi.org/10.1007/978-981-15-3651-9_8

DOI: https://doi.org/10.1007/978-981-15-3651-9_8
Published: 07 March 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3650-2
Online ISBN: 978-981-15-3651-9
eBook Packages: Computer Science (R0)