Spatiotemporal Fusion Networks for Video Action Recognition

Liu, Zheng; Hu, Haifeng; Zhang, Junxuan

doi:10.1007/s11063-018-09972-6

Spatiotemporal Fusion Networks for Video Action Recognition

Published: 03 January 2019

Volume 50, pages 1877–1890, (2019)
Cite this article

Neural Processing Letters Aims and scope Submit manuscript

661 Accesses
4 Citations
Explore all metrics

Abstract

Learning spatiotemporal information is a fundamental part in action recognition. In this work, we attempt to extract efficient spatiotemporal information for video representation through a novel architecture, termed as SpatioTemporal Fusion Networks (STFN). STFN extract spatiotemporal information by introducing connections between the spatial and temporal streams in two-stream networks with fusion blocks, called as Compactly Fuse Spatial and Temporal information (CFST) block, whose goal is to integrate spatial and temporal information with little computational cost. CFST is built upon Compact Bilinear Pooling which can capture multiplicative interactions at corresponding locations. For better integration of two streams, we make an exploration of fusion configuration about where to insert fusion block and a combination of CFST block and additive interaction. We evaluate our proposed architecture on UCF-101 and HMDB-51, and obtain a comparable performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Spatio-Temporal Fusion Networks for Action Recognition

Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Article 28 October 2019

Spatiotemporal feature enhancement network for action recognition

Article 15 December 2023

References

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2:1097–1105
Google Scholar
Girshick R, Donahue J, Darrell T et al (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Karpathy A, Toderici G, Shetty S et al (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732
Ji S, Xu W, Yang M et al (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
Article Google Scholar
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. Advances in neural information processing systems, pp 3468–3476
Gao Y, Beijbom O, Zhang N et al (2016) Compact bilinear pooling. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 317–326
Soomro K, Zamir A R, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
Kuehne H, Jhuang H, Stiefelhagen R et al (2013) Hmdb51: A large video database for human motion recognition. In: High Performance Computing in Science and Engineering 12. Springer, Berlin, pp 571–582
Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123
Article Google Scholar
Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: European conference on computer vision. Springer, Berlin, pp 650–663
Wang H, Kläser A, Schmid C et al (2011) Action recognition by dense trajectories. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3169–3176
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Laptev I, Marszalek M, Schmid C et al (2008) Learning realistic human actions from movies. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8
Wang L, Xiong Y, Wang Z et al (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision. Springer, Cham, pp 20–36
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7445–7454
Tran D, Bourdev L, Fergus R et al (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
Qiu Z, Yao T, Mei T (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In: Proceedings of the IEEE international conference on computer vision, pp 5534–5542
Diba A, Fayyaz M, Sharma V et al (2017) Temporal 3D ConvNets: new architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200
Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4724–4733
Wang L, Li W, Li W, et al (2018) Appearance-and-relation networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1430–1439
Sun S, Kuang Z, Sheng L, et al (2018) Optical flow guided feature: a fast and robust motion representation for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1390–1399
Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
Zhu W, Hu J, Sun G et al (2016) A key volume mining deep framework for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1991–1999
Hong C, Yu J, Wan J et al (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670
Article MathSciNet Google Scholar
Hong C, Chen X, Wang X et al (2016) Hypergraph regularized autoencoder for image-based 3D human pose recovery. Signal Process 124:132–140
Article Google Scholar
Hong C, Yu J, Tao D et al (2015) Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Trans Ind Electron 62(6):3742–3751
Google Scholar
Yang M, Liu Y, You Z (2017) The Euclidean embedding learning based on convolutional neural network for stereo matching. Neurocomputing 267:195–200
Article Google Scholar
Qian S, Liu H, Liu C et al (2018) Adaptive activation functions in convolutional neural networks. Neurocomputing 272:204–212
Article Google Scholar
Yu Z, Yu J, Fan J et al (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 1839–1848
Yu Z, Yu J, Xiang C et al (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 99:1–13
Google Scholar
Kim JH, On KW, Lim W et al (2016) Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325
Simon M, Gao Y, Darrell T et al (2017) Generalized orderless pooling performs implicit salient matching. In: Proceedings of the IEEE international conference on computer vision, pp 4960–4969

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029, Grant 2016B010109002, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central Universities of China.

Author information

Authors and Affiliations

School of Electronics and Information Engineering, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
Zheng Liu, Haifeng Hu & Junxuan Zhang

Authors

Zheng Liu
View author publications
You can also search for this author in PubMed Google Scholar
Haifeng Hu
View author publications
You can also search for this author in PubMed Google Scholar
Junxuan Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Haifeng Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, Z., Hu, H. & Zhang, J. Spatiotemporal Fusion Networks for Video Action Recognition. Neural Process Lett 50, 1877–1890 (2019). https://doi.org/10.1007/s11063-018-09972-6

Download citation

Published: 03 January 2019
Issue Date: October 2019
DOI: https://doi.org/10.1007/s11063-018-09972-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Spatiotemporal Fusion Networks for Video Action Recognition

Abstract

Access this article

Similar content being viewed by others

Spatio-Temporal Fusion Networks for Action Recognition

Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Spatiotemporal feature enhancement network for action recognition

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Spatiotemporal Fusion Networks for Video Action Recognition

Abstract

Access this article

Similar content being viewed by others

Spatio-Temporal Fusion Networks for Action Recognition

Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Spatiotemporal feature enhancement network for action recognition

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation