Exploring hybrid spatio-temporal convolutional networks for human action recognition

Published in Multimedia Tools and Applications

Abstract

Convolutional neural networks have achieved great success in many computer vision tasks. However, action recognition in videos remains challenging due to the intrinsically complicated spatio-temporal correlations in video data and the computational cost of processing it. Existing methods usually neglect the fusion of long-term spatio-temporal information. In this paper, we propose a novel hybrid spatio-temporal convolutional network for action recognition. Specifically, we integrate three different types of streams into the network: (1) the image stream learns appearance information from still images; (2) the optical flow stream captures motion information from optical flow frames; (3) the dynamic image stream explores appearance and motion information simultaneously from generated dynamic images. Finally, a weighted fusion strategy at the softmax layer makes the class decision. With the help of these three streams, we take full advantage of the spatio-temporal information in videos. Extensive experiments on two popular human action recognition datasets demonstrate the superiority of our proposed method compared with several state-of-the-art approaches.
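The three-stream design and the late-fusion step lend themselves to a compact illustration. The sketch below (Python with NumPy; the function names and fusion weights are hypothetical placeholders, not taken from the paper) shows how a clip could be collapsed into a single dynamic image via approximate rank pooling in the style of Bilen et al. (2016), and how the per-stream class scores could be combined by weighted fusion at the softmax layer.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W, C) into one 'dynamic image'
    whose pixels encode temporal order, using the fixed coefficients
    of approximate rank pooling (Bilen et al., 2016)."""
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # Harmonic numbers harm[0..T], with harm[0] = 0.
    harm = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    alphas = 2 * (T - t + 1) - (T + 1) * (harm[T] - harm[t - 1])
    # Weighted sum of frames over the temporal axis -> (H, W, C).
    return np.tensordot(alphas, frames, axes=1)

def fuse_streams(image_logits, flow_logits, dynamic_logits,
                 weights=(1.0, 1.5, 1.0)):
    """Weighted late fusion of the three stream predictions at the
    softmax layer. The weights here are illustrative placeholders."""
    w_img, w_flow, w_dyn = weights
    scores = (w_img * softmax(image_logits)
              + w_flow * softmax(flow_logits)
              + w_dyn * softmax(dynamic_logits))
    return int(np.argmax(scores))  # index of the predicted action class
```

For example, given per-stream logits over 101 classes (as on UCF101), `fuse_streams(img, flow, dyn)` returns the predicted action index; up-weighting the optical flow stream reflects the common finding in two-stream networks that motion cues carry much of the discriminative signal.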

Acknowledgements

The authors would like to thank the Editor-in-Chief, the handling associate editor, and all anonymous reviewers for their consideration and suggestions. This work was supported by the National Natural Science Foundation of China (61572388).

Author information

Corresponding author

Correspondence to Cheng Deng.


About this article

Cite this article

Wang, H., Yang, Y., Yang, E. et al. Exploring hybrid spatio-temporal convolutional networks for human action recognition. Multimed Tools Appl 76, 15065–15081 (2017). https://doi.org/10.1007/s11042-017-4514-3

