Exploring hybrid spatio-temporal convolutional networks for human action recognition

Published in Multimedia Tools and Applications

Abstract

Convolutional neural networks have achieved great success in many computer vision tasks. However, action recognition in videos remains challenging due to the intrinsically complicated spatio-temporal correlations in video data and the computational cost of processing it. Existing methods usually neglect the fusion of long-term spatio-temporal information. In this paper, we propose a novel hybrid spatio-temporal convolutional network for action recognition. Specifically, we integrate three different types of streams into the network: (1) the image stream learns appearance information from still images; (2) the optical flow stream captures motion information from optical flow frames; (3) the dynamic image stream explores appearance and motion information simultaneously from generated dynamic images. Finally, a weighted fusion strategy at the softmax layer makes the class decision. With the help of these three streams, we take full advantage of the spatio-temporal information in videos. Extensive experiments on two popular human action recognition datasets demonstrate the superiority of our proposed method compared with several state-of-the-art approaches.
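The three-stream design and the late-fusion step lend themselves to a compact illustration. The sketch below (Python with NumPy; the function names and fusion weights are hypothetical placeholders, not taken from the paper) shows how a clip could be collapsed into a single dynamic image via approximate rank pooling in the style of Bilen et al. (2016), and how the per-stream class scores could be combined by weighted fusion at the softmax layer.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W, C) into one 'dynamic image'
    whose pixels encode temporal order, using the fixed coefficients
    of approximate rank pooling (Bilen et al., 2016)."""
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # Harmonic numbers harm[0..T], with harm[0] = 0.
    harm = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    alphas = 2 * (T - t + 1) - (T + 1) * (harm[T] - harm[t - 1])
    # Weighted sum of frames over the temporal axis -> (H, W, C).
    return np.tensordot(alphas, frames, axes=1)

def fuse_streams(image_logits, flow_logits, dynamic_logits,
                 weights=(1.0, 1.5, 1.0)):
    """Weighted late fusion of the three stream predictions at the
    softmax layer. The weights here are illustrative placeholders."""
    w_img, w_flow, w_dyn = weights
    scores = (w_img * softmax(image_logits)
              + w_flow * softmax(flow_logits)
              + w_dyn * softmax(dynamic_logits))
    return int(np.argmax(scores))  # index of the predicted action class
```

For example, given per-stream logits over 101 classes (as on UCF101), `fuse_streams(img, flow, dyn)` returns the predicted action index; up-weighting the optical flow stream reflects the common finding in two-stream networks that motion cues carry much of the discriminative signal.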

Acknowledgements

The authors would like to thank the Editor-in-Chief, the handling associate editor, and all anonymous reviewers for their consideration and suggestions. This work was supported by the National Natural Science Foundation of China (61572388).

Author information

Corresponding author

Correspondence to Cheng Deng.


About this article

Cite this article

Wang, H., Yang, Y., Yang, E. et al. Exploring hybrid spatio-temporal convolutional networks for human action recognition. Multimed Tools Appl 76, 15065–15081 (2017). https://doi.org/10.1007/s11042-017-4514-3

