More efficient and effective tricks for deep action recognition

Cluster Computing (2019)

Abstract

Deep convolutional networks have achieved great success in visual recognition of static images, yet they hold no clear advantage over traditional methods in video action recognition. Although two-stream convolutional networks achieve the best performance in human action recognition, obstacles remain, such as selecting pre-trained models and hyper-parameters, and high computational cost. In this paper, we propose two efficient and effective methods for action recognition based on the two-stream convolutional network: (1) reducing the computational cost of the temporal stream while achieving the same accuracy, and (2) providing techniques for assembling an action recognition pipeline, including the selection of the optical flow algorithm, the pre-training dataset/architecture, and the hyper-parameters. Experimental results show that we obtain performance on a par with the state of the art on the HMDB51 (70.9%) and UCF101 (95.4%) datasets.
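The two-stream approach combines a spatial stream (RGB frames) and a temporal stream (stacked optical flow) and fuses their per-class scores at the end. A minimal sketch of such late fusion is shown below; the softmax averaging and the temporal weight of 1.5 are illustrative assumptions, not the settings reported in this paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_stream_fusion(spatial_logits, temporal_logits, w_temporal=1.5):
    """Late fusion of the two streams: weighted average of per-stream
    class probabilities, then argmax to get the predicted action class.
    The 1.5 weight on the temporal stream is a commonly used illustrative
    choice, not necessarily the weighting used in this paper."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    fused = (p_spatial + w_temporal * p_temporal) / (1.0 + w_temporal)
    return int(fused.argmax(axis=-1))
```

For example, if the temporal stream strongly favors one class while the spatial stream only weakly favors another, the up-weighted temporal evidence decides the prediction.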



Acknowledgements

The authors of this paper are members of the Shanghai Engineering Research Center of Intelligent Video Surveillance. Dr. Lei Song is also a visiting researcher with the Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China. Our research was sponsored by the following projects: the National Natural Science Foundation of China (61402116, 61403084); the Program of the Science and Technology Commission of Shanghai Municipality (Nos. 15530701300, 15XD15202000); the 2012 IoT Program of the Ministry of Industry and Information Technology of China; the Key Project of the Ministry of Public Security (No. 2014JSYJA007); the Project of the Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University (ESSCKF 2015-03); the Shanghai Rising-Star Program (17QB1401000); and the Special Fund for Basic R&D Expenses of Central Level Public Welfare Scientific Research Institutions (C17384).

Author information


Corresponding author

Correspondence to Lei Song.


About this article


Cite this article

Liu, Z., Zhang, X., Song, L. et al. More efficient and effective tricks for deep action recognition. Cluster Comput 22 (Suppl 1), 819–826 (2019). https://doi.org/10.1007/s10586-017-1309-2
