Stratified pooling based deep convolutional neural networks for human action recognition

Yu, Sheng; Cheng, Yun; Su, Songzhi; Cai, Guorong; Li, Shaozi

doi:10.1007/s11042-016-3768-5

Published: 15 July 2016

Multimedia Tools and Applications, Volume 76, pages 13367–13382 (2017)

Sheng Yu^1,2,3,
Yun Cheng^2,
Songzhi Su^1,3,
Guorong Cai^4 &
Shaozi Li^1,3

Abstract

Video-based human action recognition is an active and challenging topic in computer vision. Over the last few years, deep convolutional neural networks (CNNs) have become the most popular approach and have achieved state-of-the-art performance on several datasets, such as HMDB-51 and UCF-101. Because each video yields a varying number of frame-level features, combining them into a good video-level feature is a challenging task. This paper therefore proposes a novel action recognition method named stratified pooling, which is based on deep convolutional neural networks (SP-CNN). The process consists of five main parts: (i) fine-tuning a pre-trained CNN on the target dataset; (ii) extracting frame-level features; (iii) reducing feature dimensionality with principal component analysis (PCA); (iv) stratified pooling of the frame-level features to obtain a video-level feature; and (v) multiclass classification with an SVM. Experimental results on the HMDB-51 and UCF-101 datasets show that the proposed method outperforms the state of the art.
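
The core difficulty named in the abstract, turning a varying number of frame-level features into one fixed-length video-level feature, can be illustrated with a minimal sketch of step (iv). The abstract does not define how the strata are formed or which pooling operator is used, so everything below is an assumption made for illustration: the function name `stratified_pool`, the parameter `num_strata`, the choice of contiguous temporal segments as strata, and max pooling within each stratum are all hypothetical and may differ from the authors' method. Steps (i) to (iii) and (v) are omitted; the input stands in for PCA-reduced frame-level CNN features.

```python
from typing import List, Sequence


def stratified_pool(frames: Sequence[Sequence[float]], num_strata: int = 3) -> List[float]:
    """Pool T frame-level vectors of length D into one vector of length num_strata * D.

    Assumption (not from the paper): a stratum is a contiguous temporal
    segment of the video, and features are max-pooled within each stratum.
    """
    if not frames:
        raise ValueError("need at least one frame-level feature")
    if num_strata < 1:
        raise ValueError("num_strata must be positive")

    num_frames = len(frames)
    dim = len(frames[0])
    video_feature: List[float] = []

    for s in range(num_strata):
        # Integer boundaries that split the frame sequence as evenly as possible.
        start = (s * num_frames) // num_strata
        end = ((s + 1) * num_frames) // num_strata
        # Videos shorter than num_strata can produce an empty range;
        # fall back to a single frame so every stratum contributes D values.
        if end <= start:
            start = min(start, num_frames - 1)
            end = start + 1
        segment = frames[start:end]
        # Element-wise max over the frames in this stratum.
        video_feature.extend(max(frame[d] for frame in segment) for d in range(dim))

    return video_feature


# Two toy "videos" with different frame counts but the same feature dimension (D = 2).
short_video = [[1.0, 0.0], [2.0, 5.0], [0.0, 1.0]]
long_video = [[1.0, 0.0], [2.0, 5.0], [0.0, 1.0], [4.0, 2.0], [3.0, 9.0]]

short_feature = stratified_pool(short_video, num_strata=2)
long_feature = stratified_pool(long_video, num_strata=2)
```

Both calls return a vector of length `num_strata * D` (here 4), regardless of how many frames the video has, which is the property a downstream classifier such as a linear SVM needs. Concatenating per-stratum results, rather than pooling over the whole video at once, keeps some coarse temporal ordering in the video-level feature.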

Acknowledgments

This work is supported by the Natural Science Foundation of China (No. 61202143, No. 61572409, No. 61571188), the Natural Science Foundation of Fujian Province (No. 2013J05100), and the Research Foundation of Education Bureau of Hunan Province (No. 15C0726).

Author information

Authors and Affiliations

Cognitive Science Department, Xiamen University, Xiamen, 361005, China
Sheng Yu, Songzhi Su & Shaozi Li
School of Information, Hunan University of Humanities, Science and Technology, Loudi, 417000, China
Sheng Yu & Yun Cheng
Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen, 361005, China
Sheng Yu, Songzhi Su & Shaozi Li
Computer Engineering College, Jimei University, Xiamen, 361005, China
Guorong Cai

Corresponding author

Correspondence to Shaozi Li.

About this article

Cite this article

Yu, S., Cheng, Y., Su, S. et al. Stratified pooling based deep convolutional neural networks for human action recognition. Multimed Tools Appl 76, 13367–13382 (2017). https://doi.org/10.1007/s11042-016-3768-5

Received: 09 December 2015
Revised: 03 July 2016
Accepted: 06 July 2016
Published: 15 July 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s11042-016-3768-5
