Abstract
We present a simple yet effective approach to human action recognition. Most existing solutions based on multi-class action classification aim to assign a single class label to the input video. However, the variety and complexity of real-life videos make it very challenging to achieve high classification accuracy. To address this problem, we propose to partition the input video into small clips and formulate action recognition as a joint decision-making task. First, we partition each video into two equal segments that are processed in the same manner. We repeat this procedure to obtain three layers of video subsegments, which are organized in a binary tree structure. We train separate classifiers for each layer. By applying the corresponding classifiers to the video subsegments, we obtain a decision value matrix (DVM). We then construct an aggregated representation of the original full-length video by integrating the elements of the DVM. Finally, we train a new action recognition classifier on this DVM representation. Extensive experimental evaluations demonstrate that the proposed method achieves significant performance improvements over several competing methods on two benchmark datasets.
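The pipeline described above can be sketched in a few lines. The snippet below is an illustrative reading of the abstract, not the authors' implementation: `binary_tree_segments` recursively halves a frame range into layered subsegments, and `decision_value_matrix` stacks per-segment classifier outputs into one aggregated vector. The `classifiers` and `features` arguments are hypothetical placeholders (e.g. per-layer SVMs exposing `decision_function` and a feature extractor over frame ranges); the exact layering and aggregation in the paper may differ.

```python
import numpy as np


def binary_tree_segments(num_frames, num_layers=3):
    """Halve [0, num_frames) recursively into a binary tree of subsegments.

    Returns a list of layers; layer k holds 2**k equal segments
    (layer 0 is the full clip). One plausible reading of the
    three-layer structure described in the abstract.
    """
    layers = []
    for k in range(num_layers):
        n = 2 ** k
        bounds = np.linspace(0, num_frames, n + 1, dtype=int)
        layers.append([(int(bounds[i]), int(bounds[i + 1])) for i in range(n)])
    return layers


def decision_value_matrix(layers, classifiers, features):
    """Stack per-segment decision values into one aggregated DVM vector.

    `classifiers[k]` is assumed to expose `decision_function` (e.g. an
    SVM trained on layer-k segments); `features(a, b)` returns a feature
    vector for frames [a, b). Both are placeholders for illustration.
    """
    rows = []
    for k, segments in enumerate(layers):
        for a, b in segments:
            rows.append(np.ravel(classifiers[k].decision_function(features(a, b))))
    # Concatenated decision values form the representation fed
    # to the final full-video classifier.
    return np.concatenate(rows)
```

For a 120-frame video, `binary_tree_segments(120)` yields the full clip, its two halves, and four quarters, on which the layer-specific classifiers would be evaluated.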
Acknowledgments
This work was supported in part by the National Science Foundation of China (No. 61472103) and an Australian Research Council (ARC) grant (DP150104645). We would especially like to thank the China Scholarship Council (CSC) for funding the first author to conduct part of this project at the Australian National University.
Cite this article
Zheng, Y., Yao, H., Sun, X. et al. Breaking video into pieces for action recognition. Multimed Tools Appl 76, 22195–22212 (2017). https://doi.org/10.1007/s11042-017-5038-6