Abstract
Action recognition is one of the most popular fields of computer vision, and considerable effort has gone into improving recognition accuracy. Although multiple descriptors are extracted to represent an action, their spatio-temporal information is lost. To incorporate this information, we propose a novel method, the augmented descriptor, which adds spatio-temporal information to the original descriptor. Because descriptors capture different video features, such as static appearance and motion, previous methods simply concatenate them. We instead propose a fusion method that uses Multiple Kernel Learning to combine the different descriptors into a better representation and thereby improve recognition accuracy. We also evaluate how the normalization method contributes to recognition accuracy. The proposed methods are tested on two benchmark datasets, Olympic Sports and HMDB51. The experimental results show that our approaches outperform the improved trajectories baseline and are effective at recognizing various actions.
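As a rough illustration of the two ideas above, the sketch below makes two assumptions that the abstract does not confirm. It assumes the augmented descriptor is formed by appending each local descriptor's normalized (x, y, t) position, and it stands in for Multiple Kernel Learning with a fixed, hand-picked weighted sum of per-descriptor kernels. The function names and the `video_shape` argument are hypothetical.

```python
import numpy as np


def augment_descriptors(desc, xyt, video_shape):
    """Append normalized (x, y, t) positions to each local descriptor.

    desc: (n, d) array of local descriptors.
    xyt: (n, 3) array of positions in pixels (x, y) and frames (t).
    video_shape: (width, height, num_frames); a hypothetical argument
        used only to scale the positions into [0, 1].
    """
    desc = np.asarray(desc, dtype=float)
    pos = np.asarray(xyt, dtype=float) / np.asarray(video_shape, dtype=float)
    return np.hstack([desc, pos])


def fuse_kernels(kernels, weights):
    """Combine per-descriptor kernel matrices as a weighted sum.

    In Multiple Kernel Learning the weights are learned together with
    the classifier; here they are fixed by hand (an assumption).
    """
    weights = np.asarray(weights, dtype=float)
    if len(weights) != len(kernels):
        raise ValueError("need exactly one weight per kernel")
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    return sum(w * np.asarray(K, dtype=float) for w, K in zip(weights, kernels))


# Two 4-dimensional descriptors sampled from a 320x240 video of 30 frames.
augmented = augment_descriptors(
    np.ones((2, 4)),
    [[160, 120, 15], [320, 240, 30]],
    (320, 240, 30),
)

# Equal-weight fusion of two toy 2x2 kernel matrices.
fused = fuse_kernels([np.eye(2), np.ones((2, 2))], [0.5, 0.5])
```

The fused matrix could then be passed to any classifier that accepts a precomputed kernel, with the weights replaced by learned ones in a real MKL setup.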
Additional information
An erratum to this article is available at http://dx.doi.org/10.1007/s11042-016-3889-x.
About this article
Cite this article
Li, L., Dai, S. Action recognition with spatio-temporal augmented descriptor and fusion method. Multimed Tools Appl 76, 13953–13969 (2017). https://doi.org/10.1007/s11042-016-3789-0
DOI: https://doi.org/10.1007/s11042-016-3789-0