Action recognition with spatio-temporal augmented descriptor and fusion method

Li, Lijun; Dai, Shuling

doi:10.1007/s11042-016-3789-0

Published: 29 July 2016

Volume 76, pages 13953–13969, (2017)

Multimedia Tools and Applications

An Erratum to this article was published on 07 September 2016

Abstract

Action recognition is one of the most active areas of computer vision, and considerable effort has gone into improving recognition accuracy. Although multiple descriptors are extracted to represent an action, their spatio-temporal information is lost. To incorporate this information, we propose a novel method, the augmented descriptor, which adds the spatio-temporal information to the original descriptor. Because descriptors represent different video features, such as static appearance and motion, previous methods simply concatenate the various descriptors. In contrast, we propose a fusion method that uses Multiple Kernel Learning to fuse the different descriptors into a better representation and thereby boost recognition accuracy. We also evaluate the contribution of the normalization method to recognition accuracy. The proposed methods are tested on two benchmark datasets, Olympic Sports and HMDB51. The experimental results show that our approaches outperform the improved-trajectories baseline and are effective in recognizing various actions.
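
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the spatio-temporal information is encoded or normalized, so appending the normalized (x, y, t) position to each local descriptor is an assumption, and the function names are hypothetical. Multiple Kernel Learning learns the kernel weights from training data; here the weights are fixed by hand, which only shows the form of the fused kernel.

```python
import numpy as np


def augment_descriptors(descriptors, positions, video_shape):
    """Append a normalized (x, y, t) position to each local descriptor.

    descriptors: (n, d) array of local descriptors.
    positions:   (n, 3) array of (x, y, t) locations of those descriptors.
    video_shape: (width, height, n_frames), used to scale positions.
    Assumption: plain division by the video extent; the paper may differ.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    positions = np.asarray(positions, dtype=float)
    scale = np.asarray(video_shape, dtype=float)
    return np.hstack([descriptors, positions / scale])


def linear_kernel(X):
    """Gram matrix of a linear kernel over per-video representations."""
    X = np.asarray(X, dtype=float)
    return X @ X.T


def fuse_kernels(kernels, weights):
    """Weighted sum of per-descriptor kernel matrices.

    Assumption: weights are given, not learned. MKL would optimize them
    jointly with the classifier.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return sum(w * np.asarray(K, dtype=float) for w, K in zip(weights, kernels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Five local descriptors of dimension 8 from a 320x240, 30-frame video.
    local = rng.random((5, 8))
    where = rng.random((5, 3)) * np.array([320, 240, 30])
    augmented = augment_descriptors(local, where, (320, 240, 30))
    print(augmented.shape)  # three extra columns: x, y, t

    # Per-video representations for four videos and two descriptor types.
    appearance = rng.random((4, 16))
    motion = rng.random((4, 16))
    fused = fuse_kernels(
        [linear_kernel(appearance), linear_kernel(motion)],
        weights=[0.4, 0.6],
    )
    print(fused.shape)
```

The fused matrix could then be passed to a kernel classifier that accepts a precomputed kernel. Simple concatenation, by comparison, corresponds to summing linear kernels with equal weights, so a weighted sum is the more general form.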

Author information

Authors and Affiliations

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Lijun Li & Shuling Dai

Corresponding author

Correspondence to Lijun Li.

Additional information

An erratum to this article is available at http://dx.doi.org/10.1007/s11042-016-3889-x.

About this article

Cite this article

Li, L., Dai, S. Action recognition with spatio-temporal augmented descriptor and fusion method. Multimed Tools Appl 76, 13953–13969 (2017). https://doi.org/10.1007/s11042-016-3789-0

Received: 27 November 2015
Revised: 13 July 2016
Accepted: 18 July 2016
Published: 29 July 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s11042-016-3789-0