Action recognition with spatio-temporal augmented descriptor and fusion method

Multimedia Tools and Applications

An Erratum to this article was published on 07 September 2016

Abstract

Action recognition is one of the most popular fields of computer vision, and considerable effort has gone into improving recognition accuracy. When multiple descriptors are extracted to represent an action, however, their spatio-temporal information is lost. To incorporate this information, we propose an augmented descriptor, which adds spatio-temporal information to the original descriptor. Different descriptors capture different video features, such as static appearance and motion, and previous methods simply concatenate them. We instead propose a fusion method that uses Multiple Kernel Learning to combine the descriptors into a better representation and thereby boost recognition accuracy. We also evaluate the contribution of the normalization method to recognition accuracy. The proposed methods are tested on two benchmark datasets, Olympic Sports and HMDB51. The experimental results show that our approaches outperform the improved-trajectories baseline and are effective in recognizing various actions.
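The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify these details. It assumes that "augmenting" means appending each local descriptor's normalized (x, y, t) location to the descriptor, and it replaces the learned Multiple Kernel Learning weights with fixed, user-supplied weights. The names augment_descriptors, linear_kernel and fuse_kernels are hypothetical.

```python
import numpy as np


def augment_descriptors(descriptors, positions, video_shape):
    """Append normalized (x, y, t) coordinates to each local descriptor.

    Assumption: this is one plausible reading of "augmented descriptor";
    the abstract does not give the exact construction.

    descriptors: (n, d) array of local descriptors.
    positions:   (n, 3) array of (x, y, t) locations of those descriptors.
    video_shape: (width, height, num_frames), used to scale positions.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    positions = np.asarray(positions, dtype=float)
    scale = np.asarray(video_shape, dtype=float)
    normalized = positions / scale  # each coordinate divided by its extent
    return np.hstack([descriptors, normalized])


def linear_kernel(a, b):
    """Gram matrix of inner products between the rows of a and b."""
    return np.asarray(a, dtype=float) @ np.asarray(b, dtype=float).T


def fuse_kernels(kernels, weights):
    """Combine per-descriptor kernel matrices as a convex weighted sum.

    Assumption: the weights are given. In Multiple Kernel Learning they
    would be learned jointly with the classifier, which is not shown here.
    """
    weights = np.asarray(weights, dtype=float)
    if weights.ndim != 1 or len(weights) != len(kernels):
        raise ValueError("need exactly one weight per kernel")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return sum(w * np.asarray(k, dtype=float) for w, k in zip(weights, kernels))
```

With equal weights and linear kernels, the fused kernel equals a scaled linear kernel on the concatenated descriptors, so plain concatenation is a special case of this weighted sum. Unequal weights let one descriptor type count more than another, which is the flexibility a learned fusion adds over concatenation.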

Author information

Corresponding author

Correspondence to Lijun Li.

Additional information

An erratum to this article is available at http://dx.doi.org/10.1007/s11042-016-3889-x.

About this article

Cite this article

Li, L., Dai, S. Action recognition with spatio-temporal augmented descriptor and fusion method. Multimed Tools Appl 76, 13953–13969 (2017). https://doi.org/10.1007/s11042-016-3789-0

  • DOI: https://doi.org/10.1007/s11042-016-3789-0
