
Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information

Published in: Multimedia Tools and Applications

Abstract

Feature extraction and encoding are two of the most crucial steps in an action recognition system. To build a powerful action recognition pipeline, both steps must be efficient while at the same time providing reliable performance. This work proposes a new approach to feature extraction and encoding that enables real-time frame-rate processing for an action recognition system. Motion is an important source of information within a video, and the common way to extract it is to compute the optical flow. However, optical flow estimation is very demanding in terms of computational cost; in many cases it is the most expensive processing step in the overall pipeline of the target video analysis application. In this work we propose an efficient approach to capturing the motion information within the video. Our proposed descriptor, Histograms of Motion Gradients (HMG), is based on simple temporal and spatial derivation, which captures the changes between two consecutive frames. For the encoding step, a widely adopted method is the Vector of Locally Aggregated Descriptors (VLAD), an efficient encoding method that, however, considers only the differences between local descriptors and their centroids. In this work we propose Shape Difference VLAD (SD-VLAD), an encoding method that brings complementary information by using shape information within the encoding process. We validate our proposed pipeline for action recognition on three challenging datasets, UCF50, UCF101 and HMDB51, and we also propose a real-time framework for action recognition.
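The two core ideas named in the abstract can be sketched in a few lines: HMG replaces optical flow with a cheap temporal frame difference followed by spatial derivation and orientation binning, and VLAD aggregates residuals between local descriptors and their nearest codebook centroid. The sketch below is illustrative only, not the paper's implementation; the cell/block layout, bin count, normalization, and all parameter names are assumptions, and the shape-difference statistics that distinguish SD-VLAD from plain VLAD are omitted.

```python
import numpy as np

def hmg_histogram(frame_t, frame_t1, n_bins=8):
    """Sketch of the HMG idea: temporal derivation between two
    consecutive frames, then spatial derivation, binned into a
    magnitude-weighted orientation histogram."""
    # Temporal derivation: simple frame difference (cheap proxy for optical flow)
    dt = frame_t1.astype(np.float32) - frame_t.astype(np.float32)
    # Spatial derivation of the temporal signal
    gy = np.gradient(dt, axis=0)
    gx = np.gradient(dt, axis=1)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # orientation in [0, 2*pi)
    # Quantize orientations and accumulate gradient magnitudes per bin
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def vlad_encode(descriptors, centroids):
    """Standard VLAD: sum the residuals between each local descriptor
    and its nearest centroid, concatenate over centroids, L2-normalize."""
    # Hard-assign each descriptor to its nearest centroid
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    K, D = centroids.shape
    vlad = np.zeros((K, D), dtype=np.float32)
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            vlad[k] = (sel - centroids[k]).sum(axis=0)
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the temporal step is a plain frame difference rather than an optical flow solve, the extraction cost per frame is a handful of array passes, which is what makes real-time frame rates plausible.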






Acknowledgments

Part of this work was funded under research grant PN-III-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER. This material is based in part on work supported by the National Science Foundation (NSF) under grant number IIS-1251187.

Author information

Correspondence to Ionut C. Duta.


About this article


Cite this article

Duta, I.C., Uijlings, J.R.R., Ionescu, B. et al. Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information. Multimed Tools Appl 76, 22445–22472 (2017). https://doi.org/10.1007/s11042-017-4795-6


