Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information

Duta, Ionut C.; R. Uijlings, Jasper R.; Ionescu, Bogdan; Aizawa, Kiyoharu; G. Hauptmann, Alexander; Sebe, Nicu

doi:10.1007/s11042-017-4795-6

Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information

Published: 13 May 2017

Volume 76, pages 22445–22472, (2017)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Ionut C. Duta ORCID: orcid.org/0000-0003-4009-8922¹,
Jasper R. R. Uijlings²,
Bogdan Ionescu³,
Kiyoharu Aizawa⁴,
Alexander G. Hauptmann⁵ &
…
Nicu Sebe¹

649 Accesses
23 Citations
Explore all metrics

Abstract

Feature extraction and encoding represent two of the most crucial steps in an action recognition system. For building a powerful action recognition pipeline it is important that both steps are efficient and in the same time provide reliable performance. This work proposes a new approach for feature extraction and encoding that allows us to obtain real-time frame rate processing for an action recognition system. The motion information represents an important source of information within the video. The common approach to extract the motion information is to compute the optical flow. However, the estimation of optical flow is very demanding in terms of computational cost, in many cases being the most significant processing step within the overall pipeline of the target video analysis application. In this work we propose an efficient approach to capture the motion information within the video. Our proposed descriptor, Histograms of Motion Gradients (HMG), is based on a simple temporal and spatial derivation, which captures the changes between two consecutive frames. For the encoding step a widely adopted method is the Vector of Locally Aggregated Descriptors (VLAD), which is an efficient encoding method, however, it considers only the difference between local descriptors and their centroids. In this work we propose Shape Difference VLAD (SD-VLAD), an encoding method which brings complementary information by using the shape information within the encoding process. We validated our proposed pipeline for action recognition on three challenging datasets UCF50, UCF101 and HMDB51, and we propose also a real-time framework for action recognition.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Automated video analysis for action recognition using descriptors derived from optical acceleration

Article 31 January 2019

A Robust and Efficient Video Representation for Action Recognition

Article 17 July 2015

Action recognition in depth videos using hierarchical gaussian descriptor

Article 13 January 2018

Notes

References

Arandjelović R, Zisserman A (2012) Three things everyone should know to improve object retrieval. In: CVPR
Arandjelovic R, Zisserman A (2013) All about VLAD. In: CVPR
Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: CVPR
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Article MATH Google Scholar
Brox T, Malik J (2011) Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI 33(3):500–513
Article Google Scholar
Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: ECCV
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: CVPR
Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: ECCV
de Souza CR, Gaidon A, Vig E, López AM (2016) Sympathy for the details: Dense trajectories and hybrid classification architectures for action recognition. In: ECCV
Duta IC, Nguyen TA, Aizawa K, Ionescu B, Sebe N (2016) Boosting VLAD with double assignment using deep features for action recognition in videos. In: ICPR
Duta IC, Uijlings JRR, Nguyen TA, Aizawa K, Hauptmann AG, Ionescu B, Sebe N (2016) Histograms of motion gradients for real-time video classification. In: CBMI
Duta IC, Ionescu B, Aizawa K, Sebe N (2017) Spatio-temporal VLAD encoding for human action recognition in videos. In: MMM
Farnebäck G. (2003) Two-frame motion estimation based on polynomial expansion. In: Image analysis, pp 363–370
Horn B K, Schunck BG (1981) Determining optical flow. In: 1981 Technical symposium east, pp 319–331. International Society for Optics and Photonics
Jain M, Jégou H, Bouthemy P (2013) Better exploiting motion for better action recognition. In: CVPR
Jégou H, Perronnin F, Douze M, Sanchez J, Perez P, Schmid C (2012) Aggregating local image descriptors into compact codes. TPAMI 34(9):1704–1716
Article Google Scholar
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: ICCV
Kantorov V, Laptev I (2014) Efficient feature extraction, encoding and classification for action recognition. In: CVPR
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: CVPR
Kliper-Gross O, Gurovich Y, Hassner T, Wolf L (2012) Motion interchange patterns for action recognition in unconstrained videos. In: ECCV
Krapac J, Verbeek J, Jurie F (2011) Modeling spatial layout with fisher vectors for image categorization. In: ICCV
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition. In: ICCV
Laptev I (2005) On space-time interest points. IJCV 64(2-3):107–123
Article Google Scholar
Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: CVPR
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR
Lowe D G (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
Article Google Scholar
Lucas BD, Kanade T et al (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI
Mironică I, Duţă I C, Ionescu B, Sebe N (2016) A modified vector of locally aggregated descriptors approach for fast video classification. Multimed Tools Appl
Oneata D, Verbeek J, Schmid C (2013) Action and event recognition with fisher vectors on a compact feature set. In: ICCV
Park E, Han X, Berg TL, Berg AC (2016) Combining multiple sources of knowledge in deep cnns for action recognition. In: WACV
Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. arXiv:1405.4506
Peng X, Wang L, Qiao Y, Peng Q (2014) Boosting vlad with supervised dictionary learning and high-order statistics. In: ECCV
Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: ECCV
Poularakis S, Avgerinakis K, Briassouli A, Kompatsiaris I (2015) Computationally efficient recognition of activities of daily living. In: ICIP
Reddy K K, Shah M (2013) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981
Article Google Scholar
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: NIPS
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Sivic J, Zisserman A (2003) Video google: a text retrieval approach to object matching in videos. In: Ninth international conference on computer vision, 2003 Proceedings, pp 1470–1477
Solmaz B, Assari S M, Shah M (2013) Classifying web videos using a global video descriptor. Mach Vis Appl
Soomro K, Zamir A R, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Sun J, Wu X, Yan S, Cheong L-F, Chua T-S, Li J (2009) Hierarchical spatio-temporal context modeling for action recognition. In: CVPR
Sun L, Jia K, Yeung D-Y, Shi BE (2015) Human action recognition using factorized spatio-temporal convolutional networks. In: ICCV
Uijlings J R, Smeulders A W, Scha R J (2010) Real-time visual concept classification. TMR 12(7):665– 681
Google Scholar
Uijlings JRR, Duta IC, Rostamzadeh N, Sebe N (2014) Realtime video classification using dense hof/hog. In: ICMR
Uijlings J R R, Duta I C, Sangineto E, Sebe N (2015) Video classification with densely extracted hog/hof/mbh features: an evaluation of the accuracy/computational efficiency trade-off. Int J Multimed Inf Retriev
Vedaldi A, Fulkerson B (2010) Vlfeat: an open and portable library of computer vision algorithms. In: ACM Multimedia
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV
Wang H, Schmid C (2013) Lear-inria submission for the thumos workshop. In: ICCV Workshop
Wang H, Ullah MM, Klaser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC
Wang H, Kläser A, Schmid C, Liu C -L (2013) Dense trajectories and motion boundary descriptors for action recognition. IJCV 103(1):60–79
Article MathSciNet Google Scholar
Wang H, Oneata D, Verbeek J, Schmid C (2015) A robust and efficient video representation for action recognition. IJCV
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: ECCV
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: CVPR
Zach C, Pock T, Bischof H (2007) A duality based approach for realtime tv-l 1 optical flow. In: Pattern recognition
Zhou X, Yu K, Zhang T, Huang T S (2010) Image classification using super-vector coding of local image descriptors. In: Computer Vision–ECCV 2010, pp 141–154

Download references

Acknowledgments

Part of this work was funded under research grant PN-III-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER. This material is based in part on work supported by the National Science Foundation (NSF) under grant number IIS-1251187.

Author information

Authors and Affiliations

University of Trento, Trento, Italy
Ionut C. Duta & Nicu Sebe
University of Edinburgh, Edinburgh, UK
Jasper R. R. Uijlings
University Politehnica of Bucharest, Bucharest, Romania
Bogdan Ionescu
University of Tokyo, Tokyo, Japan
Kiyoharu Aizawa
Carnegie Mellon University, Pittsburgh, PA, USA
Alexander G. Hauptmann

Authors

Ionut C. Duta
View author publications
You can also search for this author in PubMed Google Scholar
Jasper R. R. Uijlings
View author publications
You can also search for this author in PubMed Google Scholar
Bogdan Ionescu
View author publications
You can also search for this author in PubMed Google Scholar
Kiyoharu Aizawa
View author publications
You can also search for this author in PubMed Google Scholar
Alexander G. Hauptmann
View author publications
You can also search for this author in PubMed Google Scholar
Nicu Sebe
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ionut C. Duta.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Duta, I.C., R. Uijlings, J.R., Ionescu, B. et al. Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information. Multimed Tools Appl 76, 22445–22472 (2017). https://doi.org/10.1007/s11042-017-4795-6

Download citation

Received: 30 September 2016
Revised: 25 February 2017
Accepted: 02 May 2017
Published: 13 May 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s11042-017-4795-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information

Abstract

Access this article

Similar content being viewed by others

Automated video analysis for action recognition using descriptors derived from optical acceleration

A Robust and Efficient Video Representation for Action Recognition

Action recognition in depth videos using hierarchical gaussian descriptor

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Efficient human action recognition using histograms of motion gradients and VLAD with descriptor shape information

Abstract

Access this article

Similar content being viewed by others

Automated video analysis for action recognition using descriptors derived from optical acceleration

A Robust and Efficient Video Representation for Action Recognition

Action recognition in depth videos using hierarchical gaussian descriptor

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation