A modified vector of locally aggregated descriptors approach for fast video classification

Mironică, Ionuţ; Duţă, Ionuţ Cosmin; Ionescu, Bogdan; Sebe, Nicu

doi:10.1007/s11042-015-2819-7

Published: 21 August 2015

Volume 75, pages 9045–9072 (2016)

Multimedia Tools and Applications

Ionuţ Mironică¹,
Ionuţ Cosmin Duţă²,
Bogdan Ionescu¹ &
Nicu Sebe²

Abstract

To reduce computational complexity, most video classification approaches represent video data at the frame level. In this paper we investigate a novel perspective that combines frame features to create a global descriptor. The main contributions are: (i) a fast algorithm to densely extract global frame features, which are easier and faster to compute than spatio-temporal local features; (ii) replacing the traditional k-means visual vocabulary from Bag-of-Words with a Random Forest approach, allowing a significant speedup; (iii) the use of a modified Vector of Locally Aggregated Descriptors (VLAD) combined with a Fisher kernel approach that replaces the classic Bag-of-Words approach, allowing us to achieve high accuracy. By doing so, the proposed approach combines the frame-based features, effectively capturing video content variation in time. We show that our framework is highly general and does not depend on a particular type of descriptor. Experiments performed on four different scenarios (movie genre classification, human action recognition, daily activity recognition and violence scene classification) show the superiority of the proposed approach compared to the state of the art.
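
To make the aggregation idea concrete, the following is a minimal sketch of the standard VLAD encoding applied to frame-level descriptors. It is not the modified VLAD of the paper, whose details are not given in the abstract. The function name `vlad_encode`, the toy data, and the use of fixed centroids with nearest-centroid assignment are all assumptions for illustration; the paper replaces the k-means vocabulary with a Random Forest, which is not shown here.

```python
import math


def vlad_encode(frames, centroids):
    """Standard VLAD (illustrative, not the paper's modified variant).

    frames:    list of per-frame descriptors, each a list of d floats
    centroids: list of k vocabulary centroids, each a list of d floats
    Returns one L2-normalised video descriptor of length k * d.
    """
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]

    for x in frames:
        # Hard-assign the frame descriptor to its nearest centroid
        # (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
        )
        # Accumulate the residual between the descriptor and that centroid.
        for j in range(d):
            residuals[nearest][j] += x[j] - centroids[nearest][j]

    # Concatenate the per-centroid residual sums and L2-normalise.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: three 2-D "frame descriptors" and a two-word vocabulary.
centroids = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
video_descriptor = vlad_encode(frames, centroids)
```

The result is a single fixed-length vector per video regardless of the number of frames, which is what allows a conventional classifier to be trained on top of it.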

Acknowledgments

The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132395.

Author information

Authors and Affiliations

LAPI, University Politehnica of Bucharest, Bucharest, 061071, Romania
Ionuţ Mironică & Bogdan Ionescu
DISI, University of Trento, Trento, Italy
Ionuţ Cosmin Duţă & Nicu Sebe

Corresponding author

Correspondence to Ionuţ Mironică.

About this article

Cite this article

Mironică, I., Duţă, I.C., Ionescu, B. et al. A modified vector of locally aggregated descriptors approach for fast video classification. Multimed Tools Appl 75, 9045–9072 (2016). https://doi.org/10.1007/s11042-015-2819-7

Received: 03 December 2014
Revised: 13 June 2015
Accepted: 09 July 2015
Published: 21 August 2015
Issue Date: August 2016
DOI: https://doi.org/10.1007/s11042-015-2819-7
