A modified vector of locally aggregated descriptors approach for fast video classification

Multimedia Tools and Applications

Abstract

To reduce computational complexity, most video classification approaches represent video data at the frame level. In this paper we investigate a novel perspective that combines frame features into a global descriptor. The main contributions are: (i) a fast algorithm that densely extracts global frame features, which are easier and faster to compute than spatio-temporal local features; (ii) replacement of the traditional k-means visual vocabulary of Bag-of-Words with a Random Forest approach, which yields a significant speedup; and (iii) a modified Vector of Locally Aggregated Descriptors (VLAD) combined with a Fisher kernel approach, which replaces the classic Bag-of-Words representation and allows us to achieve high accuracy. In this way, the proposed approach combines frame-based features so that they effectively capture the variation of video content over time. We show that the framework is highly general and does not depend on a particular type of descriptor. Experiments on four scenarios (movie genre classification, human action recognition, daily activity recognition, and violent scene classification) show the superiority of the proposed approach over the state of the art.
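For orientation, the aggregation step that VLAD builds on can be sketched as follows. This is a minimal illustration of the standard VLAD encoding applied to per-frame descriptors, not the modified variant proposed in the paper, whose details are not given in the abstract. The function name `vlad`, the hard nearest-center assignment, and the signed square-root normalization are assumptions chosen for the example; the codebook `centers` is taken as given, whereas the paper builds its vocabulary with a Random Forest.

```python
import math


def vlad(frames, centers):
    """Standard VLAD sketch (hypothetical helper, not the paper's modified version).

    frames:  list of per-frame descriptors, each a list of d floats
    centers: list of k codebook vectors, each a list of d floats
    Returns a flat list of k * d floats.
    """
    k, d = len(centers), len(centers[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in frames:
        # Hard-assign the frame to its nearest center (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
        )
        # Accumulate the residual between the frame and its assigned center.
        for i in range(d):
            acc[j][i] += x[i] - centers[j][i]
    flat = [v for row in acc for v in row]
    # Assumed post-processing: signed square-root, then L2 normalization.
    flat = [math.copysign(math.sqrt(abs(v)), v) for v in flat]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

Calling `vlad(frames, centers)` on all frames of one video yields a single fixed-length vector regardless of the number of frames, which is what allows a standard classifier to operate on whole videos.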




Acknowledgments

The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132395.

Author information

Corresponding author

Correspondence to Ionuţ Mironică.

About this article

Cite this article

Mironică, I., Duţă, I.C., Ionescu, B. et al. A modified vector of locally aggregated descriptors approach for fast video classification. Multimed Tools Appl 75, 9045–9072 (2016). https://doi.org/10.1007/s11042-015-2819-7

  • DOI: https://doi.org/10.1007/s11042-015-2819-7
