Event detection and recognition for semantic annotation of video

Multimedia Tools and Applications

Abstract

Research on methods for detecting and recognising events and actions in video is receiving increasing attention from the scientific community because of its relevance to many applications, from semantic video indexing to intelligent video surveillance systems and advanced human-computer interaction interfaces. Event detection and recognition requires accounting for the temporal aspect of video, either at a low level with appropriate features or at a higher level with models and classifiers that can represent time. In this paper we survey the field of event recognition, from interest point detectors and descriptors to event modelling techniques and knowledge management technologies. We provide an overview of the methods, categorising them according to video production methods and video domains, and according to the types of events and actions that are typical of these domains.

Acknowledgement

This work is partially supported by the EU IST IM3I Project (Contract FP7-222267).

Author information

Corresponding author

Correspondence to Marco Bertini.

About this article

Cite this article

Ballan, L., Bertini, M., Del Bimbo, A. et al. Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51, 279–302 (2011). https://doi.org/10.1007/s11042-010-0643-7

  • DOI: https://doi.org/10.1007/s11042-010-0643-7
