Abstract
Research on methods for the detection and recognition of events and actions in videos is receiving increasing attention from the scientific community because of its relevance to many applications, from semantic video indexing to intelligent video surveillance systems and advanced human-computer interaction interfaces. Event detection and recognition requires considering the temporal aspect of video, either at a low level with appropriate features, or at a higher level with models and classifiers that can represent time. In this paper we survey the field of event recognition, from interest point detectors and descriptors to event modelling techniques and knowledge management technologies. We provide an overview of the methods, categorising them according to video production methods and video domains, and according to the types of events and actions that are typical of these domains.
Acknowledgement
This work is partially supported by the EU IST IM3I Project (Contract FP7-222267).
Cite this article
Ballan, L., Bertini, M., Del Bimbo, A. et al. Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51, 279–302 (2011). https://doi.org/10.1007/s11042-010-0643-7
DOI: https://doi.org/10.1007/s11042-010-0643-7