
Neurocomputing

Volume 119, 7 November 2013, Pages 82-93

Video event description in scene context

https://doi.org/10.1016/j.neucom.2012.03.037

Abstract

Video event description is an important research topic in video analysis with a vast range of applications, such as visual surveillance, video retrieval, video annotation, video database indexing, and interactive systems. In this paper, we present a framework for automated video event description that fuses context knowledge into the processing pipeline to provide accurate and reliable event descriptions. The framework describes events and recognizes object activities through four components: object detection, classification, tracking, and semantic event description. Our contribution is to integrate contextual cues into these components to facilitate semantic video event description. Furthermore, in the tracking component, a novel adaptive shape kernel based mean shift tracking algorithm is proposed to improve tracking performance under object deformation and background clutter. Experiments that apply our video event description system to a real-world video demonstrate the efficiency of the system and the capability of the tracker for video event understanding.

Introduction

Video event understanding is an important research topic in video analysis with a vast range of applications, such as visual surveillance, video retrieval, video annotation, video database indexing, and interactive systems. A large number of approaches have focused on automatic video understanding. These works can be divided into four groups: text-based approaches, audio-based approaches, visual-based approaches, and those that combine text, audio, and visual features. Visual-based approaches use visual features from video sequences to translate low-level content into high-level semantic event descriptions, in order to understand the behaviors of physical objects in a scene and produce an event description.

Visual-based approaches to video event understanding belong to high-level computer vision, which depends on the results of low-level computer vision such as object detection, object classification, and object tracking. Each stage has specific problems that must be handled in order to provide accurate and reliable data to the next stage. In this article, we aim to build an automatic description system for human-motion-related events using video understanding techniques. It involves the detection, classification, tracking, and behavior recognition of humans and objects. We briefly survey the related work in the following subsection.

Visual-based video event understanding, i.e., the translation of low-level content in video sequences into high-level semantic event description, is a research topic that has received much interest in recent years. Many techniques have been developed for the applications of video event understanding, such as visual surveillance, video annotation, video retrieval, video database indexing, and others.

Automated video surveillance addresses the real-time observation of people and vehicles in some environment, including the description of their actions and interactions. A great deal of research has been devoted to object detection, tracking, and motion analysis. Haritaoglu et al. [34] design W4, a real-time visual surveillance system that uses monocular gray-scale video imagery to monitor multiple people in an outdoor environment; it can track multiple people through occlusion and describe actions such as people carrying objects. Hu et al. [35] present an overview of recent developments in visual surveillance within a general processing framework for visual surveillance systems. Jun-wei et al. [36] present an automatic traffic surveillance system that uses only one camera to detect lane-dividing lines, normalize vehicle size, eliminate shadows, resolve occlusion, and classify vehicles with a linearity feature. Another application of video event understanding is video annotation, which avoids the intensive labor cost of manual annotation and facilitates video retrieval. Tang et al. [32] propose a graph-based semi-supervised learning method named kernel linear neighborhood propagation (KLNP) for video annotation, which handles complex situations by combining the advantages of LNP and kernel methods. Tang et al. [33] present a correlative linear neighborhood propagation method that incorporates semantic concept correlation into graph-based semi-supervised learning to improve the performance of automated video annotation. Wang et al. [37] propose an optimized multi-graph-based semi-supervised learning method to tackle the insufficiency of training data and the curse of dimensionality. Video indexing and retrieval is another important application of video event understanding; it describes, stores, and organizes video information and helps people find videos. Naphide and Huang [38] propose a probabilistic framework for semantic video indexing that uses probabilistic multimedia objects to map low-level features to high-level semantics. Chen et al. [39] use shrinkage optimized directed information assessment (SODA) as a similarity measure to construct a framework for multimodal video indexing and retrieval. Hu et al. [40] present an overview of visual content-based video indexing and retrieval, covering shot boundary detection, key frame extraction, scene segmentation, and feature extraction.

As a major component of the automatic video event description system, object tracking largely determines the accuracy of the overall surveillance system. In the real world, tracking is made difficult by non-rigid object structures, object occlusion, multiple connected objects, low contrast against the background, object scale variation, and complex object motion.

Object tracking methods can be divided into three types, as suggested in [1]: point tracking, silhouette tracking, and kernel tracking. In point tracking [2], tracking is implemented by establishing correspondence between detected object points across frames. This approach is suitable for tracking small objects represented by a single point, but for larger objects represented by multiple points it suffers from misdetection and occlusion. In contrast to point tracking, silhouette-based methods provide an object shape description for tracking [3]. This approach is flexible and able to handle a variety of object shapes. Kernel tracking estimates object motion using a model region to represent the object; it is a robust approach that performs well under occlusion.

In the kernel tracking approach, the mean shift algorithm is an efficient, nonparametric method for seeking the nearest mode of a point sample distribution based on kernel density estimation [4], [5]. The algorithm is well known and widely applied in object tracking. It is effective and popular owing to its low computational cost, easy implementation, real-time response, and robust tracking performance. However, the traditional mean shift algorithm still encounters difficulties when the object itself undergoes shape deformation. To overcome this problem, arbitrarily shaped kernel methods have been proposed in [6], [7], [8], [9], [10]. Yilmaz [7] presents an asymmetric kernel mean shift algorithm that estimates object location, orientation, and scale by introducing an implicit level set function to reduce the estimation bias and improve the density estimation process; however, it relies on the model of the object shape in the first frame, and it does not perform well on non-rigid objects because the shape is held constant. Yi et al. [9] use the detected object mask to construct a kernel, which depends heavily on the detection result. Quast et al. [6] apply a GMM-SAMT algorithm to obtain an asymmetric shape-adapted kernel; it is similar to Yi's method [9] in that the shape of the kernel relies on the object segmentation result. Leichter et al. [10] propose an asymmetric kernel-based visual tracker that takes the target's color PDF into account to enhance robustness.
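
As a point of reference for the discussion above, the following is a minimal sketch of the classical symmetric-kernel mean shift tracker in the spirit of [4], [5], built from OpenCV's histogram back-projection and mean shift routines. The video path, the initial target window, and the hue-only target model are illustrative placeholders rather than details taken from this paper.

```python
# Classical mean shift tracking with a fixed, symmetric kernel (sketch).
import cv2
import numpy as np

cap = cv2.VideoCapture("input_video.avi")      # hypothetical input sequence
ok, frame = cap.read()
x, y, w, h = 300, 200, 80, 120                 # hypothetical initial target window

# Target model: a hue histogram over the initial window.
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Iterate mean shift until convergence (10 iterations or shift < 1 pixel).
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-project the target histogram to get a per-pixel likelihood map,
    # then let mean shift climb to the nearest mode of that map.
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    _, window = cv2.meanShift(back_proj, window, term_crit)
    x, y, w, h = window
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("mean shift tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
```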

The goal of human activity recognition is to automatically analyze ongoing activities in an unknown video. In recent years, a large amount of work has focused on human activity recognition. These methodologies can be classified into two categories: single-layered approaches and hierarchical approaches [11].

Single-layered approaches attempt to represent and recognize human activities directly from sequences of images, which makes them suitable for the recognition of gestures and actions. Depending on how human activities are modeled, single-layered approaches can be further divided into two types: space–time approaches and sequential approaches. Space–time approaches model a human activity as a particular 3-D volume in the space–time dimensions and compare such volumes to measure their similarity and recognize the activity. An activity can generally be represented by space–time volumes [12], trajectories [13], or local features [14]. Sequential approaches analyze sequences of features to recognize human activities and can be further classified into two categories: exemplar-based recognition approaches and state model-based recognition approaches. Exemplar-based sequential approaches recognize a human activity by comparing the sequence of feature vectors extracted from the video with a template sequence [15]. State model-based approaches represent a human activity as a model composed of a set of states; one statistical model is constructed for each activity, and the probability of each model is computed to measure the likelihood between the action model and the input image sequence [16].
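
For concreteness, the sketch below illustrates the exemplar-based sequential idea: an observed feature sequence is compared against stored template sequences with dynamic time warping and assigned the label of the closest exemplar. The feature dimensionality and the random placeholder data are assumptions made purely for illustration; they do not come from [15] or from this paper.

```python
# Exemplar-based sequential matching with dynamic time warping (sketch).
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW cost between two sequences of feature vectors, shapes (T_a, D) and (T_b, D)."""
    ta, tb = len(seq_a), len(seq_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[ta, tb]

# Hypothetical exemplars and input; a real system would use per-frame descriptors
# (e.g. silhouette or optical-flow features) instead of random vectors.
templates = {"walk": np.random.rand(40, 16), "wave": np.random.rand(30, 16)}
observed = np.random.rand(35, 16)
label = min(templates, key=lambda k: dtw_distance(observed, templates[k]))
print("recognized activity:", label)
```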

Hierarchical approaches use the recognition results of simpler activities to recognize high-level activities. They can be categorized into three groups: statistical approaches, syntactic approaches, and description-based approaches. Statistical approaches use multiple layers of statistical state-based models such as HMMs and DBNs to recognize activities with sequential structures [17]. Syntactic approaches model human activities as strings of symbols, where each symbol corresponds to an atomic-level action; they represent human activities as a set of production rules generating strings of atomic actions, and adopt parsing techniques from the field of programming languages to recognize human activities [18]. Description-based approaches represent a high-level human activity in terms of the simpler activities that compose it, describe their temporal, spatial, and logical relationships, and recognize the activity by searching for sub-events that satisfy the relations specified in its representation [19].

Context is critical for understanding human actions. It is defined as the union of interrelated conditions in which visual entities (e.g. objects, scenes) exist or occur, including the larger environmental knowledge such as the laws of biology and physics and common sense.

In computer vision, context has been used for the interpretation of static scenes [21], [22], [23], [24]. Jiang et al. [21] model scene and object contexts from only a few examples to improve the performance of human action retrieval. Rabinovich et al. [22] use semantic context as a post-processing step for an off-the-shelf discriminative object categorization model. Russell et al. [23] apply object information from a labeled image database to detect objects in scenes. Torralba [24] models the relationship between context and object properties based on the correlation between the statistics of low-level features across the entire scene and the objects it contains.

Extensive research has also been devoted to the use of context in human action recognition. Marszalek et al. [20] use context information automatically extracted from video scripts to improve action recognition. Gupta and Davis [25] present a graphical Bayesian model for modeling human–object interactions. Li and Li [26] propose an integrative model that learns to classify static images into complicated social events by interpreting the semantic components of the image. Moore et al. [27] exploit human motion and object context for action recognition and object classification by measuring object-based and action-based information from video. Ryoo and Aggarwal [28] integrate object recognition, motion estimation, and semantic-level recognition of high-level human–object interactions for the hierarchical recognition of human activities involving objects.

The work in this paper presents a video event understanding system that spans a broad range from low-level vision algorithms to high-level techniques for semantic video understanding. The general framework of the automated visual event description system applies semantic analysis to recognize object activities and describe events by integrating scene context, as shown in Fig. 1; it includes moving object detection, classification, tracking, action recognition, and semantic event description. Each step is integrated with the scene context to provide accurate and reliable data for the next step. The proposed system assumes that the video sequence to be processed is captured by a static camera. Our goal is to recognize object activities and describe events using any information available from the video to define a context in which the visual entities exist or occur.
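
As a rough illustration of the first stage of such a pipeline under the static-camera assumption, the sketch below detects moving objects with background subtraction and extracts their bounding boxes. The MOG2 background model, the video path, and the minimum blob area are stand-ins chosen for this example; the paper does not specify this particular detector.

```python
# Moving-object detection under a static camera via background subtraction (sketch).
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance_clip.avi")   # hypothetical input
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = bg_model.apply(frame)
    # Drop shadow labels (value 127) and small noise before extracting blobs.
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 400:              # hypothetical minimum blob area
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imshow("moving objects", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
```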

The first contribution of this paper is a video event description system that describes events and recognizes object activities through four components: object detection, classification, tracking, and semantic event description. This scheme provides an efficient strategy for describing video events from a monocular video using a semantic method, and it integrates contextual cues into every component to facilitate semantic video event description. The second contribution is an adaptive shape kernel based mean shift tracking algorithm. In contrast with the symmetric, constant kernel used in the traditional tracker, this tracker combines the traditional symmetric shape with the foreground shape and correspondingly constructs arbitrarily shaped kernels, integrated with color and gradient features, to describe the object's appearance. It can better adapt to changes in object shape, reducing the estimation error and improving the density estimation process. Experiments demonstrate that this tracker significantly outperforms the traditional tracker, especially when target shape deformation, target occlusion, and background clutter occur.
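
The paper's exact kernel construction is not reproduced here, but the following sketch shows one plausible way to blend a conventional symmetric kernel with the target's foreground silhouette so that the kernel weights adapt to the object's shape. The blend weight and the synthetic mask are assumptions made purely for illustration, not the paper's formulation.

```python
# One possible adaptive shape kernel: symmetric kernel blended with a foreground mask (sketch).
import numpy as np

def symmetric_kernel(h, w):
    """Epanechnikov-style elliptical kernel over an h x w window, values in [0, 1]."""
    ys, xs = np.mgrid[0:h, 0:w]
    ny = (ys - (h - 1) / 2.0) / (h / 2.0)
    nx = (xs - (w - 1) / 2.0) / (w / 2.0)
    r2 = nx ** 2 + ny ** 2
    return np.clip(1.0 - r2, 0.0, None)

def adaptive_shape_kernel(fg_mask, alpha=0.5):
    """Blend the symmetric kernel with a binary foreground mask (h x w, {0, 1})."""
    h, w = fg_mask.shape
    k = (1.0 - alpha) * symmetric_kernel(h, w) + alpha * fg_mask.astype(np.float64)
    return k / k.sum()          # normalize so the kernel weights sum to one

# Usage: weight the pixels inside the tracking window when building the
# color/gradient histogram of the target model.
fg_mask = np.zeros((120, 80)); fg_mask[20:110, 15:65] = 1   # hypothetical segmented silhouette
kernel = adaptive_shape_kernel(fg_mask, alpha=0.6)
```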

The remainder of this paper is organized as follows. Context is discussed in Section 2. Section 3 introduces the process of joint action recognition and scene context. Section 4 presents the semantic analysis used to describe events. Section 5 presents the experimental studies. Finally, Section 6 summarizes the main contributions of the paper together with a discussion of open issues and future research directions.

Section snippets

Context

Context is critical for video event understanding. In our case, both the background scene context and the object-interaction context are explored for understanding human activities. Generally, contextual information can be obtained by manual labeling or by automatic labeling based on concept learning. Automatic labeling relies on concepts learned by a concept module trained on a dataset; it is suitable when sufficient training data or large amounts of unlabeled data are available. As our system works

Object preprocessing

In this section, we describe the detailed preprocessing of objects and humans, including detection, classification, and tracking. The accuracy of this part largely affects the accuracy of the overall video event description system. In the real world, the low-level process is complicated by complex object structures, object occlusion, complex object motion, and so on. The low-level process in our system is implemented using the previous three steps as shown

Event semantic description

The event semantic description is the focus of video understanding; it associates visual features with natural language verbs and symbols to build the event semantics. Following our previous work, we use a semantic level representation to recognize object activities and describe events.
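
To make the semantic step concrete, the sketch below estimates simple physical state parameters (position, size, velocity) from a tracked bounding box and maps them to verb-like labels. The speed threshold and the tiny verb vocabulary are hypothetical illustrations, not the paper's actual rule set.

```python
# From bounding-box tracks to simple verb-like event descriptions (sketch).
import numpy as np

def state_parameters(track, fps=25.0):
    """track: list of (x, y, w, h) boxes, one per frame. Returns last position, size, velocity."""
    boxes = np.asarray(track, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0    # box centers (cx, cy)
    velocity = (centers[-1] - centers[-2]) * fps   # pixels per second, last step
    return centers[-1], boxes[-1, 2:], velocity

def describe(subject, track, fps=25.0, move_thresh=20.0):
    _, _, v = state_parameters(track, fps)
    speed = float(np.hypot(*v))
    verb = "moves" if speed > move_thresh else "stands still"
    return f"{subject} {verb} (speed {speed:.1f} px/s)"

track = [(100, 200, 40, 90), (104, 200, 40, 90), (109, 201, 40, 90)]  # hypothetical person track
print(describe("person_1", track))
```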

In our previous work, we use a bounding box to track the object, so some physical state parameters of objects can be estimated: position (x, y), size (h, w), and velocity (vx, vy). The position and size

Experiments

To illustrate the performance of the proposed system, we apply our method to the i-Lids surveillance video database [41], which was recorded at a subway station in a cluttered environment under complex conditions, including occlusion and scale changes of the tracked objects. We take partial video sequences from the AVSS_AB_Easy_Divx video to illustrate our framework. The proposed system converted the video into sequences of image frames with 576×720 pixel resolution, obtained at a rate of 25 frames

Conclusion

We have presented a framework for automated video event description, which includes moving object detection, classification, tracking, action recognition, and semantic event description. The presented system fuses context knowledge into every stage to provide accurate and reliable data and finally deduce the event description. The whole process spans a broad range from low level vision algorithms to high level techniques used for semantic

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Nos. 61003102 & 61272223).


References (41)

  • A. Yilmaz et al., Object tracking: a survey, ACM Comput. Surv. (2006)
  • C.J. Veenman et al., Resolving motion correspondence for densely moving points, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
  • A. Yilmaz et al., Contour based object tracking with occlusion handling in video acquired using mobile cameras, IEEE Trans. Pattern Anal. Mach. Intell. (2004)
  • D. Comaniciu et al., Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. (2003)
  • D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: Proceedings of the IEEE...
  • K. Quast, A. Kaup, Shape adaptive mean shift object tracking using Gaussian mixture models, in: Proceedings of the 11th...
  • A. Yilmaz, Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection, in:...
  • A. Yilmaz, Kernel based object tracking using asymmetric kernels with adaptive scale and orientation selection, Mach. Vision Appl. J. (2011)
  • K.M. Yi, H.S. Ahn, J.Y. Choi, Orientation and scale invariant mean shift using object mask-based kernel, in:...
  • I. Leichter et al., Tracking by affine kernel transformations using color and boundary cues, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • J.K. Aggarwal et al., Human activity analysis: a review, ACM Comput. Surv. (2011)
  • Y. Ke, R. Sukthankar, M. Hebert, Spatio-temporal shape and flow correlation for action recognition, in: Proceedings of...
  • Y. Sheikh, M. Shah, Exploring the space of an action, in: Proceedings of the Conference on Computer Vision, 2005, pp....
  • S.F. Wong, T.K. Kim, R. Cipolla, Learning motion categories using both semantic and structural information, in:...
  • A. Veeraraghavan, R. Chellappa, A.K. Roy-Chowdhury, The function space of an activity, in: Proceedings of the IEEE...
  • A.F. Bobick et al., A state-based approach to the representation and recognition of gesture, IEEE Trans. Pattern Anal. Mach. Intell. (1997)
  • N. Oliver, E. Horvitz, A. Garg, Layered representations for human activity recognition, in: Proceedings of the 4th IEEE...
  • Y.A. Ivanov et al., Recognition of visual activities and interactions by stochastic parsing, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • M.S. Ryoo et al., Semantic representation and recognition of continued and recursive human activities, Int. J. Comput. Vision (2009)
  • M. Marszalek, I. Laptev, C. Schmid, Actions in context, in: Proceedings of the International Conference on Computer...

Chunmei Liu works in the Department of Computer Science and Technology, Tongji University, Shanghai, China. She received her Ph.D. and her M.S. from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2006. She worked as a visiting scholar at the Computer & Vision Research Center at the University of Texas at Austin between October 2010 and October 2011. Her interests include human activity recognition, object detection, surveillance, semantic recognition, and document processing.

Changbo Hu is a visiting researcher at the Computer & Vision Research Center, The University of Texas at Austin, working on face and facial expression recognition, human body tracking, human activity analysis, and video surveillance. He obtained his Ph.D. from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2001. Before joining CVRC, he was a project scientist at the Robotics Institute, Carnegie Mellon University, and a post-doctoral researcher in the FourEyes Lab, University of California, Santa Barbara. He is a senior member of the IEEE.

Qingshan Liu is a professor at Nanjing University of Information Science & Technology. He received his Ph.D. from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2003, and his M.S. from South East University in 2000. Before joining Nanjing University of Information Science & Technology, he worked as an assistant research professor at Rutgers University and an associate professor at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. He worked as an associate researcher at the Multimedia Laboratory of the Chinese University of Hong Kong between June 2004 and April 2005. He received the president scholarship of the Chinese Academy of Sciences in 2003. His research interests include image and vision analysis, including face image analysis, graph- and hypergraph-based image and video understanding, medical image analysis, and event-based video analysis. He has published more than 80 papers in journals and conferences including IEEE Transactions on PAMI, ICCV, and CVPR. He is an editorial board member of Neurocomputing and the Journal of Advances in Multimedia, and a guest editor of IEEE Transactions on Multimedia, Computer Vision & Image Understanding, and Pattern Recognition Letters. He is a senior member of the IEEE.

J.K. Aggarwal is on the faculty of The University of Texas at Austin College of Engineering and is currently a Cullen Professor of Electrical and Computer Engineering and Director of the Computer and Vision Research Center. His research interests include computer vision, pattern recognition, and image processing focusing on human motion. A Fellow of the IEEE (1976), IAPR (1998), and AAAS (2005), he received the Senior Research Award of the American Society for Engineering Education in 1992, the 1996 Technical Achievement Award of the IEEE Computer Society, and the graduate teaching award at The University of Texas at Austin in 1992. More recently, he received the 2004 K. S. Fu Prize of the International Association for Pattern Recognition, the 2005 Kirchmayer Graduate Teaching Award of the IEEE, and the 2007 Okawa Prize of the Okawa Foundation of Japan. He is a Life Fellow of the IEEE and a Golden Core member of the IEEE Computer Society. He has authored and edited a number of books, chapters, conference proceedings, and papers.
