iSurveillance: Intelligent framework for multiple events detection in surveillance videos

https://doi.org/10.1016/j.eswa.2014.02.003

Highlights

  • It is essential to automate video surveillance systems because of human limitations.

  • Based on the principle of compositionality, this paper proposes iSurveillance.

  • Our work shows good results in detecting multiple events under a unified framework.

  • Our work contrasts with prior methods, which are tailor-made and designed to work in a specific area.

  • Future work will focus on new domain knowledge and datasets, and on improving the complexity of the variables.

Abstract

Research in video surveillance is gaining popularity because of its widespread applications and social impact. In this paper, we present an intelligent framework for detecting multiple events in surveillance videos. Based on the principle of compositionality, we modularize the surveillance problem into a set of variables comprising regions of interest, classes (e.g. human, vehicle), attributes (e.g. speed, locality) and a set of notions (i.e. rules) associated with each attribute, to construct a knowledge-based understanding of the environment. The final output of the reasoning process, which combines the definition domains of the various variables, allows a broader and integrated understanding of complex patterns of activity in the scene. This is in contrast to state-of-the-art solutions, which can perform only a single task at a time. Experimental results on both public and real-time datasets demonstrate the effectiveness and robustness of the proposed framework in detecting multiple events in surveillance videos.

Introduction

Video surveillance systems are rapidly being deployed in public spaces to strengthen public safety. This is motivated by the availability of powerful computing hardware at lower cost, the demand for better public safety in society, and advances in technology and tools. While manual surveillance by human operators is an ideal solution, there is a pressing need to automate some aspects of real-time surveillance because of the large number of channels that operators must closely observe, not to mention human fatigue (Bruckner et al., 2012, Celik and Kusetogullari, 2010, Chacon-Murguia and Gonzalez-Duarte, 2012, Fookes et al., 2010).

There have been considerable efforts in both industry and academia to develop algorithms and models for surveillance systems (Albusac et al., 2011a, Albusac et al., 2014, Albusac et al., 2011b, Chan and Liu, 2009, Chen et al., 2011, Draganjac et al., 2008, Liu et al., 2010). These systems are commonly designed for specific video surveillance applications that serve social welfare and public safety, including traffic monitoring, loitering detection and intrusion detection. Research under the DARPA Video Surveillance and Monitoring (VSAM) project (Collins et al., 2000) developed an automated video understanding technology that enables a single human operator to monitor activities over a complex area using a distributed network of active video sensors. Kettnaker (2003) proposed a Hidden Markov Model that incorporates time information to detect unauthorized intrusion into a personnel room. Castro, Delgado, Medina, and Ruiz-Lozano (2011) integrated information obtained from multiple sensors to detect intruders. A vision-based method (Bird, Masoud, Papanikolopoulos, & Isaacs, 2005) was developed to automatically detect individuals loitering at inner-city bus stops. According to Bird et al. (2005), loitering in public areas is of interest in the surveillance domain because it is often related to drug-dealing activity. In their later work (Bird et al., 2006), they proposed an abandoned object detection algorithm that discriminates between abandoned objects and stationary humans. Lv, Song, Wu, Singh, and Nevatia (2006) presented a Bayesian framework to infer left-luggage events; they evaluated it on the PETS 2006 dataset and reported satisfactory results. In another related work, Liu, Lee, and Lin (2010a) applied a k-NN classifier to distinguish fall postures from normal standing postures for fall event detection. Their method employed a statistical scheme to reduce the noise from upper limb activities. Meanwhile, Olivieri, Gómez Conde, and Vila Sobrino (2012) proposed the Motion Vector Flow Instance (MVFI) to detect fall events.

While there are numerous solutions for proactive video surveillance, most are designed to solve a particular problem in a specific scenario. Detecting multiple events in different scenarios therefore requires several such systems acting separately. For example, a system that performs loitering detection and/or abnormal trajectory detection in a given scene is based on two separate modules that work independently. Such systems are usually not flexible or general enough to detect different events at the same time. In summary, most of the aforementioned solutions are tailor-made and designed to work well under a specific condition.

In contrast to existing solutions, this paper proposes a novel framework that is able to detect multiple events, in different regions of interest (ROI) of a scene, at a particular time. The advance from detecting a single event to detecting multiple events provides a broader degree of scene understanding in automated video surveillance. This is critical in real-world scenarios, where multiple different events may take place in a scene at the same time. For example, a loitering event may well occur at the same time as an abandoned object or piece of luggage appears in a given scene.

Work closely related to ours includes Khoudour et al. (1997), Velastin et al. (2005), Fuentes and Velastin (2004), Schwerdt et al. (2005), Black et al. (2005) and Fernández-Caballero et al. (2012). CROMATICA (Crowd Monitoring with Telematic and Communication Assistance; Khoudour et al., 1997) combined video-analysis technologies and wireless data transfer to improve surveillance in public transport systems. Their method deals with multiple events such as intrusion and unattended objects, but is limited to indoor environments. PRISMATICA (Pro-active Integrated systems for Security Management by Technological Institutional and Communication Assistance; Velastin et al., 2005) is a distributed system with automated event detection to improve safety in public transport. The system components were tested in a real-world environment and achieved satisfactory results. Although their method deals with a certain degree of crowding in the scene, it is also limited to indoor environments. Fuentes and Velastin (2004) proposed a framework that uses low-level descriptions, such as the centroid position of blobs, to infer events. An extension of this work to include high-level as well as low-level descriptions was discussed in detail in Schwerdt et al. (2005). This project, also known as the EAGLE project, shows satisfactory evaluation results on the Challenge of Real-time Event Detection Solutions (CREDS) dataset. Similarly, Black et al. (2005) evaluated their proposed real-time surveillance system for metropolitan railways in the United Kingdom and Italy using the CREDS dataset. More recently, Fernández-Caballero et al. (2012) atomised low-level human actions into smaller components and used these components as grammars to infer an event. Their method copes well with crowded scenes but is likewise limited to indoor environments. In summary, there is still an open challenge for a solution that deals with multiple events and provides flexibility in handling different environments (both indoor and outdoor) in public surveillance.

In this paper, we propose a novel framework that is able to detect multiple events in different ROI of a scene. The main contribution is that the proposed compositional framework provides the flexibility to deal with different environments, both indoor and outdoor, for a broader degree of scene understanding with minimal fine-tuning. Furthermore, it alleviates the need to perform redundant low-level processing tasks across different events. This is made possible by applying the theory of compositionality in the domain of knowledge-based understanding. By adopting the principle of compositionality in the video surveillance domain, the detection of multiple events in multiple scenes is simplified and optimized. The key idea is to conceptually decompose the information obtained from a given scene into several intermediate levels of abstraction. These low-level descriptions are then integrated and combined using a basic set of rule-packages, which discriminate between different abnormal events, to build a complete knowledge of the given scene. In order to represent the contextual information of the scene using the proposed framework, this work investigates two main research questions: (i) how to decompose and represent the modularized entities of the knowledge-based system in the video surveillance domain, and (ii) how to apply the basic set of rule-packages to detect different abnormal events in a given scene.
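The decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ROI`, `Observation`, `RULES` and `detect_events`, the thresholds, and the two example rules are all hypothetical and chosen only for illustration. The sketch shows the general idea of computing low-level descriptions (class, position, speed, dwell time) once per object and letting a small set of rules, applied per region of interest, report several event types in a single pass.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ROI:
    """A rectangular region of interest (hypothetical representation)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    restricted: bool = False

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class Observation:
    """Low-level description of one tracked object, computed once."""
    track_id: int
    cls: str           # class, e.g. "human" or "vehicle"
    x: float           # position, used to resolve locality (which ROI)
    y: float
    speed: float       # attribute; units are an assumption of this sketch
    dwell_time: float  # attribute; seconds spent in the current ROI


Rule = Callable[[Observation, ROI], bool]

# Hypothetical thresholds, not taken from the paper.
LOITER_SECONDS = 60.0
LOITER_MAX_SPEED = 0.5

# Each rule combines class, attributes and ROI context into one event notion.
RULES: Dict[str, Rule] = {
    "loitering": lambda o, r: (o.cls == "human"
                               and o.dwell_time > LOITER_SECONDS
                               and o.speed < LOITER_MAX_SPEED),
    "intrusion": lambda o, r: r.restricted and o.cls == "human",
}


def detect_events(observations: List[Observation],
                  rois: List[ROI]) -> List[Tuple[str, str, int]]:
    """Return (event, roi_name, track_id) for every rule that fires."""
    events = []
    for obs in observations:
        for roi in rois:
            if not roi.contains(obs.x, obs.y):
                continue
            for event_name, rule in RULES.items():
                if rule(obs, roi):
                    events.append((event_name, roi.name, obs.track_id))
    return events


rois = [ROI("platform", 0, 0, 10, 10),
        ROI("staff_room", 10.5, 0, 20, 10, restricted=True)]
observations = [Observation(1, "human", 5, 5, 0.2, 90.0),
                Observation(2, "human", 15, 5, 1.5, 3.0),
                Observation(3, "vehicle", 5, 6, 0.0, 120.0)]
print(detect_events(observations, rois))
```

In this sketch, adding a new event type means adding one entry to `RULES`; the low-level descriptions are shared by all rules, which mirrors the stated goal of avoiding redundant low-level processing across events.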

The rest of this paper is organized as follows. Section 2 describes in detail the framework used to construct the proposed knowledge-based system, while Section 3 presents the concept of compositionality from the perspective of video surveillance applications. Section 4 then discusses a model application that detects multiple abnormal events using the proposed framework. The experimental results are presented and discussed in Section 5, where we evaluate the capability of the proposed compositional model against state-of-the-art solutions. The final section concludes this study and provides insights and directions for future work.

Section snippets

Proposed compositional-based framework

This paper proposes an intelligent video surveillance framework to detect multiple events in different ROI of a scene. The framework adapts the knowledge-based architecture, which is common in traditional artificial intelligence (AI) systems, to the video surveillance domain for a broader and integrated analysis of real-world surveillance scenarios. Thus far, constructing the knowledge base for video understanding is still an open issue due to the large variety and complexity of real-world

Proposed framework in terms of the principle of compositionality

The general architecture of the proposed framework for multiple events detection is illustrated in Fig. 3. In general, our proposed method categorizes the process of detecting multiple events into three broad levels comprising (i) the Sensory Level (SL), which is the data acquisition process, (ii) the Analysis and Reasoning Level (ARL), in which the knowledge of the environment is first constructed, followed by analysis and reasoning using the principle of compositionality (the details are

Model application

We discuss the model application using a public surveillance scenario, for a better understanding of the proposed compositional model. Here, we propose a model to detect abnormal events taking place in different regions of a given scene, where the abnormal events comprise multiple events such as loitering and intrusion. Since the proposed model categorizes an input environment into multiple ROI, it is best suited for wide-area surveillance spaces which include the airport and

Experimental results

The main goal of this experiment is to evaluate the efficiency of the proposed model in detecting different scenarios of abnormal events. Each of the five recommended abnormal events discussed in the model application (Section 4) is tested on 20 datasets (with one or two events in each dataset). Each dataset comprises a combination of video sequences obtained from standard datasets such as the PETS dataset, CANTATA dataset,

Conclusion

This paper presents a framework for multiple event detection in surveillance videos. Based on the principle of compositionality, we modularize the surveillance problem into a set of sub-problems to allow flexibility and ease of fine-tuning when this framework is later extended to include other real-time events. In order to demonstrate the functionality of the knowledge constructed based on the proposed concept of compositionality, we perform comprehensive experiments on 100 video sequences

Acknowledgement

This work was supported by the University of Malaya HIR under Grant UM.C/625/1/HIR/MOHE/FCSIT/08, B000008; and Mei Kuan Lim is sponsored by the Yayasan Khazanah Malaysia.

References (37)

  • S.M. Yoon et al. Human action recognition based on skeleton splitting. Expert Systems with Applications (2013)

  • J. Albusac et al. A scalable approach based on normality components for intelligent surveillance

  • N. Bird et al. Real time, online detection of abandoned objects in public areas

  • N.D. Bird et al. Detection of loitering individuals in public transportation areas. IEEE Transactions on Intelligent Transportation Systems (2005)

  • J. Black et al. A real time surveillance system for metropolitan railways

  • D. Bruckner et al. Hierarchical semantic processing architecture for smart sensors in surveillance networks. IEEE Transactions on Industrial Informatics (2012)

  • T. Celik et al. Solar-powered automated road surveillance system for speed violation detection. IEEE Transactions on Industrial Electronics (2010)

  • M.I. Chacon-Murguia et al. An adaptive neural-fuzzy approach for object detection in dynamic backgrounds for surveillance systems. IEEE Transactions on Industrial Electronics (2012)

    Mei Kuan Lim and Szeling Tang contributed equally to this paper.
