Elsevier

Pattern Recognition Letters

Volume 29, Issue 8, 1 June 2008, Pages 1117-1135

On the correspondence between objects and events for the diagnosis of situations in visual surveillance tasks

https://doi.org/10.1016/j.patrec.2007.10.020

Abstract

A key problem in visual surveillance systems (VSS) is to find an effective procedure for linking the geometric descriptions of a scene at the object level with the corresponding descriptions, at the activity level, of the agents intervening in that scene. In this work, we explore a constructivist approach that uses the usual Artificial Intelligence (AI) techniques and methods to establish correspondences between the entities and relations of the ontologies at these two levels. The proposal is exemplified on a real interior scenario, using images from a single fixed camera, where the purpose of surveillance is to perform a preventive diagnosis of the activity of abandoning a potentially dangerous object in a sensitive area. The work stresses: (1) anchoring the object-level labels in the result of analytical processes on blobs, and (2) specifying the contextual knowledge that has to be injected to link the activities, as described by a human surveillance expert, with the objects, as they are labelled by the same expert from geometric descriptions. The work is set within the context of the 50th anniversary of AI and the leading theories on human visual perception.

Introduction

We have just celebrated the 50th anniversary of Artificial Intelligence and we still have not solved some of its fundamental problems. These unsolved problems are always related to the enormous semantic gap that exists between neurophysiological signals and cognition; in other words, between low-level physical and geometric descriptions and the emergence of the entities and relations characteristic of the psychic world of perception, reasoning and natural language. An example of this semantic gap is the difficulty of passing from image pattern recognition to scene interpretation. Inadequate solutions to this problem decisively contribute to the stagnation of other problems of apparently greater granularity, such as the diagnosis of situations in surveillance tasks to detect potential terrorist attacks in especially sensitive scenarios (airports, trains, the underground, etc.).

In these scenarios, solutions that analyse images at the blob and object levels give appropriate support to monitoring and dynamic selective attention. The difficulty arises when we want to pass from simple monitoring to situation diagnosis, because here the static and dynamic roles of the different inferences into which a specific method decomposes the task (hypotheses, claims, etc.) are entities and relations of an ontology at the activity level. Diagnosis requires appropriate techniques and knowledge to interpret image sequences, and object-level information is not sufficient: it has to be complemented with the knowledge that a human security guard has of each specific scene, including hypotheses about the purposes and intentions of the different actors participating in it.

In this work, after reviewing the state of the art, we establish correspondences between descriptions at the object level and the activity level in a simple, real interior scene that is nevertheless significant for the surveillance task: the aim is to detect potentially dangerous abandoned objects (rucksacks, baggage, etc.) in areas of special interest, in order to prevent attacks. The work stresses using the same ontological structure at both levels (object and activity) and describing explicitly and declaratively the contextual knowledge that has to be injected at the object level to obtain the entities and relations of the activity level necessary for the diagnosis task in this scenario.

The rest of the work is structured as follows. In Section 2 we review the historical development of the semantic-gap problem in image understanding, from the foundation stage of neurocybernetics to date. We start with the works of Pitts and McCulloch (1947) and Lettvin et al. (1959), which are the forerunners of the two alternative leading theories in neuroscience (Marr, 1982; Gibson, 1979). We resort to the reviews of Huang (1992) and Chellappa and Kashyap (1992) to link the neurocybernetic stage with the 1980s. We complete the review of the last 20 years in terms of scene type, representation techniques and use of knowledge, the aim of surveillance, and the diagnosis task. In Section 3 we introduce a general ontology of the surveillance task; the rest of the work focuses on interior spaces and the early detection of potentially dangerous situations. We also describe in this section the scenario used to illustrate the possibilities and limitations of establishing correspondences between the ontologies of the object and activity levels. Section 4 presents an ontology for the object level and Section 5 another for the activity level. The linking of the object-level entities and relations with the roles and inferences of the diagnosis task is made explicit, as is the additional knowledge that had to be injected because it was not in the objects. We conclude with some reflections on the difficulty of solving the semantic-gap problem in complex scenarios and on possible solutions.


Historical view

The problem of linking the physical and geometric stage of human vision with the final cognitive interpretation of its meaning is old, both in neuroscience and AI, and we still have not found a reasonably complete and satisfactory solution. It is clear that we understand what we see beyond the 2D-geometric image that is projected onto the retina and also beyond the retinotopic projection of the transformations of this image on the primary visual cortex. However, we do not know how to integrate

Surveillance task

Surveillance is a multidisciplinary task affecting an increasing number of scenarios, services and customers. Accordingly, we can differentiate between two types of surveillance: forensic surveillance, where the aim is to detect an anomalous situation and analyse the reasons for this, and predictive surveillance, where a pre-alarm signal is detected and the possible consequences of this are analysed.

Both instances imply observing vulnerable areas considered to be of economic, social or

Object level

Following the generic ontology described in the previous section, in this section the object-level ontology is developed. Within the Task–Method hierarchy, at this level, the tasks of identification, tracking, calculation of 3D coordinates and situation recognition are distinguished (Fig. 2). In turn, each of these tasks can be carried out with different methods. Among these tasks, we focus on the recognition of human-related situations because of its importance for defining the states and events used
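Although the snippet is truncated, the Task–Method decomposition it names can be sketched as a simple data structure. The task names follow the text (identification, tracking, 3D-coordinate calculation, situation recognition); the concrete candidate methods listed below are illustrative assumptions, not the paper's actual choices.

```python
# Illustrative sketch of a Task–Method hierarchy at the object level.
# Task names follow the paper; the alternative methods are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    methods: list = field(default_factory=list)   # alternative ways to solve it
    subtasks: list = field(default_factory=list)  # a composite task decomposes further

object_level = Task("object-level analysis", subtasks=[
    Task("identification", methods=["blob classification", "template matching"]),
    Task("tracking", methods=["Kalman filter", "overlap-based matching"]),
    Task("3D-coordinate calculation", methods=["ground-plane homography"]),
    Task("situation recognition", methods=["state/event labelling"]),
])

def describe(task, depth=0):
    """Flatten the hierarchy into indented one-line summaries."""
    lines = ["  " * depth + f"{task.name}: {', '.join(task.methods) or '(composite)'}"]
    for sub in task.subtasks:
        lines.extend(describe(sub, depth + 1))
    return lines

print("\n".join(describe(object_level)))
```

The point of the structure is only that each task node carries *several* candidate methods, mirroring the paper's claim that each task "can be done with different methods".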

Activity level

At this level, the events used for describing the scene at the abstraction level appropriate for the task are specified, i.e., the activity carried out by the objects of interest. The object-level labels are not sufficient: activities/situations are spatio-temporal compositions of those labels together with injected domain knowledge (context, scenario, task). Different authors distinguish between the emerging process of event composition and an active search process, which requires selective visual attention (Tsotsos et
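As a concrete, purely illustrative reading of this idea, the sketch below composes object-level labels (tracked positions of a person and an object) into the activity-level event "object abandoned in a sensitive zone". The zone, the separation distance and the dwell time play the role of injected contextual knowledge; all thresholds and predicates are assumptions for the scenario, not the paper's actual rule base.

```python
# Hypothetical spatio-temporal composition of object-level states into an
# activity-level event ("object abandoned in a sensitive zone"). The zone,
# distance and duration thresholds are injected contextual knowledge.

from dataclasses import dataclass
from math import hypot

@dataclass
class State:                 # one object-level observation per frame
    t: float                 # timestamp (seconds)
    person_xy: tuple         # tracked person position
    object_xy: tuple         # tracked object (e.g. rucksack) position

SENSITIVE_ZONE = ((0.0, 0.0), (5.0, 5.0))  # injected context: x/y bounds
SEPARATION = 3.0                           # metres: person has "moved away"
DWELL = 10.0                               # seconds the object must stay alone

def in_zone(xy, zone=SENSITIVE_ZONE):
    (x0, y0), (x1, y1) = zone
    return x0 <= xy[0] <= x1 and y0 <= xy[1] <= y1

def abandoned(states):
    """True if the object sits in the zone while its owner stays far away
    for at least DWELL seconds (a pre-alarm for predictive surveillance)."""
    start = None
    for s in states:
        apart = hypot(s.person_xy[0] - s.object_xy[0],
                      s.person_xy[1] - s.object_xy[1]) > SEPARATION
        if apart and in_zone(s.object_xy):
            start = s.t if start is None else start
            if s.t - start >= DWELL:
                return True
        else:
            start = None
    return False

# usage: the object stays at (2, 2) while the person walks away
track = [State(float(t), (2.0 + t, 2.0), (2.0, 2.0)) for t in range(0, 15)]
print(abandoned(track))  # separation exceeded from t=4 onwards -> True
```

The composition is both spatial (zone membership, person–object distance) and temporal (the condition must persist), which is exactly the sense in which object-level labels alone cannot express the activity.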

Conclusions

A fundamental problem when designing visual surveillance systems (VSS) is the enormous semantic gap between the ontology of physical signals (the geometric scene description) and that of meanings (the interpretation of the activities of the different actors intervening in the scene); in other words, in the link between the object level and the activity level.

This problem is old and was detected, both in neuroscience and AI, from the foundation times of neurocybernetics. The first works of the

Acknowledgement

The authors are grateful to the CICYT for financial support under project TIN-2004-07661-C0201.

References

  • Arens, M., Gerber, R., Nagel, H.H., 2006. Conceptual representations between video signals and natural language...
  • Bobick, A., 1997. Movement, activity, and action: The role of knowledge in the perception of motion. Royal Society...
  • Brémond, F., et al. Scenario recognition in airborne video imagery.
  • Brémond, F., Maillot, N., Thonnat, M., Vu, V., 2004. Ontologies for video events. Rapport de recherche INRIA...
  • Buxton, H., et al. Advanced visual surveillance using Bayesian networks.
  • Carmona, E.J., Martínez-Cantos, J., Mira, J., 2007. A new video segmentation method of moving objects based on...
  • Castel, C., Chaudron, L., Tessier, C., 1996. What is going on? A high level interpretation of sequences of images. In:...
  • Chaudron, L., et al., 1997. A purely symbolic model for dynamic scene interpretation. Internat. J. Artificial Intell. Tools.
  • Chellappa, R., et al. Image understanding.
  • Christensen, H.I., Matas, J., Kittler, J., 1996. Using grammars for scene interpretation. In: IEEE Internat. Conf. on...
  • Chleq, N., Thonnat, M., 1996. Realtime image sequence interpretation for video-surveillance applications. In: IEEE...
  • Clancey, W.J., 1997. Situated Cognition: On Human Knowledge and Computer Representations.
  • Cohen, I., Medioni, G., 1999. Detecting and tracking moving objects for video surveillance. In: IEEE Conf. on Computer...
  • Dousson, C., Gaborit, P., Ghallab, M., 1993. Situation recognition: Representation and algorithms. In: Proc. of the 13th...
  • Fernyhough, J., et al. Building qualitative event models automatically from visual input.
  • Folgado, E., Rincón, M., Carmona, E.J., Bachiller, M., 2007. A block-based model for monitoring of human activity...
  • Forsyth, D., et al., 2001. Computer Vision: A Modern Approach.
  • François, A.R.J., et al., 2005. VERL: An ontology framework for representing and annotating video events. IEEE Multimedia Magazine.
  • Gibson, J.J., 1979. The Ecological Approach to Visual Perception.
  • Herzog, G., 1992. Utilizing interval-based event representations for incremental high-level scene analysis. In:...