On the correspondence between objects and events for the diagnosis of situations in visual surveillance tasks
Introduction
We have just celebrated the 50th anniversary of Artificial Intelligence and we still have not solved some of its fundamental problems. These unsolved problems are always related to the enormous semantic gap that exists between neurophysiologic signals and cognition; in other words, between low-level physical and geometric descriptions and the emergence of the entities and relations characteristic of the psychic world of perception, reasoning and natural language. An example of this semantic gap is the difficulty of passing from image pattern recognition to scene interpretation. Inadequate and inappropriate solutions to this problem contribute decisively to the stagnation of work on other, apparently coarser-grained problems, such as the diagnosis of situations in surveillance tasks to detect potential terrorist attacks in especially sensitive scenarios (airports, trains, underground, etc.).
In these scenarios, solutions that analyse images at blob and object level give appropriate support to the monitoring and dynamic selective-attention tasks. However, the difficulty arises when we want to pass from simple monitoring to situation diagnosis, because here the static and dynamic roles of the different inferences into which a specific method breaks down the task (hypotheses, claims, etc.) are entities and relations of an ontology at activity level. Diagnosis requires appropriate techniques and knowledge to interpret image sequences, and object-level information is not sufficient. It has to be complemented with the knowledge that a human security guard has of each specific scene, including hypotheses about the purposes and intentions of the different actors participating in it.
In this work, after reviewing the state-of-the-art, we establish correspondences between descriptions at object level and activity level in a simple, real, interior scene, but significant for the surveillance task, where the aim of surveillance is to detect potentially dangerous abandoned objects (rucksacks, baggage, etc.) in areas of special interest for preventing attacks. The work stresses using the same ontological structure at both levels (object and activity) and describing explicitly and declaratively the contextual knowledge that has to be injected at object level to obtain the entities and relations of the activity level necessary for the diagnosis task in this scenario.
The rest of the work is structured as follows. In Section 2 we review the historical development of the semantic gap problem in image understanding, from the foundation stage of neurocybernetics to date. We start with the works of Pitts and McCulloch (1947) and Lettvin et al. (1959), which are the forerunners of the two leading alternative theories in neuroscience (Marr, 1982; Gibson, 1979). We resort to the reviews of Huang (1992) and Chellappa and Kashyap (1992) to link the neurocybernetic stage with the 1980s. We complete the review of the last 20 years in terms of the scene type, representation techniques and use of knowledge, the aim of surveillance and the diagnosis task. In Section 3 we introduce a general ontology of the surveillance task. The rest of the work focuses on interior spaces and the early detection of potentially dangerous situations. We also describe in this section the scenario used to illustrate the possibilities and limitations associated with establishing correspondences between the ontologies of the object and activity levels. Section 4 presents an ontology for the object level and Section 5 another for the activity level. The linking of the object-level entities and relations with the roles and inferences of the diagnosis task is made explicit. We also make explicit the additional knowledge that had to be injected because it was not in the objects. We conclude with some reflections on the difficulty of solving the semantic-gap problem in complex scenarios and on possible solutions.
Historical view
The problem of linking the physical and geometric stage of human vision with the final cognitive interpretation of its meaning is old, both in neuroscience and AI, and we still have not found a reasonably complete and satisfactory solution. It is clear that we understand what we see beyond the 2D-geometric image that is projected onto the retina and also beyond the retinotopic projection of the transformations of this image on the primary visual cortex. However, we do not know how to integrate
Surveillance task
Surveillance is a multidisciplinary task affecting an increasing number of scenarios, services and customers. Accordingly, we can differentiate between two types of surveillance: forensic surveillance, where the aim is to detect an anomalous situation and analyse its causes, and predictive surveillance, where a pre-alarm signal is detected and its possible consequences are analysed.
Both instances imply observing vulnerable areas considered to be of economic, social or
Object level
Following the generic ontology described in the previous section, in this section the object-level ontology is developed. Within the Task–Method hierarchy, at this level, identification, tracking, calculation of 3D coordinates and situation recognition tasks are distinguished (Fig. 2). In turn, each of these tasks can be carried out by different methods. Among these tasks, we focus on the task of recognising human-related situations because of their importance for defining the states and events used
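The Task–Method decomposition described above can be sketched in code. The following is a minimal sketch, assuming toy rules for each method; all class names, method names and thresholds are illustrative assumptions, not the implementation described in the text:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A node of the Task-Method hierarchy: one task, several candidate methods."""
    name: str
    methods: dict = field(default_factory=dict)  # method name -> callable

    def solve(self, method_name, *args):
        # Each task can be carried out by any of its registered methods.
        return self.methods[method_name](*args)

# Two of the object-level tasks distinguished in the text (toy methods).
identification = Task("identification", {
    # classify a blob as 'person' or 'object' from its area (illustrative rule)
    "size-threshold": lambda blob: "person" if blob["area"] > 500 else "object",
})

tracking = Task("tracking", {
    # associate the previous detection with the closest current one
    "nearest-neighbour": lambda prev, curr: min(
        curr,
        key=lambda c: (c["x"] - prev["x"]) ** 2 + (c["y"] - prev["y"]) ** 2),
})

label = identification.solve("size-threshold", {"area": 800})
print(label)  # -> person
```

The point of the structure is that methods stay interchangeable per task, mirroring the "each task can be carried out by different methods" decomposition.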
Activity level
At this level, the events used to describe the scene at the abstraction level appropriate for the task are specified, i.e., the activity carried out by the objects of interest. Object-level labels are not sufficient: activities/situations are spatio-temporal compositions, with injected domain knowledge (context, scenario, task). Different authors distinguish between the emerging process of event composition and an active search process, which requires selective visual attention (Tsotsos et
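As an illustration of such spatio-temporal composition, the following is a minimal sketch of an activity-level event ("abandoned object") composed from object-level states, with the sensitive zone and the thresholds standing in for the injected domain knowledge; all names and values are assumptions, not the authors' implementation:

```python
# Domain knowledge injected at activity level (illustrative values):
SENSITIVE_ZONE = (0, 0, 10, 10)   # scenario knowledge: x0, y0, x1, y1
ABANDON_FRAMES = 3                # task knowledge: persistence before alarm
OWNER_RADIUS = 2.0                # context knowledge: metres from the bag

def in_zone(pos, zone):
    x0, y0, x1, y1 = zone
    return x0 <= pos[0] <= x1 and y0 <= pos[1] <= y1

def detect_abandoned(states):
    """states: per-frame dicts with 'bag' position and 'owner_dist' (metres)."""
    away = 0
    for t, s in enumerate(states):
        # spatial condition (zone) combined with a temporal one (persistence)
        if in_zone(s["bag"], SENSITIVE_ZONE) and s["owner_dist"] > OWNER_RADIUS:
            away += 1
            if away >= ABANDON_FRAMES:
                return ("abandoned-object", t)  # event and its time stamp
        else:
            away = 0
    return None

frames = [{"bag": (5, 5), "owner_dist": d} for d in (0.5, 3.0, 3.5, 4.0)]
print(detect_abandoned(frames))  # -> ('abandoned-object', 3)
```

Note that nothing in the object-level states themselves says "abandoned"; the event only emerges once the zone, radius and persistence knowledge are injected.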
Conclusions
A fundamental problem when designing visual surveillance systems (VSS) is the enormous semantic gap that exists between the ontology of physical signals (geometric scene description) and that of meanings (interpretation of activities of the different actors intervening in this scene). In other words, in the link between the object level and the activity level.
This problem is old and was detected, both in neuroscience and AI, from the foundation times of neurocybernetics. The first works of the
Acknowledgement
The authors are grateful to the CICYT for financial support under project TIN-2004-07661-C0201.
References (60)

- Towards a general theory of action and time. Artificial Intell. (1984)
- Intelligence without representation. Artificial Intell. (1991)
- et al. Incremental recognition of traffic situations from video image sequences. Image Vision Comput. (2000)
- et al. Video-based event recognition: Activity representation and probabilistic recognition methods. Comput. Vision Image Understanding (2004)
- Interpreting a dynamic and uncertain world: Task-based control. Artificial Intell. (1998)
- et al. Conceptual descriptions from monitoring and watching image sequences. Image Vision Comput. (2000)
- From image sequences towards conceptual descriptions. Image Vision Comput. (1988)
- Reconstructing force-dynamic models from video sequences. Artificial Intell. (2003)
- Schema theory: From Kant to McCulloch and beyond
- et al. On the simultaneous interpretation of real world image sequences and their natural language description: The system SOCCER