Conceptual descriptions from monitoring and watching image sequences

https://doi.org/10.1016/S0262-8856(99)00025-6

Abstract

This paper contrasts two ways of forming conceptual descriptions from images. The first, called “monitoring”, just follows the flow of data from images to interpretation, having little need for top-level control. The second, called “watching”, emphasizes the use of top-level control and actively selects evidence for task-based descriptions of the dynamic scenes. Here we look at the effect this has on forming conceptual descriptions. First, we look at how motion verbs and the perception of events contribute to an effective representational scheme. Then we go on to discuss illustrated examples of computing conceptual descriptions from images in our implementations of the monitoring and watching systems. Finally, we discuss future plans and related work.

Introduction

This paper concerns the flow of information in the extraction of conceptual descriptions from image sequences. The questions we want to ask here are: is it just a bottom-up process, or is there some top-down control? Do you start with small fragments of data (say, events or activity fragments) and use these to construct larger forms (say, episodes)? Or do you start with the larger form and try to find the data that fits what you want or expect to find?

We consider two contrasting approaches in some detail here: (1) passively going from images to conceptual descriptions, which we call “monitoring”; and (2) a more active, task-oriented version, which we call “watching”. These approaches are illustrated by the two computer programs hivis-monitor and hivis-watcher.

The passive systems used in “monitoring” require very little top-level control, making them simple to implement. The control level is data-independent and unchanging, having the single task of detecting events in the image data, such as when an object changes motion by starting or stopping. Once the events are detected, a monitoring system such as hivis-monitor composes related, temporally ordered events to produce appropriate episodes, which might, depending on the application environment and the events observed, be closing a door or entering a roundabout. This type of system is useful for off-line analysis, collecting statistics or learning about different behaviours. However, it is not suitable for on-line use, as each episode needs to be completed before it can be used for higher-level interpretation.

For on-line analysis and evaluation we need immediate feedback about the likely emerging behaviour in the scene, which entails more active control of the processing, focussed on the current task. We propose that attentional control in “watching” should depend on a set of preattentive operators that we know are related to the task. In hivis-watcher these operators dynamically allocate markers to objects of interest. For example, in this paper we show how preattentive selection using mutual proximity can identify vehicles that may be involved in overtaking; attentional processing then establishes indexed (deictic) state changes for the events that confirm overtaking behaviour. hivis-watcher's performance is completely dependent upon the given data and the surveillance task it is asked to perform, making it a much more flexible, real-time option.
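
As a concrete illustration of this preattentive stage, the minimal Python sketch below marks pairs of nearby, similarly headed vehicles as overtaking candidates. The Track fields, the thresholds and the function name are our own illustrative assumptions and do not reproduce the hivis-watcher implementation.

    # Illustrative sketch only: a mutual-proximity preattentive operator.
    import math
    from dataclasses import dataclass
    from itertools import combinations

    @dataclass
    class Track:
        obj_id: int
        x: float        # assumed ground-plane position from the tracker (metres)
        y: float
        heading: float  # assumed direction of travel (radians)

    def overtaking_candidates(tracks, radius=10.0, max_heading_diff=0.5):
        """Preattentively select pairs of nearby, similarly headed vehicles."""
        pairs = []
        for a, b in combinations(tracks, 2):
            close = math.hypot(a.x - b.x, a.y - b.y) <= radius
            # wrap the heading difference into [-pi, pi] before comparing
            diff = math.atan2(math.sin(a.heading - b.heading),
                              math.cos(a.heading - b.heading))
            if close and abs(diff) <= max_heading_diff:
                pairs.append((a.obj_id, b.obj_id))
        return pairs

Attentional markers would then be allocated to the selected pairs, so that only their subsequent state changes need be examined for the events that confirm overtaking.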

As shown in Fig. 1, this implementation uses preprocessed visual data in the form of 3D poseboxes from a model-based tracker (details are given by Sullivan [76], Tan et al. [77] and Worrall et al. [82], [83]). This means that the preattentive stage is simulated in the hivis programs, a limitation that is being addressed as described in Section 5.

Our surveillance problem has the following simplifications that make visual understanding more tractable: we use a fixed camera that observes the activity of rigid objects in a structured domain. Examples include a road-traffic scene, where the main interest is the road vehicles, and airport holding areas, where we are interested in the activities of the various special vehicles that unload and service the passenger aeroplanes. We call this single viewpoint of the fixed camera the “official-observer”. From this camera input we wish to obtain a description of the activity taking place in the dynamic wide-area scene, and then an understanding of the dynamic and improvised interactions of the scene objects. There are constraints on the official-observer's interpretation of the objects in the scene: we only see the objects that are in the camera's field-of-view; we do not know each participant's goal (typically something like “go to place X”); and what we see is mostly reactive behaviour (rather than deeply planned).

To provide a background to the work described here, we first consider the nature of conceptual descriptions before describing hivis-monitor and hivis-watcher. Two examples are used in this paper: a road-traffic example that presents scenarios such as overtaking, and an office example that demonstrates how these two programs could be used. This second application area is the subject of future work, but is presented to show the range of applications being considered.


Ontological considerations

Before describing the way that we can passively or actively compute conceptual descriptions, we first consider the relationship between images, motion verbs, and the perception and representation of events.

hivis-monitor

In the passive, off-line approach used in hivis-monitor, descriptions of what is happening now are less important than building a history to which queries can be addressed. In this situation, data-driven, bottom-up control can be used to identify key primitive changes in the data, which can later be combined into larger structures. The lower-level changes are typically events, with the larger structures being episodes that are initiated and concluded by related events.
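
A minimal sketch of this bottom-up composition is given below: primitive start/stop motion events are detected from a per-object speed stream and then paired into simple journey-style episodes. The event vocabulary, the threshold and the function names are illustrative assumptions rather than the hivis-monitor code.

    def detect_events(obj_id, speeds, moving_threshold=0.5):
        """Detect primitive events for one object: changes of motion state."""
        events, was_moving = [], None
        for frame, speed in speeds:               # speeds: per-frame (frame, speed)
            moving = speed > moving_threshold
            if was_moving is not None and moving != was_moving:
                events.append((frame, obj_id, "starts" if moving else "stops"))
            was_moving = moving
        return events

    def compose_episodes(events):
        """Pair temporally ordered starts/stops events into journey episodes."""
        episodes, open_start = [], {}
        for frame, obj_id, kind in sorted(events):
            if kind == "starts":
                open_start[obj_id] = frame
            elif kind == "stops" and obj_id in open_start:
                episodes.append((obj_id, open_start.pop(obj_id), frame))
        return episodes

Because an episode only exists once its concluding event has been seen, descriptions of this kind become available after the behaviour has finished, which is why the approach suits off-line analysis.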

hivis-watcher

In hivis-watcher the surveillance task is specified first, to say what the observer is to look for in the image sequence. This top-down selection provides an expectation of what is going to happen, biasing interpretation, with things that do not comply with the observer's task being ignored. By limiting the observer's behaviour to looking only for task-relevant actor behaviour, we gain a more active, purposive framework. The cost is that we no longer interpret the unfolding happenings in the scene as a whole.
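
One way to picture this task-directed control is the sketch below, in which deictic markers are bound to objects returned by a preattentive cue and only the marked objects are then examined for the indexed state changes that confirm the behaviour. The marker roles, the has_passed test and the frame representation are illustrative assumptions, not the hivis-watcher implementation.

    # frames: an iterable of per-frame dictionaries {obj_id: {"along_road": float}}

    def watch(frames, select_candidates, confirm):
        """Bind deictic markers via a preattentive cue, then attend only to them."""
        markers = {}                                  # deictic role -> bound object id
        for frame in frames:
            if not markers:
                pair = select_candidates(frame)       # preattentive stage
                if pair:
                    markers = {"overtaker": pair[0], "overtaken": pair[1]}
            elif confirm(frame, markers):             # attentional stage
                yield "overtake", dict(markers)       # event emerges as a side-effect
                markers = {}

    def has_passed(frame, markers):
        """Example indexed state change: the overtaker is now ahead of the overtaken."""
        return (frame[markers["overtaker"]]["along_road"]
                > frame[markers["overtaken"]]["along_road"])

Here the “overtake” event is reported only as the final achievement of the routine watching process, rather than being searched for exhaustively.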

Future work

In the “prototype” implementation of hivis-watcher described above, we do not have access to the original image sequence, only the stream of 3D poseboxes from the model-based tracker. Fig. 20 provides an illustration of the model-matcher results from which the 3D poseboxes are derived.

This means that both the preattentive cues and the attentional aspects are derived from these model-matcher results. However, these preattentive cues could be calculated without using the results from the model-matcher.

Related work

The vitra (visual translator) project (an overview is given by Herzog and Wazinski [34]) uses a bottom-up technique like hivis-monitor. The vision component, either actions or xtrack (Nagel [58] gives some background details), provides object trajectories and a static 3D model which, after a number of pipelined stages, form incremental language descriptions. An example of this is the soccer application described by Blocher, Schirra and Stopp [11], [12], [74]. In soccer, the relationship to the

Conclusion

In this paper we have looked at two different ways of identifying events. The first approach (hivis-monitor) involves explicitly looking for all of them and then composing episodes, which is suitable for off-line analysis. In contrast, the second approach (hivis-watcher) has events generated almost as a side-effect, denoting the final achievement of those small routine processes used by the observer to work out what is going on in the scene. We argue that this kind of active, selective processing is better suited to on-line, real-time interpretation of dynamic scenes.

Acknowledgements

This work was funded, at various stages, by the EPSRC grant GR/K08772, the ESPRIT II project P2152 VIEWS, and a SERC CASE award with GEC-Marconi Research Centre. The work described in this paper is based on Howarth [37], for which Hilary Buxton was the main supervisor. Richard Howarth also thanks his other supervisors Mike Clarke and Dave Saunders. Fig. 14, Fig. 17, Fig. 20, Fig. 21, Fig. 23 and Table 4, Table 6 are from Howarth and Buxton [44], are copyright 1998 by IEEE, and are reproduced with permission.

References (85)

  • R. Thibadeau, Artificial perception of actions, Cognitive Science (1986)
  • P.E. Agre, The dynamic structure of everyday life, PhD thesis, MIT AI Lab., AI-TR 1085, October...
  • P.E. Agre et al.
  • N.I. Badler, Temporal scene analysis: conceptual descriptions of object movements. PhD thesis, Department of Computer...
  • D.H. Ballard et al., Computer Vision (1982)
  • D.H. Ballard et al., Deictic codes for the embodiment of cognition, Behaviour and Brain Sciences (1997)
  • A. Baumberg et al., Learning flexible models from image sequences
  • R. Bird et al., Introduction to Functional Programming (1988)
  • L. Birnbaum et al.
  • S.S. Blackman, Multiple-Target Tracking with Radar Applications, Artech House, Inc.,...
  • A. Blocher et al.
  • A. Blocher et al., Time-dependent generation of minimal sets of spatial descriptions
  • A.F. Bobick, Movement, activity and action: the role of knowledge in the perception of motion
  • A.F. Bobick et al.
  • R.D. Boyle et al., Computer Vision: A First Course (1988)
  • K. Bühler, The deictic field of language and deictic words
  • H. Buxton et al.
  • D. Chapman, Vision, Instruction and Action (1991)
  • A.N. Clark, Pattern recognition of noisy sequences of behavioural events using functional combinators, The Computer Journal (1994)
  • D.R. Corrall et al., Visual surveillance, GEC Review (1992)
  • J.H. Fernyhough et al., Generation of semantic regions from image sequences
  • M.A. Fischler et al., Readings in Computer Vision: Issues, Problems, Principles, and Paradigms (1987)
  • J. Forbes et al.
  • H. Garfinkel, Studies in Ethnomethodology (1967)
  • J.J. Gibson, The Ecological Approach to Visual Perception (1979)
  • S.G. Gong et al.
  • I.E. Gordon, Theories of Visual Perception (1989)
  • W.F. Hanks, Referential Practice: Language and Lived Space among the Maya (1990)
  • J. Heritage, Garfinkel and Ethnomethodology (1984)
  • A. Herskovits, Language and Spatial Cognition: an interdisciplinary study of the prepositions in English (1986)
  • G. Herzog et al., VIsual TRAnslator: linking perceptions and natural language descriptions, Artificial Intelligence Review (1994)
  • C.A.R. Hoare, Communicating sequential processes, Communications of the ACM (1978)