Conceptual descriptions from monitoring and watching image sequences

https://doi.org/10.1016/S0262-8856(99)00025-6

Abstract

This paper contrasts two ways of forming conceptual descriptions from images. The first, called “monitoring”, just follows the flow of data from images to interpretation, having little need for top-level control. The second, called “watching”, emphasizes the use of top-level control and actively selects evidence for task-based descriptions of the dynamic scenes. Here we look at the effect this has on forming conceptual descriptions. First, we look at how motion verbs and the perception of events contribute to an effective representational scheme. Then we go on to discuss illustrated examples of computing conceptual descriptions from images in our implementations of the monitoring and watching systems. Finally, we discuss future plans and related work.

Introduction

This paper concerns the flow of information in the extraction of conceptual descriptions from image sequences. The questions we want to ask here are: is it just a bottom-up process, or is there some top-down control? Do you start with small fragments of data (say, events or activity fragments) and use these to construct larger forms (say, episodes)? Or do you start with the larger form and try to find the data that fits what you want or expect to find?

We consider two contrasting approaches in some detail here: (1) passively going from images to conceptual descriptions, which we call “monitoring”; and (2) a more active, task-oriented version, which we call “watching”. These approaches are illustrated by the two computer programs hivis-monitor and hivis-watcher.

The passive systems used in “monitoring” require very little top-level control, making them simple to implement. The control level is data-independent and unchanging, having the single task of detecting events in the image data, such as when an object changes motion by starting or stopping. Once the events are detected, a monitoring system such as hivis-monitor composes related, temporally ordered events to produce appropriate episodes, which might, depending on the application environment and the events observed, be closing a door or entering a roundabout. This type of system is useful for off-line analysis, collecting statistics or learning about different behaviours. However, it is not suitable for on-line use, as each episode needs to be completed before it can be used for higher-level interpretation.

For on-line analysis and evaluation we need immediate feedback about the likely emerging behaviour in the scene, which entails more active control of the processing, focussed on the current task. We propose that attentional control in “watching” should depend on a set of preattentive operators that we know are related to the task. In hivis-watcher these operators dynamically allocate markers to objects of interest. For example, in this paper we show how preattentive selection using mutual proximity can identify vehicles that may be involved in overtaking; attentional processing then establishes indexed (deictic) state changes for the events that confirm overtaking behaviour. hivis-watcher's performance is completely dependent upon the given data and the surveillance task it is asked to perform, making it a much more flexible, real-time option.
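
As a concrete illustration of this preattentive stage, the minimal Python sketch below marks pairs of nearby, similarly headed vehicles as overtaking candidates. The Track fields, the thresholds and the function name are our own illustrative assumptions and do not reproduce the hivis-watcher implementation.

    # Illustrative sketch only: a mutual-proximity preattentive operator.
    import math
    from dataclasses import dataclass
    from itertools import combinations

    @dataclass
    class Track:
        obj_id: int
        x: float        # assumed ground-plane position from the tracker (metres)
        y: float
        heading: float  # assumed direction of travel (radians)

    def overtaking_candidates(tracks, radius=10.0, max_heading_diff=0.5):
        """Preattentively select pairs of nearby, similarly headed vehicles."""
        pairs = []
        for a, b in combinations(tracks, 2):
            close = math.hypot(a.x - b.x, a.y - b.y) <= radius
            # wrap the heading difference into [-pi, pi] before comparing
            diff = math.atan2(math.sin(a.heading - b.heading),
                              math.cos(a.heading - b.heading))
            if close and abs(diff) <= max_heading_diff:
                pairs.append((a.obj_id, b.obj_id))
        return pairs

Attentional markers would then be allocated to the selected pairs, so that only their subsequent state changes need be examined for the events that confirm overtaking.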

As shown in Fig. 1, this implementation uses preprocessed visual data in the form of 3D poseboxes from a model-based tracker (details are given by Sullivan [76], Tan et al. [77] and Worrall et al. [82], [83]). This means that the preattentive stage is simulated in the hivis programs, a limitation that is being addressed as described in Section 5.

Our surveillance problem has the following simplifications that make visual understanding more tractable: we use a fixed camera that observes the activity of rigid objects in a structured domain. Examples include a road-traffic scene, where the main interest is the road vehicles, and airport holding areas, where we are interested in the activities of the various special vehicles that unload and service the passenger aeroplanes. We call this single viewpoint of the fixed camera the “official-observer”. From this camera input we wish to obtain a description of the activity taking place in the dynamic wide-area scene, and then an understanding of the dynamic and improvised interactions of the scene objects. There are constraints on the official-observer's interpretation of the objects in the scene: we only see the objects that are in the camera's field-of-view; we do not know each participant's goal (typically something like “go to place X”); and what we see is mostly reactive behaviour (rather than deeply planned).

To provide a background to the work described here, we first consider the nature of conceptual descriptions before describing hivis-monitor and hivis-watcher. Two examples are used in this paper: a road-traffic example that presents scenarios such as overtaking, and an office example that demonstrates how these two programs could be used. This second application area is the subject of future work, but is presented to show the range of applications being considered.


Ontological considerations

Before describing the way that we can passively or actively compute conceptual descriptions, we first consider the relationship between images, motion verbs, and the perception and representation of events.

hivis-monitor

In the passive, off-line approach used in hivis-monitor, descriptions of what is happening now are less important than building a history to which queries can be addressed. In this situation, data-driven, bottom-up control can be used to identify key primitive changes in the data, which can later be combined into larger structures. The lower-level changes are typically events, with the larger structures being episodes that are initiated and concluded by related events.
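
A minimal sketch of this bottom-up composition is given below: primitive start/stop motion events are detected from a per-object speed stream and then paired into simple journey-style episodes. The event vocabulary, the threshold and the function names are illustrative assumptions rather than the hivis-monitor code.

    def detect_events(obj_id, speeds, moving_threshold=0.5):
        """Detect primitive events for one object: changes of motion state."""
        events, was_moving = [], None
        for frame, speed in speeds:               # speeds: per-frame (frame, speed)
            moving = speed > moving_threshold
            if was_moving is not None and moving != was_moving:
                events.append((frame, obj_id, "starts" if moving else "stops"))
            was_moving = moving
        return events

    def compose_episodes(events):
        """Pair temporally ordered starts/stops events into journey episodes."""
        episodes, open_start = [], {}
        for frame, obj_id, kind in sorted(events):
            if kind == "starts":
                open_start[obj_id] = frame
            elif kind == "stops" and obj_id in open_start:
                episodes.append((obj_id, open_start.pop(obj_id), frame))
        return episodes

Because an episode only exists once its concluding event has been seen, descriptions of this kind become available after the behaviour has finished, which is why the approach suits off-line analysis.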

hivis-watcher

In hivis-watcher the surveillance task is specified first, to say what the observer is to look for in the image sequence. This top-down selection provides an expectation of what is going to happen, biasing interpretation, with things that do not comply with the observer's task being ignored. By limiting the observer's behaviour to looking only for task-relevant actor behaviour, we gain a more active, purposive framework. The cost is that we no longer interpret the unfolding happenings in the scene as a whole.
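
One way to picture this task-directed control is the sketch below, in which deictic markers are bound to objects returned by a preattentive cue and only the marked objects are then examined for the indexed state changes that confirm the behaviour. The marker roles, the has_passed test and the frame representation are illustrative assumptions, not the hivis-watcher implementation.

    # frames: an iterable of per-frame dictionaries {obj_id: {"along_road": float}}

    def watch(frames, select_candidates, confirm):
        """Bind deictic markers via a preattentive cue, then attend only to them."""
        markers = {}                                  # deictic role -> bound object id
        for frame in frames:
            if not markers:
                pair = select_candidates(frame)       # preattentive stage
                if pair:
                    markers = {"overtaker": pair[0], "overtaken": pair[1]}
            elif confirm(frame, markers):             # attentional stage
                yield "overtake", dict(markers)       # event emerges as a side-effect
                markers = {}

    def has_passed(frame, markers):
        """Example indexed state change: the overtaker is now ahead of the overtaken."""
        return (frame[markers["overtaker"]]["along_road"]
                > frame[markers["overtaken"]]["along_road"])

Here the “overtake” event is reported only as the final achievement of the routine watching process, rather than being searched for exhaustively.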

Future work

In the “prototype” implementation of hivis-watcher described above, we do not have access to the original image sequence, only the stream of 3D poseboxes from the model-based tracker. Fig. 20 provides an illustration of the model-matcher results from which the 3D poseboxes are derived.

This means that both the preattentive cues and the attentional aspects are derived from these model-matcher results. However, these preattentive cues could be calculated without using the results from the model-matcher.

Related work

The vitra (visual translator) project (an overview is given by Herzog and Wazinski [34]) uses a bottom-up technique like hivis-monitor. The vision component, either actions or xtrack (Nagel [58] gives some background details), provides object trajectories and a static 3D model which, after a number of pipelined stages, form incremental language descriptions. An example of this is the soccer application described by Blocher, Schirra and Stopp [11], [12], [74]. In soccer, the relationship to the

Conclusion

In this paper we have looked at two different ways of identifying events. The first approach (hivis-monitor) involves explicitly looking for all of them and then composing episodes, which is suitable for off-line analysis. In contrast, the second approach (hivis-watcher) has events generated almost as a side-effect, denoting the final achievement of those small routine processes used by the observer to work out what is going on in the scene. We argue that this kind of active, selective processing is better suited to on-line, real-time interpretation of dynamic scenes.

Acknowledgements

This work was funded, at various stages, by the EPSRC grant GR/K08772, the ESPRIT II project P2152 VIEWS, and a SERC CASE award with GEC-Marconi Research Centre. The work described in this paper is based on Howarth [37], for which Hilary Buxton was the main supervisor. Richard Howarth also thanks his other supervisors Mike Clarke and Dave Saunders. Fig. 14, Fig. 17, Fig. 20, Fig. 21, Fig. 23 and Table 4, Table 6 are from Howarth and Buxton [44], are copyright 1998 by IEEE, and are reproduced with permission.

References (85)

  • R. Thibadeau, Artificial perception of actions, Cognitive Science (1986)
  • P.E. Agre, The dynamic structure of everyday life, PhD thesis, MIT AI Lab., AI-TR 1085, October...
  • P.E. Agre et al.
  • N.I. Badler, Temporal scene analysis: conceptual descriptions of object movements. PhD thesis, Department of Computer...
  • D.H. Ballard et al., Computer Vision (1982)
  • D.H. Ballard et al., Deictic codes for the embodiment of cognition, Behaviour and Brain Sciences (1997)
  • A. Baumberg et al., Learning flexible models from image sequences
  • R. Bird et al., Introduction to Functional Programming (1988)
  • L. Birnbaum et al.
  • S.S. Blackman, Multiple-Target Tracking with Radar Applications, Artech House, Inc.,...
  • A. Blocher et al.
  • A. Blocher et al., Time-dependent generation of minimal sets of spatial descriptions
  • A.F. Bobick, Movement, activity and action: the role of knowledge in the perception of motion
  • A.F. Bobick et al.
  • R.D. Boyle et al., Computer Vision: A First Course (1988)
  • K. Bühler, The deictic field of language and deictic words
  • H. Buxton et al.
  • D. Chapman, Vision, Instruction and Action (1991)
  • A.N. Clark, Pattern recognition of noisy sequences of behavioural events using functional combinators, The Computer Journal (1994)
  • D.R. Corrall et al., Visual surveillance, GEC Review (1992)
  • J.H. Fernyhough et al., Generation of semantic regions from image sequences
  • M.A. Fischler et al., Readings in Computer Vision: Issues, Problems, Principles, and Paradigms (1987)
  • J. Forbes et al.
  • H. Garfinkel, Studies in Ethnomethodology (1967)
  • J.J. Gibson, The Ecological Approach to Visual Perception (1979)
  • S.G. Gong et al.
  • I.E. Gordon, Theories of Visual Perception (1989)
  • W.F. Hanks, Referential Practice: Language and Lived Space among the Maya (1990)
  • J. Heritage, Garfinkel and Ethnomethodology (1984)
  • A. Herskovits, Language and Spatial Cognition: an interdisciplinary study of the prepositions in English (1986)
  • G. Herzog et al., VIsual TRAnslator: linking perceptions and natural language descriptions, Artificial Intelligence Review (1994)
  • C.A.R. Hoare, Communicating sequential processes, Communications of the ACM (1978)