Optimising dynamic graphical models for video content analysis

https://doi.org/10.1016/j.cviu.2008.05.011

Abstract

A key problem in video content analysis using dynamic graphical models is to learn a suitable model structure given observed visual data. We propose a completed likelihood AIC (CL-AIC) scoring function for solving this problem. CL-AIC differs from existing scoring functions in that it explicitly aims to optimise a model's explanation and prediction capabilities simultaneously. CL-AIC is derived as a general scoring function suitable for both static and dynamic graphical models with hidden variables. In particular, we formulate CL-AIC for determining the number of hidden states of a hidden Markov model (HMM) and the topology of a dynamically multi-linked HMM (DML-HMM). The effectiveness of CL-AIC in learning the optimal structure of a dynamic graphical model, especially given sparse and noisy visual data, is shown through comparative experiments against existing scoring functions including the Bayesian information criterion (BIC), Akaike’s information criterion (AIC), integrated completed likelihood (ICL), and variational Bayesian (VB) learning. We demonstrate that CL-AIC is superior to the other scoring functions in building dynamic graphical models for solving two challenging problems in video content analysis: (1) content based surveillance video segmentation and (2) discovering causal/temporal relationships among visual events for group activity modelling.

Introduction

Dynamic graphical models or dynamic Bayesian networks (DBNs), especially hidden Markov models (HMMs) and their variants, have become increasingly popular for modelling and analysing dynamic video content [12], [21], [8], [11], [26], [19], [15], [34]. By using a DBN for video content analysis, we assume that dynamic visual content is generated sequentially by some hidden states of the dynamic scene which evolve over time. These hidden states often have physical meanings. For instance, they could correspond to certain stages/phases of an activity [11], [26], [21], the occurrence of different classes of visual events [19], or different types of transition segments between video shots [8]. The hidden states, as suggested by the name, cannot be observed directly. They can only be inferred from the observed visual data given a learned DBN. Learning a DBN involves estimating both its structure and parameters from data. The structure of a DBN refers primarily to (1) the number of hidden states of each hidden variable of a model and (2) the conditional independence structure of a model, i.e., the factorisation of the state space that determines the topology of a graph. There have been extensive studies in the machine learning community on efficient parameter learning when the structure of the model is known a priori (i.e., assumed) [18]. However, much less effort has been made to tackle the more challenging problem of learning the optimal structure of an unknown DBN [5], [17], [13], [36]. Most previous DBN-based video content modelling approaches avoid the structure learning problem by setting the structure manually [21], [26], [8], [15]. However, it has been shown that a learned structure can be advantageous over manually set ones given sparse and noisy visual data [19].
In this paper, we address the problem of how to accurately and robustly learn the optimal structure of a DBN for video content analysis in a realistic situation where only sparse and noisy visual data are available.

Most previous structure learning techniques have adopted a search-and-score paradigm [17].1 These techniques first define a scoring function/model selection criterion consisting of a maximum likelihood term and a penalty term that penalises complex models. The model structure space is then searched to find the model structure with the highest score. The most commonly used scoring functions include the Bayesian information criterion (BIC) [29], minimum description length (MDL) [28], BDe [20], Akaike’s information criterion (AIC) [1], integrated completed likelihood (ICL) [6], and variational Bayesian (VB) learning [5], [4]. The selected models are ‘optimal’ in the sense that they can either best explain the existing data (BIC, MDL) or best predict unseen data (AIC). It has been demonstrated both theoretically and experimentally in the case of static models that explanation oriented scoring functions suffer from model under-fitting while prediction oriented ones suffer from model over-fitting [23], [30], [6], [35].
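As a concrete illustration of the search-and-score paradigm, the sketch below computes BIC and AIC scores for candidate model structures, assuming each candidate's maximised log-likelihood and free-parameter count are already available. The function names, the lower-is-better sign convention, and the sample size are illustrative assumptions, not the paper's implementation.

```python
import math

def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike's information criterion (lower-is-better convention):
    AIC = -2 log L + 2k, oriented towards predicting unseen data."""
    return -2.0 * log_likelihood + 2.0 * num_params

def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian information criterion:
    BIC = -2 log L + k log(n); the penalty grows with sample size n,
    favouring models that best explain the existing data."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

def select_structure(candidates, num_samples=500):
    """Search-and-score: return the candidate with the best (lowest) BIC.
    `candidates` is a list of (structure_id, log_likelihood, num_params);
    the sample size default is a hypothetical value."""
    scored = [(bic(ll, k, num_samples), sid) for sid, ll, k in candidates]
    return min(scored)[1]
```

In a real structure search, the log-likelihood of each candidate would itself come from an EM-style parameter fit; here it is passed in directly to keep the scoring step isolated.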

To address the problems associated with existing scoring functions, we argue that a better scoring function should select a model structure that is capable of both explaining the observed data and predicting unseen data optimally at the same time. To this end, we derive completed likelihood AIC (CL-AIC) for learning the structure of a DBN. CL-AIC was first introduced in our previous work [35] for Gaussian mixture models (GMMs) which can be represented as a static graphical model (see Fig. 1(a)). In this paper, CL-AIC is derived as a general scoring function suitable for both static and dynamic graphical models, with GMMs and DBNs as special cases. In particular, CL-AIC is formulated for determining the number of hidden states of a HMM and for learning the topology of a dynamically multi-linked HMM (DML-HMM) (see Fig. 1(b) and (c)).
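The exact CL-AIC penalty is derived in Section 2 and is not reproduced here, but the completed-likelihood term it builds on can be sketched for a mixture-type model: the completed log-likelihood equals the observed log-likelihood minus the entropy of the posterior distribution over hidden assignments, the same quantity underlying ICL [6]. A minimal sketch, with hypothetical names:

```python
import math

def completed_log_likelihood(log_lik: float, responsibilities) -> float:
    """Completed log-likelihood for a mixture-type model:
    CL = log L + sum_i sum_k tau_ik * log(tau_ik),
    i.e. the observed log-likelihood minus the entropy of the
    posterior responsibilities tau over hidden assignments.
    `responsibilities` is a list of per-sample distributions."""
    entropy = 0.0
    for row in responsibilities:
        for tau in row:
            if tau > 0.0:
                entropy -= tau * math.log(tau)
    return log_lik - entropy
```

The entropy term vanishes when every sample is assigned to a single hidden state with certainty, so a model whose hidden states cleanly partition the data pays no penalty from this term.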

The effectiveness of CL-AIC on DBN structure learning is demonstrated through comparative experiments against BIC, AIC, ICL, and VB. Experiments on synthetic data were carried out to examine and quantify the effect of sample size on the performance of different scoring functions. The results, for the first time, reveal a key difference between structure learning of static and dynamic graphical models in terms of the definition of data sparseness. We further considered two video content analysis problems using real data: (1) content based surveillance video segmentation and (2) discovering causal/temporal relationships among visual events for group activity modelling. Our experimental results demonstrate that CL-AIC is superior to alternative scoring functions in building dynamic graphical models for video content analysis, especially given sparse and noisy data.

The rest of the paper is structured as follows: in Section 2, we derive CL-AIC as a general scoring function for graphical models with hidden variables. We also formulate CL-AIC for two special cases of DBN, namely a HMM and a DML-HMM, and present synthetic experiments to compare CL-AIC to existing scoring functions including BIC, AIC, ICL, and VB. In Section 3, we address the problem of learning the optimal number of video segments for surveillance video segmentation. Comparative experiments are conducted using over 10 hours of challenging outdoor surveillance video footage. We then compare CL-AIC with other competing scoring functions in learning the topology of a DML-HMM for group activity modelling in Section 4. The paper concludes in Section 5.

Section snippets

Completed likelihood AIC for graphical models with hidden variables

We derive CL-AIC for graphical models with hidden variables with GMMs and DBNs as special cases. Let us first consider the nature of computation in estimating and using a graphical model. Consider an observed data set Y modelled by a graphical model MK with hidden variables. MK can be used to perform three tasks: (1) estimating the unknown distribution that most likely generates Y, (2) inferring the values of hidden variable in MK from Y, and (3) predicting unseen data. Computing (1) and (2)

Surveillance video segmentation

HMMs have been widely used for automatic segmentation of sequential/time-series data such as speech [14], DNA sequences [10] and video [8], [11], [34]. Here we propose to use an HMM for content based surveillance video segmentation, i.e., to segment a continuous surveillance video based on the activities captured in the video. Note that since there is only one video shot in a continuous surveillance video, the conventional shot-change detection based segmentation approach [2] cannot be adopted. We thus
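When scoring HMMs with different numbers of hidden states, the maximum-likelihood term of each candidate can be computed with the forward algorithm. Below is a minimal discrete-emission sketch with per-step scaling; the discrete emissions and plain-list representation are simplifying assumptions (the paper's observations are continuous visual features), but the recursion is the standard one.

```python
import math

def hmm_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm.
    start[i]: initial state probabilities; trans[i][j]: transition
    probabilities; emit[i][o]: emission probabilities (nested lists)."""
    n_states = len(start)
    # initialisation: alpha_1(i) = pi_i * b_i(o_1), then rescale
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    scale = sum(alpha)
    log_lik = math.log(scale)
    alpha = [a / scale for a in alpha]
    # induction: alpha_t(j) = b_j(o_t) * sum_i alpha_{t-1}(i) * a_ij
    for o in obs[1:]:
        alpha = [
            emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik
```

For model selection, this log-likelihood (maximised over parameters by EM) would be plugged into the scoring function for each candidate number of hidden states.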

Discovering causal relationships in group activity modelling

A group activity involves multiple objects co-existing and interacting in a shared common space. Examples of group activities include ‘people playing football’ and ‘shoppers checking out at a supermarket’. Group activity modelling is concerned with not only modelling the actions executed by different objects in isolation, but also the interactions and causal/temporal relationships among these actions. Adopting a DML-HMM based activity modelling approach [19], we consider that a group activity is
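When searching over candidate DML-HMM topologies, each candidate's penalty term depends on its number of free parameters, which in turn depends on which hidden chains are linked. The sketch below counts transition parameters for a candidate topology under a hypothetical one-CPT-per-chain convention; the representation and counting convention are illustrative assumptions, not the paper's exact parameterisation.

```python
def dbn_transition_params(num_states, parents):
    """Free transition parameters of a candidate DBN topology.
    num_states[c]: number of hidden states of chain c.
    parents[c]: chains whose previous-time-slice state feeds chain c.
    Each chain has one conditional probability table with one row per
    joint parent configuration and (num_states[c] - 1) free entries."""
    total = 0
    for c, ps in enumerate(parents):
        parent_configs = 1
        for p in ps:
            parent_configs *= num_states[p]
        total += parent_configs * (num_states[c] - 1)
    return total
```

A topology with more causal links between event chains thus scores a larger penalty, so a scoring function will only retain a link if the gain in (completed) likelihood outweighs the extra parameters.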

Discussion and conclusion

Our experimental results show that the performance of CL-AIC on learning the structure of a dynamic graphical model with hidden variables is superior to that of existing popular alternatives including BIC, AIC, ICL, and VB. This is especially true when the given data set is noisy and sparse. Similar results were reported in the case of static graphical models in [35]. However, it is interesting to note the difference in the definitions of ‘data sparseness’ in the context of DBNs and in that of

References (36)

  • F. Brugnara et al.

Automatic segmentation and labeling of speech based on hidden Markov models

    Speech Communication

    (1993)
  • C. Vogler et al.

    A framework for recognizing the simultaneous aspects of American sign language

    Computer Vision and Image Understanding

    (2001)
  • H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd...
  • N. Babaguchi et al.

    Event based indexing of broadcasting sports video by intermodal collaboration

    IEEE Transactions on Multimedia

    (2002)
  • L.E. Baum et al.

    Statistical inference for probabilistic functions of finite state Markov chains

    Annals of Mathematical Statistics

    (1966)
  • M.J. Beal et al.

Variational Bayesian learning of directed graphical models with hidden variables

    Bayesian Analysis

    (2006)
  • M. Beal et al.

    The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures

    Bayesian Statistics

    (2003)
  • C. Biernacki et al.

    Assessing a mixture model for clustering with the integrated completed likelihood

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2000)
  • C. Biernacki et al.

    Using the classification likelihood to choose the number of clusters

    Computing Science and Statistics

    (1997)
  • J. Boreczky, L. Wilcox, A hidden Markov model framework for video segmentation using audio and image features, in:...
  • C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller, Context-specific independence in Bayesian networks, in:...
  • R. Boys et al.

A Bayesian approach to DNA sequence segmentation

    Biometrics

    (2004)
  • M. Brand et al.

    Discovery and segmentation of activities in video

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2000)
  • M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, in: IEEE Conference on...
  • M. Brand

    Structure discovery in conditional probability models via an entropic prior and parameter extinction

    Neural Computation

    (1999)
  • T. Duong, H. Bui, D. Phung, S. Venkatesh, Activity recognition and abnormality detection with the switching hidden...
  • G.D. Forney

    The Viterbi algorithm

    Proceedings of the IEEE

    (1973)
  • N. Friedman, K. Murphy, S. Russell. Learning the structure of dynamic probabilistic networks, in: Uncertainty in AI,...