Optimising dynamic graphical models for video content analysis

doi:10.1016/j.cviu.2008.05.011

Computer Vision and Image Understanding

Volume 112, Issue 3, December 2008, Pages 310-323

https://doi.org/10.1016/j.cviu.2008.05.011

Abstract

A key problem in video content analysis using dynamic graphical models is to learn a suitable model structure given observed visual data. We propose a completed likelihood AIC (CL-AIC) scoring function for solving the problem. CL-AIC differs from existing scoring functions in that it aims to optimise explicitly both the explanation and prediction capabilities of a model simultaneously. CL-AIC is derived as a general scoring function suitable for both static and dynamic graphical models with hidden variables. In particular, we formulate CL-AIC for determining the number of hidden states for a hidden Markov model (HMM) and the topology of a dynamically multi-linked HMM (DML-HMM). The effectiveness of CL-AIC in learning the optimal structure of a dynamic graphical model, especially given sparse and noisy visual data, is shown through comparative experiments against existing scoring functions including Bayesian information criterion (BIC), Akaike’s information criterion (AIC), integrated completed likelihood (ICL), and variational Bayesian (VB). We demonstrate that CL-AIC is superior to the other scoring functions in building dynamic graphical models for solving two challenging problems in video content analysis: (1) content based surveillance video segmentation and (2) discovering causal/temporal relationships among visual events for group activity modelling.

Introduction

Dynamic graphical models or dynamic Bayesian networks (DBNs), especially hidden Markov models (HMMs) and their variants, have become increasingly popular for modelling and analysing dynamic video content [12], [21], [8], [11], [26], [19], [15], [34]. By using a DBN for video content analysis, we assume that dynamic visual content is generated sequentially by some hidden states of the dynamic scene which evolve over time. These hidden states often have physical meanings. For instance, they could correspond to certain stages/phases of an activity [11], [26], [21], the occurrence of different classes of visual events [19], or different types of transition segments between video shots [8]. The hidden states, as suggested by the name, cannot be observed directly. They can only be inferred from the observed visual data given a learned DBN. Learning a DBN involves estimating both its structure and parameters from data. The structure of a DBN refers primarily to (1) the number of hidden states of each hidden variable of a model and (2) the conditional independence structure of a model, i.e., factorisation of the state space for determining the topology of a graph. There have been extensive studies in the machine learning community on efficient parameter learning when the structure of the model is known a priori (i.e., assumed) [18]. However, much less effort has been made to tackle the more challenging problem of learning the optimal structure of an unknown DBN [5], [17], [13], [36]. Most previous DBN-based video content modelling approaches avoid the structure learning problem by setting the structure manually [21], [26], [8], [15]. However, it has been shown that a learned structure can be advantageous over manually set ones given sparse and noisy visual data [19].
In this paper, we address the problem of how to accurately and robustly learn the optimal structure of a DBN for video content analysis in a realistic situation where only sparse and noisy visual data are available.

Most previous structure learning techniques have adopted a search-and-score paradigm [17]. These techniques first define a scoring function/model selection criterion consisting of a maximum likelihood term and a penalty term to penalise complex models. The model structure space is then searched to find the optimal model structure with the highest score. The most commonly used scoring functions include Bayesian information criterion (BIC) [29], minimum description length (MDL) [28], BDe [20], Akaike’s information criterion (AIC) [1], integrated completed likelihood (ICL) [6], and variational Bayesian (VB) [5], [4]. The selected models are ‘optimal’ in the sense that they can either best explain the existing data (BIC, MDL), or best predict unseen data (AIC). It has been demonstrated both theoretically and experimentally in the case of static models that explanation oriented scoring functions suffer from model under-fitting while prediction oriented ones suffer from model over-fitting [23], [30], [6], [35].
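The search-and-score idea can be illustrated with a minimal Python sketch. It uses only the textbook forms of AIC and BIC (written so that a higher score is better); it does not implement CL-AIC, whose formula is not given in this excerpt. The function names, the choice of a one-dimensional Gaussian emission model for counting HMM parameters, and the log-likelihood values are all assumptions made up for illustration, not results from the paper.

```python
import math


def hmm_num_free_params(num_states):
    """Free parameters of an HMM with 1-D Gaussian emissions (an assumed model).

    (K - 1) initial probabilities + K * (K - 1) transition probabilities
    + 2 * K emission parameters (one mean and one variance per state).
    """
    k = num_states
    return (k - 1) + k * (k - 1) + 2 * k


def aic_score(log_likelihood, num_params):
    """Textbook AIC, rescaled so that higher is better."""
    return log_likelihood - num_params


def bic_score(log_likelihood, num_params, num_samples):
    """Textbook BIC, rescaled so that higher is better."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)


def select_num_states(log_likelihoods, num_samples, criterion):
    """Search step: score every candidate structure and keep the best one.

    log_likelihoods maps a candidate number of hidden states to the maximised
    log-likelihood of the model fitted with that many states.
    """
    def score(k):
        params = hmm_num_free_params(k)
        if criterion == "aic":
            return aic_score(log_likelihoods[k], params)
        return bic_score(log_likelihoods[k], params, num_samples)

    return max(log_likelihoods, key=score)


# Hypothetical maximised log-likelihoods for K = 2, 3, 4 on 100 samples.
example_log_likelihoods = {2: -520.0, 3: -505.0, 4: -500.0}
chosen_by_aic = select_num_states(example_log_likelihoods, 100, "aic")
chosen_by_bic = select_num_states(example_log_likelihoods, 100, "bic")
```

With these made-up numbers the two criteria disagree: the heavier BIC penalty favours the smaller model while AIC favours a larger one, which is the kind of tension between explanation oriented and prediction oriented criteria described above.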

To address the problems associated with existing scoring functions, we argue that a better scoring function should select a model structure that is capable of both explaining the observed data and predicting unseen data optimally at the same time. To this end, we derive completed likelihood AIC (CL-AIC) for learning the structure of a DBN. CL-AIC was first introduced in our previous work [35] for Gaussian mixture models (GMMs) which can be represented as a static graphical model (see Fig. 1(a)). In this paper, CL-AIC is derived as a general scoring function suitable for both static and dynamic graphical models, with GMMs and DBNs as special cases. In particular, CL-AIC is formulated for determining the number of hidden states of a HMM and for learning the topology of a dynamically multi-linked HMM (DML-HMM) (see Fig. 1(b) and (c)).

The effectiveness of CL-AIC for DBN structure learning is demonstrated through comparative experiments against BIC, AIC, ICL, and VB. Experiments on synthetic data were carried out to examine and quantify the effect of sample size on the performance of different scoring functions. The results, for the first time, reveal a key difference in structure learning of static and dynamic graphical models in terms of the definition of data sparseness. We further considered two video content analysis problems using real data: (1) content based surveillance video segmentation and (2) discovering causal/temporal relationships among visual events for group activity modelling. Our experimental results demonstrate that CL-AIC is superior to alternative scoring functions in building dynamic graphical models for video content analysis, especially given sparse and noisy data.

The rest of the paper is structured as follows: in Section 2, we derive CL-AIC as a general scoring function for graphical models with hidden variables. We also formulate CL-AIC for two special cases of DBN, namely a HMM and a DML-HMM, and present synthetic experiments to compare CL-AIC to existing scoring functions including BIC, AIC, ICL, and VB. In Section 3, we address the problem of learning the optimal number of video segments for surveillance video segmentation. Comparative experiments are conducted using over 10 h of challenging outdoor surveillance video footage. We then compare CL-AIC with other competing scoring functions in learning the topology of a DML-HMM for group activity modelling in Section 4. The paper concludes in Section 5.

Section snippets

Completed likelihood AIC for graphical models with hidden variables

We derive CL-AIC for graphical models with hidden variables with GMMs and DBNs as special cases. Let us first consider the nature of computation in estimating and using a graphical model. Consider an observed data set $Y$ modelled by a graphical model $M_{K}$ with hidden variables. $M_{K}$ can be used to perform three tasks: (1) estimating the unknown distribution that most likely generates $Y$, (2) inferring the values of hidden variables in $M_{K}$ from $Y$, and (3) predicting unseen data. Computing (1) and (2)

Surveillance video segmentation

HMMs have been widely used for automatic segmentation of sequential/time-series data such as speech [14], DNA sequences [10] and video [8], [11], [34]. Here we propose to use HMM for content based surveillance video segmentation, i.e. to segment a continuous surveillance video based on activities captured in the video. Note that since there is only one video shot in a continuous surveillance video, the conventional shot-change detection based segmentation approach [2] cannot be adopted. We thus
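As a rough illustration of how an HMM can segment a sequence, the sketch below decodes the most likely hidden state path with the standard Viterbi algorithm and places a segment boundary wherever the decoded state changes. This is a generic technique, not necessarily the procedure used in the paper, whose details are cut off in this excerpt. The two-state model, unit-variance Gaussian emissions, transition probabilities, and observation values are all hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path of an HMM, computed in log space.

    log_init[k]     : log probability of starting in state k
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][k]  : log likelihood of observation t under state k
    """
    num_states = len(log_init)
    delta = [log_init[k] + log_emit[0][k] for k in range(num_states)]
    back_pointers = []
    for t in range(1, len(log_emit)):
        new_delta, pointers = [], []
        for j in range(num_states):
            best_prev = max(range(num_states),
                            key=lambda i: delta[i] + log_trans[i][j])
            pointers.append(best_prev)
            new_delta.append(delta[best_prev] + log_trans[best_prev][j]
                             + log_emit[t][j])
        delta = new_delta
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda k: delta[k])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segment_boundaries(path):
    """Indices at which the decoded hidden state changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


def gaussian_log_pdf(y, mean):
    """Log density of a unit-variance Gaussian (an assumed emission model)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mean) ** 2


# Hypothetical two-state model with "sticky" transitions.
state_means = [0.0, 5.0]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
observations = [0.1, -0.2, 0.3, 5.1, 4.8, 5.2]
log_emit = [[gaussian_log_pdf(y, m) for m in state_means] for y in observations]

decoded_path = viterbi(log_init, log_trans, log_emit)
boundaries = segment_boundaries(decoded_path)
```

In this toy setting the number of hidden states fixes how many kinds of segment can be distinguished, which is why choosing that number is treated as a structure learning problem.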

Discovering causal relationships in group activity modelling

A group activity involves multiple objects co-existing and interacting in a shared common space. Examples of group activities include ‘people playing football’ and ‘shoppers checking out at a supermarket’. Group activity modelling is concerned with not only modelling actions executed by different objects in isolation, but also the interactions and causal/temporal relationships among these actions. Adopting a DML-HMM based activity modelling approach [19], we consider that a group activity is

Discussion and conclusion

Our experimental results show that the performance of CL-AIC on learning the structure of a dynamic graphical model with hidden variables is superior to that of existing popular alternatives including BIC, AIC, ICL, and VB. This is especially true when the given data set is noisy and sparse. Similar results were reported in the case of static graphical models in [35]. However, it is interesting to note the difference in the definitions of ‘data sparseness’ in the context of DBNs and in that of

References (36)

F. Brugnara et al., Automatic segmentation and labeling of speech based on hidden Markov models, Speech Communication (1993)
C. Vogler et al., A framework for recognizing the simultaneous aspects of American sign language, Computer Vision and Image Understanding (2001)
H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd...
N. Babaguchi et al., Event based indexing of broadcasting sports video by intermodal collaboration, IEEE Transactions on Multimedia (2002)
L.E. Baum et al., Statistical inference for probabilistic functions of finite state Markov chains, Annals of Mathematical Statistics (1966)
M.J. Beal et al., Variational Bayesian learning of directed graphical models with hidden variables, Bayesian Analysis (2006)
M. Beal et al., The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures, Bayesian Statistics (2003)
C. Biernacki et al., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
C. Biernacki et al., Using the classification likelihood to choose the number of clusters, Computing Science and Statistics (1997)
J. Boreczky, L. Wilcox, A hidden Markov model framework for video segmentation using audio and image features, in:...
C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller, Context-specific independence in Bayesian networks, in:...
R. Boys et al., A Bayesian approach to DNA sequence segmentation, Biometrics (2004)
M. Brand et al., Discovery and segmentation of activities in video, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, in: IEEE Conference on...
M. Brand, Structure discovery in conditional probability models via an entropic prior and parameter extinction, Neural Computation (1999)
T. Duong, H. Bui, D. Phung, S. Venkatesh, Activity recognition and abnormality detection with the switching hidden...
G.D. Forney, The Viterbi algorithm, Proceedings of the IEEE (1973)
N. Friedman, K. Murphy, S. Russell, Learning the structure of dynamic probabilistic networks, in: Uncertainty in AI,...
