Optimising dynamic graphical models for video content analysis
Introduction
Dynamic graphical models or dynamic Bayesian networks (DBNs), especially hidden Markov models (HMMs) and their variants, have become increasingly popular for modelling and analysing dynamic video content [12], [21], [8], [11], [26], [19], [15], [34]. By using a DBN for video content analysis, we assume that dynamic visual content is generated sequentially by some hidden states of the dynamic scene which evolve over time. These hidden states often have physical meanings. For instance, they could correspond to certain stages/phases of an activity [11], [26], [21], the occurrence of different classes of visual events [19], or different types of transition segments between video shots [8]. The hidden states, as suggested by the name, cannot be observed directly. They can only be inferred from the observed visual data given a learned DBN. Learning a DBN involves estimating both its structure and parameters from data. The structure of a DBN refers primarily to (1) the number of hidden states of each hidden variable of a model and (2) the conditional independence structure of a model, i.e., the factorisation of the state space that determines the topology of the graph. There have been extensive studies in the machine learning community on efficient parameter learning when the structure of the model is known a priori (i.e., assumed) [18]. However, much less effort has been made to tackle the more challenging problem of learning the optimal structure of an unknown DBN [5], [17], [13], [36]. Most previous DBN-based video content modelling approaches avoid the structure learning problem by setting the structure manually [21], [26], [8], [15]. However, it has been shown that a learned structure can be advantageous over a manually set one given sparse and noisy visual data [19].
In this paper, we address the problem of how to accurately and robustly learn the optimal structure of a DBN for video content analysis in a realistic situation where only sparse and noisy visual data are available.
Most previous structure learning techniques have adopted a search-and-score paradigm [17]. These techniques first define a scoring function/model selection criterion consisting of a maximum likelihood term and a penalty term to penalise complex models. The model structure space is then searched to find the optimal model structure with the highest score. The most commonly used scoring functions include Bayesian information criterion (BIC) [29], minimum description length (MDL) [28], BDe [20], Akaike’s information criterion (AIC) [1], integrated completed likelihood (ICL) [6], and variational Bayesian (VB) [5], [4]. The selected models are ‘optimal’ in the sense that they can either best explain the existing data (BIC, MDL), or best predict unseen data (AIC). It has been demonstrated both theoretically and experimentally in the case of static models that explanation-oriented scoring functions suffer from model under-fitting while prediction-oriented ones suffer from model over-fitting [23], [30], [6], [35].
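The search-and-score paradigm can be illustrated with a minimal sketch. It assumes that the maximised log-likelihood of each candidate structure has already been computed (for example by EM), and it uses the standard penalised log-likelihood forms of BIC and AIC, written so that a higher score is better. The function names, the candidate dictionary, and all numbers are hypothetical and are not taken from the experiments in this paper.

```python
import math

def bic_score(log_lik, n_params, n_samples):
    """BIC in 'higher is better' form: log-likelihood minus (k / 2) * log(N)."""
    return log_lik - 0.5 * n_params * math.log(n_samples)

def aic_score(log_lik, n_params):
    """AIC in 'higher is better' form: log-likelihood minus k."""
    return log_lik - n_params

def select_structure(candidates, score_fn):
    """Exhaustive search: return the name of the candidate with the highest score.

    candidates maps a structure name to a tuple
    (maximised log-likelihood, number of free parameters).
    """
    return max(candidates, key=lambda name: score_fn(*candidates[name]))

# Hypothetical fits of three candidate structures to N = 200 samples.
N = 200
candidates = {
    "2 states": (-520.0, 7),
    "3 states": (-505.0, 14),
    "4 states": (-497.0, 23),
}

best_bic = select_structure(candidates, lambda ll, k: bic_score(ll, k, N))
best_aic = select_structure(candidates, aic_score)
print("BIC selects:", best_bic)
print("AIC selects:", best_aic)
```

With these made-up numbers the heavier BIC penalty favours the smallest candidate, while AIC accepts a larger one. This is only meant to show how the penalty term shapes the selection, not to reproduce any reported result.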
To address the problems associated with existing scoring functions, we argue that a better scoring function should select a model structure that is capable of both explaining the observed data and predicting unseen data optimally at the same time. To this end, we derive completed likelihood AIC (CL-AIC) for learning the structure of a DBN. CL-AIC was first introduced in our previous work [35] for Gaussian mixture models (GMMs), which can be represented as a static graphical model (see Fig. 1(a)). In this paper, CL-AIC is derived as a general scoring function suitable for both static and dynamic graphical models, with GMMs and DBNs as special cases. In particular, CL-AIC is formulated for determining the number of hidden states of an HMM and for learning the topology of a dynamically multi-linked HMM (DML-HMM) (see Fig. 1(b) and (c)).
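Scoring functions with a complexity penalty, such as BIC and AIC, need the number of free parameters of each candidate model. As a rough illustration of how this count grows with the number of hidden states, the sketch below counts the free parameters of an HMM under an assumed diagonal-covariance Gaussian observation model. The observation model, the function name, and the feature dimension are assumptions made for illustration; they are not details given in this section.

```python
def hmm_free_params(n_states, obs_dim):
    """Count free parameters of an HMM with diagonal-covariance Gaussian emissions.

    This is an assumed observation model chosen for illustration only.
    """
    initial = n_states - 1                   # initial state distribution sums to 1
    transition = n_states * (n_states - 1)   # each row of the transition matrix sums to 1
    means = n_states * obs_dim               # one mean vector per state
    variances = n_states * obs_dim           # one diagonal covariance per state
    return initial + transition + means + variances

# The count grows quadratically with the number of hidden states.
for k in range(1, 6):
    print(k, hmm_free_params(k, obs_dim=2))
```

Because the transition matrix contributes a quadratic term, adding hidden states becomes increasingly expensive in terms of the penalty, which is one reason the choice of scoring function matters when data are sparse.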
The effectiveness of CL-AIC for DBN structure learning is demonstrated through comparative experiments against BIC, AIC, ICL, and VB. Experiments on synthetic data were carried out to examine and quantify the effect of sample size on the performance of different scoring functions. The results, for the first time, reveal a key difference in structure learning of static and dynamic graphical models in terms of the definition of data sparseness. We further considered two video content analysis problems using real data: (1) content-based surveillance video segmentation and (2) discovering causal/temporal relationships among visual events for group activity modelling. Our experimental results demonstrate that CL-AIC is superior to alternative scoring functions in building dynamic graphical models for video content analysis, especially given sparse and noisy data.
The rest of the paper is structured as follows. In Section 2, we derive CL-AIC as a general scoring function for graphical models with hidden variables. We also formulate CL-AIC for two special cases of DBN, namely an HMM and a DML-HMM, and present synthetic experiments to compare CL-AIC with existing scoring functions including BIC, AIC, ICL, and VB. In Section 3, we address the problem of learning the optimal number of video segments for surveillance video segmentation. Comparative experiments are conducted using over 10 h of challenging outdoor surveillance video footage. We then compare CL-AIC with other competing scoring functions in learning the topology of a DML-HMM for group activity modelling in Section 4. The paper concludes in Section 5.
Section snippets
Completed likelihood AIC for graphical models with hidden variables
We derive CL-AIC for graphical models with hidden variables, with GMMs and DBNs as special cases. Let us first consider the nature of computation in estimating and using a graphical model. Consider an observed data set modelled by a graphical model with hidden variables. The model can be used to perform three tasks: (1) estimating the unknown distribution that most likely generates the data, (2) inferring the values of the hidden variables in the model from the data, and (3) predicting unseen data. Computing (1) and (2)
Surveillance video segmentation
HMMs have been widely used for automatic segmentation of sequential/time-series data such as speech [14], DNA sequences [10] and video [8], [11], [34]. Here we propose to use an HMM for content-based surveillance video segmentation, i.e. to segment a continuous surveillance video based on the activities captured in the video. Note that since there is only one video shot in a continuous surveillance video, the conventional shot-change detection based segmentation approach [2] cannot be adopted. We thus
Discovering causal relationships in group activity modelling
A group activity involves multiple objects co-existing and interacting in a shared common space. Examples of group activities include ‘people playing football’ and ‘shoppers checking out at a supermarket’. Group activity modelling is concerned not only with modelling actions executed by different objects in isolation, but also with the interactions and causal/temporal relationships among these actions. Adopting a DML-HMM based activity modelling approach [19], we consider that a group activity is
Discussion and conclusion
Our experimental results show that the performance of CL-AIC on learning the structure of a dynamic graphical model with hidden variables is superior to that of existing popular alternatives including BIC, AIC, ICL, and VB. This is especially true when the given data set is noisy and sparse. Similar results were reported in the case of static graphical models in [35]. However, it is interesting to note the difference in the definitions of ‘data sparseness’ in the context of DBNs and in that of
References (36)
- et al. Automatic segmentation and labeling of speech based on hidden Markov models. Speech Communication (1993)
- et al. A framework for recognizing the simultaneous aspects of American Sign Language. Computer Vision and Image Understanding (2001)
- H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd...
- et al. Event based indexing of broadcasting sports video by intermodal collaboration. IEEE Transactions on Multimedia (2002)
- et al. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics (1966)
- et al. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis (2006)
- et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics (2003)
- et al. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
- et al. Using the classification likelihood to choose the number of clusters. Computing Science and Statistics (1997)
- J. Boreczky, L. Wilcox, A hidden Markov model framework for video segmentation using audio and image features, in:...
- A Bayesian approach to DNA sequence segmentation. Biometrics
- Discovery and segmentation of activities in video. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Structure discovery in conditional probability models via an entropic prior and parameter extinction. Neural Computation
- The Viterbi algorithm. Proceedings of the IEEE