Hierarchical Bayesian models for unsupervised scene understanding
Introduction
In many real-world applications involving the collection of visual data, obtaining ground truth from a human expert can be very costly or even infeasible. For example, remote autonomous agents operating in novel environments, such as extra-planetary rovers and autonomous underwater vehicles (AUVs), are very effective at collecting huge quantities of visual data, but sending all of it back to human operators quickly is difficult because communication channels are usually bandwidth-limited. In these situations it may be desirable to have algorithms operating on the vehicles that can summarise the data in unsupervised but semantically meaningful ways.
Similarly, many scientific datasets contain terabytes of visual data that require expert knowledge to label in a manner suitable for scientific inference. Obtaining such labels for large datasets can be a large drain on research resources. Again, it would be desirable to have algorithms that can separate this data automatically and in semantically meaningful ways, so that the attention of domain experts can be focused on subsets of the visual data for further labelling. In Section 6 we present a large visual dataset collected by an AUV that exhibits exactly this problem.
Recently there has been much focus on the computer vision problem of scene understanding, whereby multiple sources of information and various contextual relationships are used to create holistic scene models. Typically the aim in scene understanding is to improve scene recognition tasks by taking advantage of scene labels or annotations [1], [2], [3], [4], accompanying caption or body text [5], or even contextual relationships between image labels and low-level visual features [6], [7]. Most of these approaches are weakly supervised, semi-supervised or supervised in nature, and not much attention has been given to fully unsupervised, visual-data-only holistic scene understanding.
In this article we wish to explore how unsupervised, or visual-data-only, techniques can be applied to the problem of scene understanding. To this end we experiment with well-established unsupervised models for clustering, such as Bayesian mixture models [8] and latent Dirichlet allocation [9]. These models cluster coarse whole-image descriptors, or cluster individual parts of images (but not simultaneously). We also explore models that can cluster data on multiple levels simultaneously (e.g. image segments or parts, and images), which are similar to the models presented in [4], [10]. These models discover the relationships between objects in images and then define scene types as distributions over these objects; conversely, knowing the scene type provides contextual information that aids in finding objects within scenes. Finally, we present a new model that can cluster multiple sources of visual information, such as segment and image descriptors, simultaneously. This model takes advantage of holistic image descriptors, which may encode spatial layout, as well as modelling scene types as distributions over objects.
All of these models are compared on standard computer vision datasets as well as a large AUV dataset for scene and object discovery. Emphasis is placed on scene category discovery, since we compare these unsupervised methods to state-of-the-art weakly supervised, semi-supervised and fully supervised techniques for scene understanding. We also compare these models for object discovery in two of the experiments.
In the next section we review the most relevant literature to place this work in context. We then present the hierarchical Bayesian models we use for unsupervised scene understanding in Section 3 and in Section 4 we present variational Bayes algorithms for learning these models. In Section 5 we describe the image and image-segment descriptors we use, since these play a large part in the performance of these purely visual-data driven models. Then in Section 6 we empirically compare all of the aforementioned models, and summarise our results in Section 7.
Section snippets
Relevant literature
Visual context, such as the spatial structure of images, and position and co-occurrence of objects within scenes provide semantic information that aids object and scene recognition in our visual cortex [11], [12]. Similarly, semantic information about images can be derived from the volumes of textual data that accompanies these images in the form of tags, captions and paragraph text on the Internet. Consequently, there has recently been a lot of research focusing on holistic image
Bayesian models for unsupervised scene understanding
In this section we present and discuss the structure of a number of hierarchical Bayesian models of increasing complexity that we apply to unsupervised scene understanding tasks. We start with Bayesian Gaussian mixture models (BGMMs) [8], [32] and latent Dirichlet allocation [9], but with Gaussian clusters or topics (G-LDA), for scene or segment clustering. We then present two novel models for simultaneous image and segment clustering. The first is the simultaneous clustering model (SCM), which
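As a sketch of the two-level structure described above (with notation assumed here, not taken from the paper), a model like the SCM can be written as a generative process in which each image draws a scene label and each segment draws an object label conditioned on that scene:

```latex
\begin{align*}
  y_j &\sim \mathrm{Cat}(\boldsymbol{\beta})
      && \text{scene label for image } j,\\
  z_{ji} \mid y_j &\sim \mathrm{Cat}(\boldsymbol{\theta}_{y_j})
      && \text{object label for segment } i \text{ of image } j,\\
  \mathbf{x}_{ji} \mid z_{ji} &\sim
      \mathcal{N}\!\left(\boldsymbol{\mu}_{z_{ji}}, \boldsymbol{\Sigma}_{z_{ji}}\right)
      && \text{segment descriptor.}
\end{align*}
```

Under this reading, scene types are distributions $\boldsymbol{\theta}_k$ over object clusters, and knowing $y_j$ supplies contextual information for inferring the segment labels $z_{ji}$.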
Variational inference for the SCM and MCM
In this section variational Bayes inference algorithms are derived for learning the posterior latent variables of the SCM and MCM, i.e., posterior hyper-parameters, labels, and number of clusters. We do not present the variational Bayes algorithms for the BGMM, VDP or G-LDA since these can be found in [8], [32], [38], [39]. Also, many of the updates are similar between these models.
Typically, to learn the posterior latent variables of the types of Bayesian models presented in the previous
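In outline (standard mean-field variational Bayes, not the paper's specific update equations), the posterior over labels and parameters is approximated by a factorised distribution and a lower bound on the marginal likelihood is maximised:

```latex
q(\mathbf{y}, \mathbf{z}, \boldsymbol{\Theta})
  = q(\mathbf{y})\, q(\mathbf{z})\, q(\boldsymbol{\Theta}), \qquad
\mathcal{L}[q]
  = \mathbb{E}_q\!\left[\log p(\mathbf{X}, \mathbf{y}, \mathbf{z}, \boldsymbol{\Theta})\right]
  - \mathbb{E}_q\!\left[\log q\right]
  \le \log p(\mathbf{X}).
```

Coordinate ascent then cycles through the factors, setting $\log q^{*}(\cdot) = \mathbb{E}_{q(\text{rest})}\!\left[\log p(\mathbf{X}, \mathbf{y}, \mathbf{z}, \boldsymbol{\Theta})\right] + \text{const}$ for each in turn, until $\mathcal{L}$ converges.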
Image representation
The aforementioned algorithms rely on highly discriminative visual descriptors since they are driven by visual-data alone. We have chosen unsupervised feature learning algorithms for this task as they are easily implemented and have excellent performance in a number of scene recognition tasks, e.g. [41].
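One common unsupervised feature-learning pipeline of this kind learns a patch dictionary with spherical k-means and then soft-encodes patches against it; whether this matches the specifics of [41] is an assumption, and all sizes and parameters below are placeholders.

```python
import numpy as np

def normalise(patches):
    """Contrast-normalise patches to zero mean and unit length."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    return patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)

def learn_dictionary(patches, k=8, iters=20, seed=0):
    """Spherical k-means: learn k unit-norm dictionary atoms."""
    rng = np.random.default_rng(seed)
    D = patches[rng.choice(len(patches), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = (patches @ D.T).argmax(axis=1)  # nearest atom (cosine similarity)
        for j in range(k):
            members = patches[assign == j]
            if len(members):
                centre = members.sum(axis=0)
                D[j] = centre / (np.linalg.norm(centre) + 1e-8)
    return D

def encode(patches, D):
    """Soft-threshold encoding: keep only above-average similarities."""
    sims = patches @ D.T
    return np.maximum(0.0, sims - sims.mean(axis=1, keepdims=True))

rng = np.random.default_rng(1)
raw = rng.normal(size=(500, 36))     # stand-in for 6x6 grey-level patches
patches = normalise(raw)
D = learn_dictionary(patches)
codes = encode(patches, D)
print(codes.shape)                   # (500, 8)
```

The resulting codes are sparse and non-negative, and in a full pipeline would be pooled over image regions to form the segment and image descriptors fed to the clustering models.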
Experiments
In this section we compare the VDP, G-LDA, SCM and MCM in image and segment clustering tasks. We also compare to other unsupervised, weakly-supervised, semi-supervised and supervised algorithms in the literature for scene understanding/recognition. For this comparison we use three standard datasets (single album) and a large novel dataset consisting of twelve surveys (albums) from an autonomous underwater vehicle (AUV). We also explore whether a symmetric Dirichlet prior over the group weights,
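When comparing unsupervised clusterings against labelled baselines, a common proxy for recognition accuracy maps each discovered cluster to its majority ground-truth class before scoring; whether this matches the paper's exact evaluation protocol is an assumption. A minimal sketch:

```python
import numpy as np

def cluster_accuracy(pred, true):
    """Map each discovered cluster to its majority ground-truth class,
    then score overall accuracy under that mapping."""
    pred, true = np.asarray(pred), np.asarray(true)
    hits = 0
    for c in np.unique(pred):
        members = true[pred == c]
        # credit the most frequent true class within this cluster
        hits += np.bincount(members).max()
    return hits / len(true)

# Toy example: cluster 0 maps to class 1, cluster 1 maps to class 0;
# one point in cluster 1 is mislabelled under that mapping.
print(cluster_accuracy([0, 0, 1, 1, 1], [1, 1, 0, 0, 1]))  # 0.8
```

Information-theoretic scores such as normalised mutual information are a common complement, since they do not depend on the cluster-to-class mapping.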
Conclusion
It has been long established that using discriminative visual features is essential for both unsupervised and supervised applications such as scene recognition, object detection, and scene understanding. In this work we have also shown that the choice of model structure has a large influence on results for scene understanding tasks in the absence of any semantic knowledge such as image tags, accompanying text, or image or object labels. We have also shown that with appropriate model structure
Acknowledgements
This work is funded by the Australian Research Council, the New South Wales State Government, and the Integrated Marine Observing System. The authors acknowledge the providers of the datasets and those who released their code that was used in the validation of this work.
References (48)
- A. Oliva et al., Building the gist of a scene: the role of global image features in recognition, Progr. Brain Res. (2006)
- A. Oliva et al., The role of context in object recognition, Trends Cognitive Sci. (2007)
- Generalized Dirichlet distribution in Bayesian analysis, Appl. Math. Comput. (1998)
- A. Hyvärinen et al., Independent component analysis: algorithms and applications, Neural Networks (2000)
- D.M. Blei et al., Modeling annotated data
- C. Wang et al., Simultaneous image classification and annotation
- L.-J. Li, R. Socher, L. Fei-Fei, Towards total scene understanding: Classification, annotation and segmentation in an...
- L. Du, L. Ren, D. Dunson, L. Carin, A Bayesian model for simultaneous image clustering, annotation and object...
- R. Socher, L. Fei-Fei, Connecting modalities: semi-supervised segmentation and annotation of images using unaligned...
- L. Li, M. Zhou, G. Sapiro, L. Carin, On the integration of topic modeling and dictionary learning, in: International...
- A Bayesian approach to multimodal visual dictionary learning
- A variational Bayesian framework for graphical models, Adv. Neural Inf. Process. Syst.
- Latent Dirichlet allocation, J. Mach. Learn. Res.
- What, where and who? Telling the story of an image by activity classification, scene recognition and object categorization
- Using the forest to see the trees: exploiting context for visual object detection and localization, Commun. ACM
- Object-graphs for context-aware category discovery
- Spatial latent Dirichlet allocation, Adv. Neural Inf. Process. Syst.
- Hierarchical kernel stick-breaking process for multi-task image analysis