Infinite Liouville mixture models with application to text and texture categorization
Highlights
► An infinite mixture model, based on the Liouville family of distributions, is proposed. ► A hierarchical nonparametric Bayesian approach is developed for learning the proposed mixture model. ► Two challenging applications involving text categorization and texture discrimination are investigated.
Introduction
With the progress in data capture technology, very large databases composed of textual documents, images and videos are created and updated every day, and this trend is expected to grow in the future. Modeling and organizing the content of these databases is an important and challenging problem which can be approached using data clustering techniques. Data clustering methods decompose a collection of data into a given number of disjoint clusters which are optimal in terms of some predefined criterion functions. The main goal is to organize unlabeled feature vectors into clusters such that vectors within a cluster are more similar to each other than to vectors belonging to other clusters (Everitt, 1993). The clustering problem arises in many disciplines and the literature related to it is abundant. A typical and widely used approach is to consider finite mixture models, which have been used to resolve a variety of clustering problems (McLachlan and Peel, 2000). A common problem when using finite mixtures is the difficulty of determining the correct number of clusters. Indeed, when using mixtures we generally face the model selection dilemma: simple models cause underfitting and thus large approximation errors, while complex models cause overfitting and thus estimation errors (Allen and Greiner, 2000, Bouchard and Celeux, 2006). There have been extensive research efforts to provide model selection capability to finite mixtures. Many deterministic approaches have been proposed, generally involving a trade-off between simplicity and goodness of fit (see Bouguila and Ziou, 2006, for instance, for discussions and details). Parametric and nonparametric Bayesian approaches have also been proposed (Robert, 2007). Successful studies of Bayesian approaches have been completed in a variety of domains including computer vision, image processing, data mining and pattern recognition.
Indeed, for many problems it is possible to use Bayesian inference for model estimation and selection by exploiting available prior information about the mixture’s parameters (Robert, 2007). Bayesian approaches are attractive for several reasons and automatically embody Occam’s razor (Blumer et al., 1987). In particular, there has recently been a good deal of interest in using nonparametric Bayesian approaches for machine learning and pattern recognition problems. Rooted in the early works of Ferguson (1973) and Antoniak (1974), progress has been made in both theory (Ishwaran, 1998, Gopalan and Berry, 1998, Kim, 1999, Neal, 2000) and application (Rasmussen, 2000, Kivinen et al., 2007, Bouguila and Ziou, 2010). This renewed interest is justified by the fact that nonparametric Bayesian approaches allow the number of mixture components to grow to infinity, which removes the problem of selecting the number of clusters: this number can increase or decrease as new data arrive (Ghosh and Ramamoorthi, 2003).
In approaches to machine learning and pattern recognition that use mixture models, success also depends on the ability to select efficiently the most accurate probability density functions (pdfs) to represent the mixture components. Exhaustive evaluation of all possible pdfs is infeasible, thus it is crucial to take into account the nature of the data when a given pdf is selected. Unlike the majority of research works, which have focused on Gaussian continuous data, in this paper we focus on the clustering of proportional data, which appear naturally in many applications from different domains (Aitchison, 1986, Bouguila et al., 2004, Bouguila and Ziou, 2006, Bouguila and Ziou, 2007). Proportional data are subject to two restrictions, namely non-negativity and the unit-sum constraint. Perhaps the most relevant application where proportional data are naturally generated is text classification, where text documents are represented as normalized histograms of keywords. Another application which has motivated our work is image categorization, where modern approaches are based on the so-called bag-of-features representation, a technique inspired by text analysis, in which local image descriptors are extracted and presented as a vector of proportions (Bouguila and Ziou, 2010). We therefore propose in this paper an autonomous unsupervised nonparametric Bayesian clustering method for proportional data that performs clustering without a priori information about the number of clusters. As mentioned above, the choice of an appropriate family of distributions is one of the most challenging problems in statistical analysis in general and mixture models in particular (Cox, 1990). Indeed, the success of mixture-based learning techniques lies in the accurate choice of the probability density functions used to describe the components.
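The two restrictions on proportional data can be made concrete with a short sketch (the keyword counts below are invented, purely for illustration): a document's keyword counts are normalized into a vector that is non-negative and sums to one.

```python
import numpy as np

def to_proportions(counts):
    """Normalize a vector of non-negative counts into proportions.

    The result satisfies the two restrictions of proportional data:
    every entry is non-negative and the entries sum to one.
    """
    counts = np.asarray(counts, dtype=float)
    if np.any(counts < 0):
        raise ValueError("counts must be non-negative")
    total = counts.sum()
    if total == 0:
        raise ValueError("at least one count must be positive")
    return counts / total

# A toy document represented by the counts of 4 keywords.
x = to_proportions([3, 0, 5, 2])
assert np.all(x >= 0) and np.isclose(x.sum(), 1.0)
```

The same normalization applies to bag-of-features histograms of quantized local image descriptors, which is why both applications considered later produce data of the same form.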
Many statistical learning analyses begin with the assumption that the data clusters are generated from the Gaussian distribution, which is usually only an approximation used mainly for convenience. Although a Gaussian may provide a reasonable approximation to many distributions, it is certainly not the best approximation in many real-world problems, in particular those involving proportional data, as we have shown in our previous works (Bouguila and Ziou, 2010, Bouguila and Ziou, 2008) where we investigated the use of nonparametric Bayesian learning for Dirichlet (Bouguila and Ziou, 2008) and generalized Dirichlet (Bouguila and Ziou, 2010) mixture models. Both models have their own advantages but are not exempt from drawbacks. The Dirichlet involves a small number of parameters (a D-variate Dirichlet is defined by D + 1 parameters), but has a very restrictive covariance structure, whose off-diagonal entries are always negative, which makes it inapplicable in several real problems (Bouguila and Ziou, 2006). On the other hand, the generalized Dirichlet has a more general covariance matrix, whose entries can be positive or negative, but involves a clearly larger number of parameters (a D-variate generalized Dirichlet is defined by 2D parameters) (Bouguila and Ziou, 2006). The present paper proposes another alternative, called the Beta-Liouville distribution, which we extract from the Liouville family of distributions. Like the generalized Dirichlet, and in contrast to the Dirichlet, the Beta-Liouville has a general covariance structure whose entries can be positive or negative, yet it involves a smaller number of parameters (a D-variate Beta-Liouville is defined by D + 2 parameters). It is noteworthy that Liouville distributions have been used in the past only as priors to the multinomial (Wong, 2009); their potential as effective parent distributions to model the data directly in their own right has long been neglected.
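The parameter counts cited above can be summarized in a small sketch (the helper function and its name are ours, purely for illustration): for a D-variate proportional vector, the Dirichlet needs D + 1 parameters, the generalized Dirichlet 2D, and the Beta-Liouville D + 2.

```python
def n_parameters(model, D):
    """Number of free parameters of each D-variate distribution,
    as stated in the text: Dirichlet D+1, generalized Dirichlet 2D,
    Beta-Liouville D+2."""
    return {"dirichlet": D + 1,
            "generalized_dirichlet": 2 * D,
            "beta_liouville": D + 2}[model]

# For D >= 2 the Beta-Liouville is never more costly than the
# generalized Dirichlet, while keeping a covariance structure whose
# entries can be positive or negative.
for D in (2, 10, 100):
    assert n_parameters("beta_liouville", D) <= n_parameters("generalized_dirichlet", D)
```

The gap widens linearly in D, which matters in high-dimensional settings such as large-vocabulary text models.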
In this paper we adopt nonparametric Bayesian learning to fit infinite Beta-Liouville mixture models to proportional data, which, to the best of our knowledge, has never been considered before. In particular, we show that improved clustering and modeling performance can result from this approach as compared to the infinite Dirichlet and infinite generalized Dirichlet approaches.
This paper is organized as follows. Preliminaries and details about the Liouville mixture model are given in Section 2. The principles of our infinite mixture model and its prior-posterior analysis are presented in Section 3, where a complete learning algorithm is also developed. Performance results, involving two interesting applications, namely text categorization and texture discrimination, are presented in Section 4 and show the merits of the proposed model. Section 5 contains the summary, conclusions and directions for future work.
Section snippets
The Beta-Liouville distribution
In dimension D, the Liouville distribution with positive parameters (α1, … , αD) and generating density f(·) with parameters ξ is defined by (Fang et al., 1990)

p(X1, … , XD ∣ α1, … , αD, ξ) = f(u ∣ ξ) [Γ(Σ_{d=1}^{D} αd) / u^{(Σ_{d=1}^{D} αd) − 1}] ∏_{d=1}^{D} Xd^{αd − 1}/Γ(αd)

where u = Σ_{d=1}^{D} Xd. The mean, the variance and the covariance of a Liouville distribution are given by Fang et al. (1990) in terms of the first two moments E(u) and E(u²) of u under the generating density f(·).
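When the generating density f(·) is chosen to be a Beta(α, β), the Liouville construction yields the Beta-Liouville distribution studied in this paper. The sketch below implements its log-density following the standard form in the Liouville literature; the function name and parameter ordering are our own, so it should be read as an illustration rather than the paper's exact specification.

```python
from math import lgamma, log

def beta_liouville_logpdf(x, alphas, alpha, beta):
    """Log-density of a Beta-Liouville distribution, i.e. a Liouville
    distribution whose generating density is Beta(alpha, beta).

    x       : sequence of D positive proportions with 0 < sum(x) < 1
    alphas  : D positive Liouville parameters (alpha_1, ..., alpha_D)
    alpha, beta : positive parameters of the Beta generating density
    """
    u = sum(x)                      # u = x_1 + ... + x_D
    assert 0.0 < u < 1.0
    s = sum(alphas)
    # Normalizing constant: Liouville part times the Beta part.
    logp = lgamma(s) + lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
    # Product over dimensions.
    for xd, ad in zip(x, alphas):
        logp += (ad - 1.0) * log(xd) - lgamma(ad)
    # Contribution of the generating Beta density evaluated at u.
    logp += (alpha - s) * log(u) + (beta - 1.0) * log(1.0 - u)
    return logp
```

A convenient sanity check: for D = 1 the density reduces to a Beta(α, β) density in x, independently of the single Liouville parameter α1.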
The Bayesian mixture model
Bayesian algorithms represent a class of learning techniques that have been intensively studied in the past. The importance of Bayesian approaches has been increasingly acknowledged in recent years and there is now considerable evidence that Bayesian algorithms are useful in several applications and domains (Robert, 2007). This can be explained by advances in Markov chain Monte Carlo (MCMC) methods, which have made the application of Bayesian approaches feasible and relatively straightforward.
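A minimal illustration of how nonparametric Bayesian mixtures avoid fixing the number of clusters is the Chinese restaurant process prior underlying Dirichlet process mixtures. The sketch below is generic, not the paper's exact sampler, and the concentration value is arbitrary: it draws cluster assignments whose number of occupied clusters grows slowly with the sample size rather than being fixed in advance.

```python
import random

def crp_assignments(n, concentration, seed=0):
    """Draw cluster assignments for n observations from a Chinese
    restaurant process prior with the given concentration parameter.

    Observation i joins an existing cluster k with probability
    n_k / (i + concentration) and opens a new cluster with probability
    concentration / (i + concentration).
    """
    rng = random.Random(seed)
    counts = []                       # counts[k] = current size of cluster k
    labels = []
    for i in range(n):
        r = rng.random() * (i + concentration)
        acc = 0.0
        for k, nk in enumerate(counts):
            acc += nk
            if r < acc:               # join existing cluster k
                counts[k] += 1
                labels.append(k)
                break
        else:                          # open a new cluster
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels

labels = crp_assignments(1000, concentration=1.0)
# The number of occupied clusters grows roughly like
# concentration * log(n), not linearly in n.
print(len(set(labels)))
```

In a full Dirichlet process mixture sampler, this prior is combined with the component likelihoods (here, Beta-Liouville densities) when resampling each assignment.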
Design of experiments
The main goal of our experiments is to compare the performance of the infinite Beta-Liouville mixture (or IBLM, as we shall henceforth refer to it) to two previously proposed models for proportional data, namely the infinite Dirichlet (IDM) and infinite generalized Dirichlet (IGDM) mixtures. We refer the reader to Bouguila and Ziou (2008) and Bouguila and Ziou (2010) for details about the IDM and IGDM, respectively. In this section, we are chiefly concerned with applications which involve proportional data, namely text categorization and texture discrimination.
Conclusion
In this paper, we have presented a hierarchical nonparametric Bayesian statistical framework based on infinite Beta-Liouville mixtures for proportional data modeling and classification, motivated by the importance of this kind of data in several applications. Infinite models have many advantages: they are general, consistent, powerful, extensible and flexible enough to be applied to a variety of learning problems. We estimate the posterior distributions of our model parameters using MCMC simulation techniques.
Acknowledgment
The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC).
References (50)
- Blumer et al. (1987). Occam’s razor. Inform. Process. Lett.
- Gupta et al. (1987). Multivariate Liouville distributions. J. Multivar. Anal.
- et al. (2000). Adaptive information filtering using evolutionary computation. Inform. Sci.
- Aitchison (1986). The Statistical Analysis of Compositional Data.
- Allen, T.V., Greiner, R. (2000). Model selection criteria for learning belief nets: an empirical comparison. In: Proc....
- Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist.
- Blei et al. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res.
- Bouchard, Celeux (2006). Selection of generative models in classification. IEEE Trans. Pattern Anal. Machine Intell.
- Bouguila, Ziou (2006). A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Trans. Image Process.
- Bouguila, Ziou (2006). Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans. Knowledge Data Eng.