Pattern Recognition Letters

Volume 33, Issue 2, 15 January 2012, Pages 103-110

Infinite Liouville mixture models with application to text and texture categorization

https://doi.org/10.1016/j.patrec.2011.09.037

Abstract

This paper addresses the problem of proportional data modeling and clustering using mixture models, a problem of great interest and importance for many practical pattern recognition, image processing, data mining and computer vision applications. Finite mixture models are broadly applicable to clustering problems, but they involve the challenging problem of selecting the number of clusters, which requires a trade-off. The number of clusters must be sufficient to provide the discriminating capability between clusters required for a given application. Indeed, if too many clusters are employed, overfitting may occur; if too few are used, underfitting results. Here we approach the problem of modeling and clustering proportional data using infinite mixtures, which have been shown to be an efficient alternative to finite mixtures by overcoming the concern regarding the selection of the optimal number of mixture components. In particular, we propose and discuss an infinite Liouville mixture model whose parameters are fitted to the data through a principled Bayesian algorithm that we have developed and which allows uncertainty in the number of mixture components. Our experimental evaluation involves two challenging applications, namely text classification and texture discrimination, and suggests that the proposed approach can be an excellent choice for proportional data modeling.

Highlights

► An infinite mixture model, based on the Liouville family of distributions, is proposed.
► A hierarchical nonparametric Bayesian approach is developed for learning the proposed mixture model.
► Two challenging applications involving text categorization and texture discrimination are investigated.

Introduction

With the progress in data capture technology, very large databases composed of textual documents, images and videos are created and updated every day, and this trend is expected to grow in the future. Modeling and organizing the content of these databases is an important and challenging problem which can be approached using data clustering techniques. Data clustering methods decompose a collection of data into a given number of disjoint clusters that are optimal in terms of predefined criterion functions. The main goal is to organize unlabeled feature vectors into clusters such that vectors within a cluster are more similar to each other than to vectors belonging to other clusters (Everitt, 1993). The clustering problem arises in many disciplines and the literature on it is abundant. A typical widely used approach is the consideration of finite mixture models, which have been used to resolve a variety of clustering problems (McLachlan and Peel, 2000). A common problem when using finite mixtures is the difficulty of determining the correct number of clusters. Indeed, when using mixtures we generally face the model selection dilemma: simple models cause underfitting and hence large approximation errors, while complex models cause overfitting and hence estimation errors (Allen and Greiner, 2000, Bouchard and Celeux, 2006). There have been extensive research efforts to provide model selection capability to finite mixtures. Many deterministic approaches have been proposed; they generally involve a trade-off between simplicity and goodness of fit (see Bouguila and Ziou, 2006, for instance, for discussions and details). Parametric and nonparametric Bayesian approaches have also been proposed (Robert, 2007). Successful studies of Bayesian approaches have been completed in a variety of domains including computer vision, image processing, data mining and pattern recognition. Indeed, for many problems it is possible to use Bayesian inference for model estimation and selection by exploiting available prior information about the mixture’s parameters (Robert, 2007). Bayesian approaches are attractive for several reasons and automatically embody Occam’s razor (Blumer et al., 1987). In particular, there has recently been a good deal of interest in using nonparametric Bayesian approaches for machine learning and pattern recognition problems. Rooted in the early works of Ferguson (1973) and Antoniak (1974), progress has been made in both theory (Ishwaran, 1998, Gopalan and Berry, 1998, Kim, 1999, Neal, 2000) and application (Rasmussen, 2000, Kivinen et al., 2007, Bouguila and Ziou, 2010). This renewed interest is justified by the fact that nonparametric Bayesian approaches allow the number of mixture components to grow to infinity, which removes the problem of selecting the number of clusters and lets the number of clusters increase or decrease as new data arrive (Ghosh and Ramamoorthi, 2003).
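To make the last point concrete, the following minimal Python sketch (ours, not the authors' learning algorithm) draws cluster assignments from a Chinese restaurant process, the predictive rule underlying Dirichlet process (infinite) mixtures: the number of occupied clusters is not fixed in advance and tends to grow with the data. The concentration parameter `alpha` is an illustrative choice.

```python
# Sketch: Chinese restaurant process, the predictive rule behind
# Dirichlet-process (infinite) mixtures.  Illustrative only.
import random

def crp_assignments(n_points, alpha=1.0, seed=0):
    """Sample CRP cluster assignments for n_points observations."""
    rng = random.Random(seed)
    counts = []                      # counts[k] = size of cluster k
    labels = []
    for i in range(n_points):
        # Join existing cluster k with prob. counts[k]/(i + alpha);
        # open a brand-new cluster with prob. alpha/(i + alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            counts.append(1)         # a new cluster is created
            labels.append(len(counts) - 1)
    return labels, counts

for n in (10, 100, 1000):
    _, counts = crp_assignments(n)
    print(n, "points ->", len(counts), "clusters")
```

Running the sketch shows the cluster count growing slowly with the number of points, which is exactly the behavior that removes the need to fix the number of clusters in advance.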

In approaches to machine learning and pattern recognition that use mixture models, success also depends on the ability to select the most accurate probability density functions (pdfs) to represent the mixture components. Exhaustive evaluation of all possible pdfs is infeasible, so it is crucial to take into account the nature of the data when a given pdf is selected. Unlike the majority of research works, which have focused on Gaussian continuous data, we shall focus in this paper on clustering proportional data, which appear naturally in many applications from different domains (Aitchison, 1986, Bouguila et al., 2004, Bouguila and Ziou, 2006, Bouguila and Ziou, 2007). Proportional data satisfy two constraints, namely non-negativity and a unit-sum constraint. The most relevant application where proportional data are naturally generated is perhaps text classification, where text documents are represented as normalized histograms of keywords. Another application that has motivated our work is image categorization, where modern approaches are based on the so-called bag of features, a technique inspired by text analysis in which local image descriptors are quantized and represented as vectors of proportions (Bouguila and Ziou, 2010). We therefore propose in this paper an autonomous unsupervised nonparametric Bayesian clustering method for proportional data that performs clustering without a priori information about the number of clusters. As mentioned above, the choice of an appropriate family of distributions is one of the most challenging problems in statistical analysis in general and mixture modeling in particular (Cox, 1990). Indeed, the success of mixture-based learning techniques lies in the accurate choice of the probability density functions used to describe the components. Many statistical learning analyses begin with the assumption that the data clusters are generated from a Gaussian distribution, which is usually only an approximation used mainly for convenience. Although a Gaussian may provide a reasonable approximation to many distributions, it is certainly not the best approximation in many real-world problems, in particular those involving proportional data, as we have shown in our previous works (Bouguila and Ziou, 2010, Bouguila and Ziou, 2008), where we investigated the use of nonparametric Bayesian learning for Dirichlet (Bouguila and Ziou, 2008) and generalized Dirichlet (Bouguila and Ziou, 2010) mixture models. Both models have their own advantages but are not exempt from drawbacks. The Dirichlet involves a small number of parameters (a D-variate Dirichlet is defined by D + 1 parameters), but has a very restrictive negative covariance structure which makes it inapplicable in several real problems (Bouguila and Ziou, 2006). On the other hand, the generalized Dirichlet has a more general covariance matrix whose entries can be positive or negative, but it clearly involves a larger number of parameters (a D-variate generalized Dirichlet is defined by 2D parameters) (Bouguila and Ziou, 2006). The present paper proposes another alternative, called the Beta-Liouville distribution, that we extract from the Liouville family of distributions. Like the generalized Dirichlet and in contrast to the Dirichlet, the Beta-Liouville has a general covariance structure which can be positive or negative, but it involves a smaller number of parameters (a D-variate Beta-Liouville is defined by D + 2 parameters).
It is noteworthy that Liouville distributions have been used in the past only as priors to the multinomial (Wong, 2009); their potential as effective parent distributions for modeling the data directly has long been neglected. In this paper we adopt nonparametric Bayesian learning to fit infinite Beta-Liouville mixture models to proportional data, which, to the best of our knowledge, has never been considered before. In particular, we show that this approach can yield improved clustering and modeling performance compared to the infinite Dirichlet and infinite generalized Dirichlet approaches.
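As a concrete illustration of the kind of data involved, the short sketch below (the toy corpus and vocabulary are invented for the example) converts text documents into proportional vectors, i.e., normalized keyword histograms satisfying non-negativity and the unit-sum constraint.

```python
# Illustrative only: turning raw documents into proportional vectors
# (normalized keyword histograms), the kind of data the paper models.
from collections import Counter

vocabulary = ["bayes", "mixture", "texture", "cluster"]
documents = [
    "mixture of mixture models cluster data",
    "texture features cluster texture patches",
]

def to_proportions(doc, vocab):
    counts = Counter(w for w in doc.split() if w in vocab)
    total = sum(counts.values()) or 1        # avoid division by zero
    # Each vector is non-negative and sums to one: proportional data.
    return [counts[w] / total for w in vocab]

for doc in documents:
    print(to_proportions(doc, vocabulary))
```

A common convention when fitting Dirichlet-type or Liouville-type densities is to drop one of the resulting D + 1 proportions, so that the remaining D components sum to strictly less than one, matching the support of the Liouville density given in Section 2.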

This paper is organized as follows. Preliminaries and details about the Liouville mixture model are given in Section 2. The principles of our infinite mixture model and its prior-posterior analysis are presented in Section 3, where a complete learning algorithm is also developed. Performance results, involving two applications, namely text categorization and texture discrimination, are presented in Section 4 and show the merits of the proposed model. Section 5 contains the summary, conclusions and potential future work.

Section snippets

The Beta-Liouville distribution

In dimension D, the Liouville distribution, with positive parameters (α1, …, αD) and generating density f(·) with parameters ξ, is defined by (Fang et al., 1990)

$$p(X \mid \alpha_1,\ldots,\alpha_D,\xi) = f(u \mid \xi)\,\frac{\Gamma\!\left(\sum_{d=1}^{D}\alpha_d\right)}{u^{\sum_{d=1}^{D}\alpha_d - 1}}\prod_{d=1}^{D}\frac{X_d^{\alpha_d - 1}}{\Gamma(\alpha_d)}$$

where $X = (X_1,\ldots,X_D)$, $u = \sum_{d=1}^{D} X_d < 1$, and $X_d > 0$ for $d = 1,\ldots,D$. The mean, the variance and the covariance of a Liouville distribution are given by (Fang et al., 1990)

$$E(X_d) = E(u)\,\frac{\alpha_d}{\sum_{d=1}^{D}\alpha_d}$$

$$\mathrm{Var}(X_d) = E(u^2)\,\frac{\alpha_d(\alpha_d+1)}{\sum_{d=1}^{D}\alpha_d\left(\sum_{d=1}^{D}\alpha_d + 1\right)} - E(u)^2\,\frac{\alpha_d^2}{\left(\sum_{d=1}^{D}\alpha_d\right)^2}$$

$$\mathrm{Cov}(X_l, X_k) = \frac{\alpha_l\,\alpha_k}{\sum_{d=1}^{D}\alpha_d}\left(\frac{E(u^2)}{\sum_{d=1}^{D}\alpha_d + 1} - \frac{E(u)^2}{\sum_{d=1}^{D}\alpha_d}\right)$$

where E(u) and E(u²) are the first two moments of u under the generating density f(·|ξ).
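The following is a direct transcription of the density above into Python (a sketch, not the authors' code), taking a Beta generating density for u so that the result is the Beta-Liouville case discussed in the paper; the parameter values in the usage line are arbitrary illustrations.

```python
# Sketch: log density of a Liouville distribution with a Beta
# generating density for u (i.e., the Beta-Liouville case).
from math import lgamma, log, exp

def log_beta_pdf(u, a, b):
    """log Beta(a, b) density at u in (0, 1)."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * log(u) + (b - 1) * log(1 - u))

def log_liouville_pdf(x, alphas, a, b):
    """log density of X = (X_1,...,X_D) with u = sum(x) < 1."""
    u = sum(x)
    assert 0 < u < 1 and all(xd > 0 for xd in x)
    s = sum(alphas)
    logp = log_beta_pdf(u, a, b)            # generating density f(u|xi)
    logp += lgamma(s) - (s - 1) * log(u)    # Gamma(sum a_d) / u^(sum a_d - 1)
    for xd, ad in zip(x, alphas):
        logp += (ad - 1) * log(xd) - lgamma(ad)
    return logp

# Illustrative evaluation at an arbitrary point with arbitrary parameters.
print(exp(log_liouville_pdf([0.2, 0.3], alphas=[2.0, 3.0], a=2.0, b=2.0)))
```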

The Bayesian mixture model

Bayesian algorithms represent a class of learning techniques that have been intensively studied in the past. The importance of Bayesian approaches has been increasingly acknowledged in recent years, and there is now considerable evidence that Bayesian algorithms are useful in several applications and domains (Robert, 2007). This can be justified by advances in Markov chain Monte Carlo (MCMC) methods, which have made the application of Bayesian approaches feasible and relatively
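As a hedged illustration of why MCMC makes Bayesian inference practical, the minimal Metropolis-Hastings sketch below samples from a toy posterior over a proportion; the Beta(3, 5) target and the step size are invented for the example and are not the paper's model. The key point is that the sampler only requires the posterior up to a normalizing constant.

```python
# Sketch: random-walk Metropolis-Hastings on a toy unnormalized
# posterior over a proportion theta in (0, 1).  Illustrative only.
import random
from math import log

def log_target(theta):
    if not 0 < theta < 1:
        return float("-inf")
    return 2 * log(theta) + 4 * log(1 - theta)   # Beta(3,5) up to a constant

def metropolis(n_samples, step=0.1, seed=0):
    rng = random.Random(seed)
    theta, samples = 0.5, []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0, step)        # random-walk proposal
        # Accept with probability min(1, target ratio).
        if log(rng.random()) < log_target(prop) - log_target(theta):
            theta = prop
        samples.append(theta)
    return samples

draws = metropolis(5000)
print(sum(draws) / len(draws))   # should approach E[Beta(3,5)] = 3/8
```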

Design of experiments

The main goal of our experiments is to compare the performance of the infinite Beta-Liouville mixture (or IBLM, as we shall henceforth refer to it) to two previously proposed models for proportional data, namely the infinite Dirichlet (IDM) and infinite generalized Dirichlet (IGDM) mixtures. We refer the reader to Bouguila and Ziou (2008) and Bouguila and Ziou (2010) for details about the IDM and IGDM, respectively. In this section, we are chiefly concerned with applications which involve

Conclusion

In this paper, we have presented a hierarchical nonparametric Bayesian statistical framework based on infinite Beta-Liouville mixtures for proportional data modeling and classification, motivated by the importance of this kind of data in several applications. Infinite models have many advantages: they are general, consistent, powerful, and extensible and flexible enough to be applied to a variety of learning problems. We estimate the posterior distributions of our model parameters

Acknowledgment

The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC).

References (50)

  • Blumer, A., et al., 1987. Occam's Razor. Inform. Process. Lett.
  • Gupta, R.D., et al., 1987. Multivariate Liouville distributions. J. Multivar. Anal.
  • Tauritz, D.R., et al., 2000. Adaptive information filtering using evolutionary computation. Inform. Sci.
  • Aitchison, J., 1986. The Statistical Analysis of Compositional Data.
  • Allen, T.V., Greiner, R., 2000. Model selection criteria for learning belief nets: an empirical comparison. In: Proc....
  • Antoniak, C.E., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist.
  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
  • Bouchard, G., et al., 2006. Selection of generative models in classification. IEEE Trans. Pattern Anal. Machine Intell.
  • Bouguila, N., et al., 2006. A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Trans. Image Process.
  • Bouguila, N., et al., 2006. Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans. Knowledge Data Eng.
  • Bouguila, N., et al., 2007. High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans. Pattern Anal. Machine Intell.
  • Bouguila, N., Ziou, D., 2008. A Dirichlet process mixture of Dirichlet distributions for classification and prediction....
  • Bouguila, N., et al., 2010. A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling. IEEE Trans. Neural Networks.
  • Bouguila, N., et al., 2004. Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application. IEEE Trans. Image Process.
  • Bouguila, N., et al., 2006. Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statist. Comput.
  • Chai, K.M.A., Ng, H.T., Chieu, H.L., 2002. Bayesian online classifiers for text classification and filtering. In: Proc....
  • Cowles, M.K., et al., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Amer. Statist. Assoc.
  • Cox, D.R., 1990. Role of models in statistical analysis. Statist. Sci.
  • Everitt, B., 1993. Cluster Analysis.
  • Fang, K.T., et al., 1990. Symmetric Multivariate and Related Distributions.
  • Fayyad, U.M., et al., 1996. From digitized images to online catalogs. AI Mag.
  • Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Statist.
  • Geman, S., et al., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell.
  • Ghosh, J.K., et al., 2003. Bayesian Nonparametrics.
  • Gilks, W.R., et al., 1993. Algorithm AS 287: adaptive rejection sampling from log-concave density functions. Appl. Stat.