Penalized factor mixture analysis for variable selection in clustered data
Introduction
Model-based clustering has recently received wide interest in statistics. Under this approach, a sample of observations is assumed to arise from a number of underlying populations with unknown mixing proportions; a distributional form is specified within each population, and the purpose is to decompose the sample into its mixture components (McLachlan and Peel, 2000). For quantitative data each mixture component is usually modeled as a multivariate Gaussian distribution.
When model-based clustering is performed on a large number of observed variables, Gaussian mixture models are well known to be over-parameterized: besides the mixing weights, a mean vector and a covariance matrix must be estimated for each component (McLachlan and Peel, 2000). To avoid over-parameterized solutions and the very computationally intensive procedures they entail, several strategies have been proposed in the statistical literature, aimed either at parameterizing the generic component-covariance matrix (see Banfield and Raftery (1993)) or at performing dimension reduction in each component through latent variables (see, for instance, Ghahramani and Hinton (1997) and McLachlan et al. (2003)).
Since clustering methods are strongly affected by the presence of non-informative variables, there has recently also been increasing interest in variable selection for model-based clustering, mostly within the Bayesian framework (see Liu et al. (2003) and Hoff (2005)), but with contributions in the frequentist one as well (see, for instance, Pan and Shen (2007)). The main idea of these proposals is to parameterize the mean vector of the k-th cluster as μ_k = μ + δ_k, where μ is the global mean. If some components of δ_k are 0, then the corresponding attributes are non-informative for clustering, at least as far as cluster location is concerned. A different approach, followed by Raftery and Dean (2006), considers a stepwise variable selection procedure based on the Bayesian Information Criterion (BIC).
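In L1-penalized formulations of this idea, components of the offsets δ_k are driven exactly to zero by the soft-thresholding operator, which is what turns shrinkage into variable selection. A minimal numpy sketch (the array `delta` and penalty `lam` are illustrative values, not taken from the paper):

```python
import numpy as np

def soft_threshold(x, lam):
    """Componentwise soft-thresholding: the closed-form minimizer of
    0.5 * (z - x)**2 + lam * |z|; entries smaller than lam in absolute
    value are set exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Illustrative cluster-mean offsets delta_k = mu_k - mu for five variables.
# After thresholding, variables whose offsets vanish are flagged as
# non-informative for cluster location.
delta = np.array([2.0, -0.3, 0.05, -1.5, 0.1])
print(soft_threshold(delta, lam=0.5))  # the three small offsets become 0
```

The same operator reappears later as the M-step update for the penalized factor loadings.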
In this paper we address both issues simultaneously by assuming that the data have been generated by a linear factor model with latent variables modeled as Gaussian mixtures and by shrinking the factor loadings, resorting to a penalized likelihood method with an L1 penalty.
The paper is organized as follows. In Section 2 we first briefly review standard model-based clustering and present the dimension reduction approach; afterwards we propose an implementation with an L1 penalty, resulting in soft thresholding of the estimated factor loadings and thus realizing automatic variable selection. In Section 3 the EM algorithm for obtaining the penalized model estimates and a modified BIC criterion for selecting the penalization parameter are illustrated. Section 4 shows the experimental results: the proposed model is evaluated first in a Monte Carlo simulation study and then on a real example of thyroid classification with added irrelevant variables.
Section snippets
Model-based clustering via penalized factor mixture analysis
Let y be a p-dimensional vector of continuous observed variables. According to the model-based approach to clustering, the density of y can be modeled by a mixture of a sufficiently large number of multivariate components, each corresponding to a distinct cluster. The most common choice for the distributional form of the components is the multivariate Gaussian distribution: f(y) = Σ_{k=1}^{K} π_k φ(y; μ_k, Σ_k), where the vector of unknown parameters consists of the mixing
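The mixture density above is a weighted sum of multivariate Gaussian densities. A self-contained numpy sketch that evaluates such a density (the two-component parameters below are illustrative, not from the paper):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Multivariate Gaussian density phi(y; mu, sigma)."""
    d = len(mu)
    diff = y - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def mixture_density(y, weights, means, covs):
    """f(y) = sum_k pi_k * phi(y; mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(y, m, s)
               for w, m, s in zip(weights, means, covs))

# Illustrative two-component bivariate Gaussian mixture.
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_density(np.array([0.0, 0.0]), weights, means, covs))
```

At the origin the first component dominates, so the value is close to 0.4 times the standard bivariate normal density there.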
Maximum penalized likelihood estimation
In order to derive the maximum penalized likelihood estimates of the proposed model parameters, the penalized likelihood function (6) has to be maximized. The estimation problem can be solved using the EM algorithm (Dempster et al., 1977), since the proposed model can be expressed in a simplified hierarchical form, where f_c denotes the complete-data density and z denotes the so-called allocation variable, which derives from modeling the
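The EM skeleton underlying such estimators alternates an E-step, which computes the posterior probabilities of the allocation variables, with an M-step of weighted re-estimation. The sketch below shows only that skeleton for a plain one-dimensional Gaussian mixture; it is not the paper's full algorithm, which additionally updates factor loadings with a soft-thresholding step (function and variable names are illustrative):

```python
import numpy as np

def em_gmm_1d(y, K=2, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture: the E-step computes
    responsibilities r[i, k] = P(z_i = k | y_i); the M-step re-estimates
    weights, means and variances by weighted maximum likelihood."""
    mu = np.quantile(y, np.linspace(0.0, 1.0, K))  # spread-out initial means
    var = np.full(K, y.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities of the allocation variable z_i
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates
        nk = r.sum(axis=0)
        pi = nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Two well-separated clusters around 0 and 5: EM recovers both means.
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(50), np.full(50, 5.0)]) + rng.normal(0, 0.3, 100)
pi, mu, var = em_gmm_1d(y)
```

In the penalized model the M-step for the loadings would replace the plain weighted least-squares update with its soft-thresholded counterpart.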
Experimental results
The EM algorithm for the maximization of the penalized likelihood has been implemented in R code (R Development Core Team, 2008), available from the authors upon request. The effectiveness of the proposed penalized model-based clustering has been evaluated in a Monte Carlo experiment and on a real data set with added noise variables. Results are reported in the following subsections.
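Recovery of the true partition in Monte Carlo experiments of this kind is commonly scored with the adjusted Rand index of Hubert and Arabie (listed in the references); this excerpt does not state which measure the paper's tables use, so the following self-contained implementation is only a sketch of that standard index:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie): chance-corrected agreement
    between two partitions; 1 means identical partitions, values near 0
    indicate agreement no better than chance."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]   # the same partition with labels swapped
print(adjusted_rand_index(true, pred))  # 1.0 (label names do not matter)
```

Label invariance is the point of the index: a clustering is judged by the partition it induces, not by the arbitrary component labels the EM algorithm assigns.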
Concluding remarks
In this paper a penalized factor mixture analysis has been proposed. The approach can be viewed as a particular model-based clustering method which simultaneously performs dimension reduction and variable selection by shrinking the factor loadings through a penalized likelihood method with an L1 penalty. A maximum likelihood estimation procedure, via the EM algorithm, has been developed and illustrated. The proposed approach has been investigated through a Monte Carlo simulation study and
References (30)
- Celeux, G., Govaert, G. Gaussian parsimonious clustering models. Pattern Recognition (1995)
- et al. The application of linear discriminant analysis in the diagnosis of thyroid diseases. Analytica Chimica Acta (1978)
- McLachlan, G.J., Peel, D., Bean, R.W. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis (2003)
- Baek, J., McLachlan, G.J., 2008. Mixtures of factor analyzers with common factor loadings for the clustering and...
- Banfield, J.D., Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993)
- Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B (1977)
- Fan, J., Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association (2001)
- Fraley, C., Raftery, A.E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal (1998)
- Fraley, C., Raftery, A.E. MCLUST: Software for model-based cluster analysis. Journal of Classification (1999)
- Fraley, C., Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association (2002)
- Fraley, C., Raftery, A.E. Enhanced software for model-based clustering, discriminant analysis, and density estimation: MCLUST. Journal of Classification
- Hoff, P.D. Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics (2005)
- Hubert, L., Arabie, P. Comparing partitions. Journal of Classification (1985)
Cited by (22)
- Mixtures of Gaussian copula factor analyzers for clustering high dimensional data. Journal of the Korean Statistical Society (2019)
- Prediction with a flexible finite mixture-of-regressions. Computational Statistics and Data Analysis (2019)
- Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis (2014)
- Bayesian variable selection and model averaging in the arbitrage pricing theory model. Computational Statistics and Data Analysis (2010)
- Clustered Sparse Structural Equation Modeling for Heterogeneous Data. Journal of Classification (2023)