Penalized factor mixture analysis for variable selection in clustered data

https://doi.org/10.1016/j.csda.2009.05.025

Abstract

A model-based clustering approach that simultaneously performs dimension reduction and variable selection is presented. Dimension reduction is achieved by assuming that the data have been generated by a linear factor model with latent variables modeled as Gaussian mixtures. Variable selection is performed by shrinking the factor loadings through a penalized likelihood method with an L1 penalty. A maximum likelihood estimation procedure via the EM algorithm is developed, and a modified BIC criterion for selecting the penalization parameter is illustrated. The effectiveness of the proposed model is explored in a Monte Carlo simulation study and in a real example.

Introduction

Model-based clustering has recently been receiving wide interest in statistics. According to this approach, if a sample of observations arises from some underlying populations of unknown proportions, a distributional form is specified for each of the underlying populations, and the purpose is to decompose the sample into its mixture components (McLachlan and Peel, 2000). For quantitative data, each mixture component is usually modeled as a multivariate Gaussian distribution.

When model-based clustering is performed on a large number of observed variables, it is well known that Gaussian mixture models represent an over-parameterized solution, as, besides the mixing weights, the mean vector and the covariance matrix of each component must be estimated (McLachlan and Peel, 2000). In order to avoid over-parameterized solutions and the very computationally intensive procedures associated with them, several strategies have been proposed in the statistical literature, aimed either at parameterizing the generic component-covariance matrix (see Banfield and Raftery (1993)) or at performing dimension reduction in each component through latent variables (see, for instance, Ghahramani and Hinton (1997) and McLachlan et al. (2003)).

Since clustering methods are strongly impaired by the presence of non-informative variables, there has recently also been increasing interest in variable selection for model-based clustering, mostly within the Bayesian framework (see Liu et al. (2003) and Hoff (2005)), but with contributions also in the frequentist one (see, for instance, Pan and Shen (2007)). The main idea of these proposals is to parameterize the mean vector of the kth cluster as μk = μ + δk, where μ is the global mean. If some components of δk are 0, then the corresponding variables are non-informative for clustering, at least as far as the cluster location is concerned. A different approach, followed by Raftery and Dean (2006), considers a stepwise variable selection procedure based on the Bayesian Information Criterion (BIC).
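As a toy numerical illustration of this parameterization (a sketch with hypothetical values, not taken from the cited papers), a sparse shift vector δk leaves some coordinates of the cluster mean equal to the global mean, so the corresponding variables carry no cluster-location information:

```r
# Toy illustration of the mean parameterization mu_k = mu + delta_k.
# All values are hypothetical.
mu      <- c(0, 0, 0, 0)      # global mean
delta_k <- c(1.5, 0, -2, 0)   # sparse cluster-specific shift
mu_k    <- mu + delta_k       # mean vector of the kth cluster
which(delta_k == 0)           # variables 2 and 4 are non-informative
```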

In this paper we address both issues simultaneously: we assume that the data have been generated by a linear factor model with latent variables modeled as Gaussian mixtures, and we shrink the factor loadings through a penalized likelihood method with an L1 penalty.
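In schematic form (the notation here is introduced only for illustration; the full specification is given in Section 2), the observed vector arises from a linear factor model whose latent factors follow a Gaussian mixture, and the L1 penalty acts on the entries of the loading matrix:

```latex
% Schematic statement of the model; the exact parameterization is in Section 2.
y = \Lambda z + e, \qquad
z \sim \sum_{i=1}^{k} w_i \, \mathcal{N}(\mu_i, \Sigma_i), \qquad
e \sim \mathcal{N}(0, \Psi),

\ell_P(\theta) = \sum_{n=1}^{N} \log f(y_n; \theta)
                 \;-\; \lambda \sum_{j=1}^{p} \sum_{q=1}^{r} |\lambda_{jq}| .
```

Under this scheme, shrinking the whole jth row of the p×r loading matrix Λ to zero detaches the jth variable from the latent factors, which is how the penalty realizes variable selection.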

The paper is organized as follows. In Section 2 we first briefly review standard model-based clustering and present the dimension reduction approach; we then propose an implementation with an L1 penalty, which results in soft thresholding of the estimated factor loadings and thus realizes automatic variable selection. In Section 3 the EM algorithm for obtaining the penalized model estimates and a modified BIC criterion for selecting the penalization parameter are illustrated. Section 4 presents the experimental results: the proposed model is first evaluated in a Monte Carlo simulation study and then on a real example on thyroid disease classification with added irrelevant variables.
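For reference, the soft thresholding induced by an L1 penalty takes the standard closed form sketched below (a minimal illustration with hypothetical values; in the paper the operator is applied to the estimated factor loadings within the EM iterations):

```r
# Standard soft-thresholding operator associated with an L1 penalty.
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

loadings <- c(-0.9, 0.05, 0.4, -0.02)   # hypothetical loading estimates
soft_threshold(loadings, lambda = 0.1)  # small loadings are set exactly to 0
```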

Section snippets

Model-based clustering via penalized factor mixture analysis

Let y be a p-dimensional vector of continuous observed variables. According to the model-based approach to clustering, the density of y can be modeled by a mixture of a sufficiently large number k of multivariate components, each of which corresponds to a single cluster. The most common choice for the distributional form of the components is the multivariate Gaussian distribution: f(y; θ̃) = Σ_{i=1}^{k} w_i ϕ(y; μ̃_i, Σ̃_i), where the vector θ̃ of unknown parameters consists of the mixing weights w_i and the component-specific mean vectors μ̃_i and covariance matrices Σ̃_i.
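A minimal numerical sketch of this mixture density, assuming the mvtnorm R package and hypothetical parameter values:

```r
# Evaluate the k-component Gaussian mixture density f(y; theta).
# Requires the mvtnorm package; all parameter values are hypothetical.
library(mvtnorm)

mixture_density <- function(y, w, mu, Sigma) {
  # w: mixing weights; mu, Sigma: lists of component means and covariances
  sum(sapply(seq_along(w), function(i)
    w[i] * dmvnorm(y, mean = mu[[i]], sigma = Sigma[[i]])))
}

w     <- c(0.6, 0.4)
mu    <- list(c(0, 0), c(3, 3))
Sigma <- list(diag(2), diag(2))
mixture_density(c(1, 1), w, mu, Sigma)
```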

Maximum penalized likelihood estimation

In order to derive the maximum penalized likelihood estimates of the proposed model parameters, the penalized likelihood function (6) has to be maximized. The maximum likelihood estimation problem can be solved using the EM algorithm (Dempster et al., 1977), since the proposed model can be expressed in a simplified hierarchical form: f(y, z, s; θ) = f(y | z; θ) f(z | s; θ) f(s; θ), where f(y, z, s; θ) is the complete-data density and s denotes the so-called allocation variable, which derives from modeling the latent variables as Gaussian mixtures: s indicates the mixture component from which each unit has been generated.
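As a simplified illustration of the role of the allocation variable, the E-step of such an algorithm computes posterior allocation probabilities. The sketch below does this for a plain Gaussian mixture; the full EM in the paper also handles the latent factors z and applies the L1 penalty in the M-step:

```r
# E-step for the allocation variable s: posterior probability that a unit y
# was generated by each mixture component (plain Gaussian mixture sketch).
library(mvtnorm)

e_step <- function(y, w, mu, Sigma) {
  joint <- sapply(seq_along(w), function(i)
    w[i] * dmvnorm(y, mean = mu[[i]], sigma = Sigma[[i]]))
  joint / sum(joint)   # responsibilities, summing to 1 over components
}
```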

Experimental results

The EM algorithm for the maximization of the penalized likelihood has been implemented in R (R Development Core Team, 2008); the code is available from the authors upon request. The effectiveness of the proposed penalized model-based clustering has been evaluated in a Monte Carlo experiment and on a real data set with added noise variables. Results are reported in the following subsections.
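Although the authors' code is not publicly released, the selection of the penalization parameter could be organized along the following lines (a hedged sketch: `pfma_fit` and `modified_bic` are hypothetical stand-ins for the fitting routine and the modified BIC criterion of Section 3, not real functions):

```r
# Hypothetical sketch of choosing lambda by a modified BIC over a grid.
# `pfma_fit` and `modified_bic` are placeholder names, not a released API.
lambda_grid <- 10^seq(-3, 0, length.out = 20)
fits <- lapply(lambda_grid, function(l)
  pfma_fit(Y, k = 3, r = 2, lambda = l))               # fit for each lambda
best <- fits[[which.min(sapply(fits, modified_bic))]]  # smallest modified BIC
```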

Concluding remarks

In this paper a penalized factor mixture analysis has been proposed. The approach can be viewed as a model-based clustering approach that simultaneously performs dimension reduction and variable selection by shrinking the factor loadings through a penalized likelihood method with an L1 penalty. A maximum likelihood estimation procedure, via the EM algorithm, has been developed and illustrated. The proposed approach has been investigated through a Monte Carlo simulation study and a real example on thyroid disease classification.

References

  • G. Celeux et al.

    Gaussian parsimonious clustering models

    Pattern Recognition

    (1995)
  • D. Coomans et al.

    The application of linear discriminant analysis in the diagnosis of thyroid diseases

    Analytica Chimica Acta

    (1978)
  • G.J. McLachlan et al.

    Modelling high-dimensional data by mixtures of factor analyzers

    Computational Statistics and Data Analysis

    (2003)
  • Baek, J., McLachlan, G.J., 2008. Mixtures of factor analyzers with common factor loadings for the clustering and...
  • J.D. Banfield et al.

    Model-based Gaussian and non-Gaussian clustering

    Biometrics

    (1993)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm (with discussion)

    Journal of the Royal Statistical Society B

    (1977)
  • J. Fan et al.

    Variable selection via nonconcave penalized likelihood and its oracle properties

    Journal of the American Statistical Association

    (2001)
  • C. Fraley et al.

    How many clusters? Which clustering methods? Answers via model-based cluster analysis

    The Computer Journal

    (1998)
  • C. Fraley et al.

    MCLUST: Software for model-based cluster analysis

    Journal of Classification

    (1999)
  • C. Fraley et al.

    Model-based clustering, discriminant analysis, and density estimation

    Journal of the American Statistical Association

    (2002)
  • Fraley, C., Raftery, A.E., 2002. MCLUST: Software for model-based clustering, discriminant analysis, and density...
  • C. Fraley et al.

    Enhanced Software for model-based clustering, discriminant analysis, and density estimation: MCLUST

    Journal of Classification

    (2003)
  • Ghahramani, Z., Hinton, G.E., 1997. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1,...
  • P.D. Hoff

    Subset clustering of binary sequences, with an application to genomic abnormality data

    Biometrics

    (2005)
  • L. Hubert et al.

    Comparing partitions

    Journal of Classification

    (1985)