Penalized factor mixture analysis for variable selection in clustered data
Introduction
Model-based clustering has recently received wide interest in statistics. Under this approach, a sample of observations is assumed to arise from a number of underlying populations with unknown mixing proportions; a distributional form is specified within each population, and the purpose is to decompose the sample into its mixture components (McLachlan and Peel, 2000). For quantitative data each mixture component is usually modeled as a multivariate Gaussian distribution.
When model-based clustering is performed on a large number of observed variables, Gaussian mixture models are well known to be over-parameterized: besides the mixing weights, a mean vector and a covariance matrix must be estimated for each component (McLachlan and Peel, 2000). To avoid over-parameterized solutions and the very computationally intensive procedures they entail, several strategies have been proposed in the statistical literature, aimed either at parameterizing the generic component-covariance matrix (see Banfield and Raftery (1993)) or at performing dimension reduction in each component through latent variables (see, for instance, Ghahramani and Hinton (1997) and McLachlan et al. (2003)).
Since clustering methods are strongly affected by the presence of non-informative variables, there has recently also been increasing interest in variable selection for model-based clustering, mostly within the Bayesian framework (see Liu et al. (2003) and Hoff (2005)), but with contributions in the frequentist one as well (see, for instance, Pan and Shen (2007)). The main idea of these proposals is to parameterize the mean vector of the k-th cluster as μ_k = μ + δ_k, where μ is the global mean. If some components of δ_k are 0, then the corresponding attributes are non-informative for clustering, at least as far as cluster location is concerned. A different approach, followed by Raftery and Dean (2006), considers a stepwise variable selection procedure based on the Bayesian Information Criterion (BIC).
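In L1-penalized formulations of this idea, components of the offsets δ_k are driven exactly to zero by the soft-thresholding operator, which is what turns shrinkage into variable selection. A minimal numpy sketch (the array `delta` and penalty `lam` are illustrative values, not taken from the paper):

```python
import numpy as np

def soft_threshold(x, lam):
    """Componentwise soft-thresholding: the closed-form minimizer of
    0.5 * (z - x)**2 + lam * |z|; entries smaller than lam in absolute
    value are set exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Illustrative cluster-mean offsets delta_k = mu_k - mu for five variables.
# After thresholding, variables whose offsets vanish are flagged as
# non-informative for cluster location.
delta = np.array([2.0, -0.3, 0.05, -1.5, 0.1])
print(soft_threshold(delta, lam=0.5))  # the three small offsets become 0
```

The same operator reappears later as the M-step update for the penalized factor loadings.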
In this paper we address both issues simultaneously by assuming that the data have been generated by a linear factor model with latent variables modeled as Gaussian mixtures and by shrinking the factor loadings, resorting to a penalized likelihood method with an L1 penalty.
The paper is organized as follows. In Section 2 we first briefly review standard model-based clustering and present the dimension reduction approach; afterwards we propose an implementation with an L1 penalty, resulting in soft thresholding of the estimated factor loadings and thus realizing automatic variable selection. In Section 3 the EM algorithm for obtaining the penalized model estimates and a modified BIC criterion for selecting the penalization parameter are illustrated. Section 4 shows the experimental results: the proposed model is evaluated first in a Monte Carlo simulation study and then on a real example of thyroid classification with added irrelevant variables.
Section snippets
Model-based clustering via penalized factor mixture analysis
Let y be a p-dimensional vector of continuous observed variables. According to the model-based approach to clustering, the density of y can be modeled by a mixture of a sufficiently large number of multivariate components, each corresponding to a distinct cluster. The most common choice for the distributional form of the components is the multivariate Gaussian distribution: f(y) = Σ_{k=1}^{K} π_k φ(y; μ_k, Σ_k), where the vector of unknown parameters consists of the mixing
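The mixture density above is a weighted sum of multivariate Gaussian densities. A self-contained numpy sketch that evaluates such a density (the two-component parameters below are illustrative, not from the paper):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Multivariate Gaussian density phi(y; mu, sigma)."""
    d = len(mu)
    diff = y - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def mixture_density(y, weights, means, covs):
    """f(y) = sum_k pi_k * phi(y; mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(y, m, s)
               for w, m, s in zip(weights, means, covs))

# Illustrative two-component bivariate Gaussian mixture.
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_density(np.array([0.0, 0.0]), weights, means, covs))
```

At the origin the first component dominates, so the value is close to 0.4 times the standard bivariate normal density there.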
Maximum penalized likelihood estimation
In order to derive the maximum penalized likelihood estimates of the proposed model parameters, the penalized likelihood function (6) has to be maximized. The estimation problem can be solved using the EM algorithm (Dempster et al., 1977), since the proposed model can be expressed in a simplified hierarchical form, where f_c denotes the complete-data density and z denotes the so-called allocation variable, which derives from modeling the
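The EM skeleton underlying such estimators alternates an E-step, which computes the posterior probabilities of the allocation variables, with an M-step of weighted re-estimation. The sketch below shows only that skeleton for a plain one-dimensional Gaussian mixture; it is not the paper's full algorithm, which additionally updates factor loadings with a soft-thresholding step (function and variable names are illustrative):

```python
import numpy as np

def em_gmm_1d(y, K=2, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture: the E-step computes
    responsibilities r[i, k] = P(z_i = k | y_i); the M-step re-estimates
    weights, means and variances by weighted maximum likelihood."""
    mu = np.quantile(y, np.linspace(0.0, 1.0, K))  # spread-out initial means
    var = np.full(K, y.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities of the allocation variable z_i
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates
        nk = r.sum(axis=0)
        pi = nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Two well-separated clusters around 0 and 5: EM recovers both means.
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(50), np.full(50, 5.0)]) + rng.normal(0, 0.3, 100)
pi, mu, var = em_gmm_1d(y)
```

In the penalized model the M-step for the loadings would replace the plain weighted least-squares update with its soft-thresholded counterpart.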
Experimental results
The EM algorithm for the maximization of the penalized likelihood has been implemented in R code (R Development Core Team, 2008), available from the authors upon request. The effectiveness of the proposed penalized model-based clustering has been evaluated in a Monte Carlo experiment and on a real data set with added noise variables. Results are reported in the following subsections.
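Recovery of the true partition in Monte Carlo experiments of this kind is commonly scored with the adjusted Rand index of Hubert and Arabie (listed in the references); this excerpt does not state which measure the paper's tables use, so the following self-contained implementation is only a sketch of that standard index:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie): chance-corrected agreement
    between two partitions; 1 means identical partitions, values near 0
    indicate agreement no better than chance."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]   # the same partition with labels swapped
print(adjusted_rand_index(true, pred))  # 1.0 (label names do not matter)
```

Label invariance is the point of the index: a clustering is judged by the partition it induces, not by the arbitrary component labels the EM algorithm assigns.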
Concluding remarks
In this paper a penalized factor mixture analysis has been proposed. The approach can be viewed as a particular model-based clustering method which simultaneously performs dimension reduction and variable selection by shrinking the factor loadings through a penalized likelihood method with an L1 penalty. A maximum likelihood estimation procedure, via the EM algorithm, has been developed and illustrated. The proposed approach has been investigated through a Monte Carlo simulation study and
References (30)
- Celeux, G., Govaert, G. Gaussian parsimonious clustering models. Pattern Recognition (1995)
- et al. The application of linear discriminant analysis in the diagnosis of thyroid diseases. Analytica Chimica Acta (1978)
- McLachlan, G.J., Peel, D., Bean, R.W. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis (2003)
- Baek, J., McLachlan, G.J., 2008. Mixtures of factor analyzers with common factor loadings for the clustering and...
- Banfield, J.D., Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993)
- Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B (1977)
- Fan, J., Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association (2001)
- Fraley, C., Raftery, A.E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal (1998)
- Fraley, C., Raftery, A.E. MCLUST: Software for model-based cluster analysis. Journal of Classification (1999)
- Fraley, C., Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association (2002)
- Fraley, C., Raftery, A.E. Enhanced software for model-based clustering, discriminant analysis, and density estimation: MCLUST. Journal of Classification
- Hoff, P.D. Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics (2005)
- Hubert, L., Arabie, P. Comparing partitions. Journal of Classification (1985)
Cited by (22)
- Mixtures of Gaussian copula factor analyzers for clustering high dimensional data. Journal of the Korean Statistical Society (2019)
- Prediction with a flexible finite mixture-of-regressions. Computational Statistics and Data Analysis (2019)
- Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis (2014)
- Bayesian variable selection and model averaging in the arbitrage pricing theory model. Computational Statistics and Data Analysis (2010)
- Clustered Sparse Structural Equation Modeling for Heterogeneous Data. Journal of Classification (2023)