Dimension reduction for model-based clustering via mixtures of multivariate $t$-distributions

Morris, Katherine; McNicholas, Paul D.; Scrucca, Luca

doi:10.1007/s11634-013-0137-3

Regular Article
Published: 14 June 2013

Volume 7, pages 321–338 (2013)

Advances in Data Analysis and Classification

Katherine Morris¹,
Paul D. McNicholas¹ &
Luca Scrucca²

Abstract

We introduce a dimension reduction method for model-based clustering obtained from a finite mixture of $t$-distributions. This approach is based on existing work on reducing dimensionality in the case of finite Gaussian mixtures. The method relies on identifying a reduced subspace of the data by considering the extent to which group means and group covariances vary. This subspace contains linear combinations of the original data, which are ordered by importance via the associated eigenvalues. Observations can be projected onto the subspace and the resulting set of variables captures most of the clustering structure available in the data. The approach is illustrated using simulated and real data, where it outperforms its Gaussian analogue.
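The abstract does not state the formulas involved, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the kernel matrix commonly associated with the Gaussian approach this work builds on, $M = M_I \Sigma^{-1} M_I + M_{II}$, where $M_I$ measures variation in group means and $M_{II}$ variation in group covariances, with directions obtained from the generalized eigenproblem $M v = \lambda \Sigma v$. It also assumes hard cluster labels and plain sample moments in place of a fitted $t$-mixture. The function name `mixture_dr_directions` is hypothetical.

```python
import numpy as np

def mixture_dr_directions(X, labels):
    """Estimate ordered clustering directions from hard cluster labels.

    Assumed kernel (not given in the abstract):
        M = M_I Sigma^{-1} M_I + M_II
    Directions solve the generalized eigenproblem M v = lambda Sigma v.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)  # marginal covariance
    Sigma_inv = np.linalg.inv(Sigma)

    # Group proportions, means, and covariances from the hard labels.
    props, means, covs = [], [], []
    for g in np.unique(labels):
        Xg = X[labels == g]
        props.append(len(Xg) / n)
        means.append(Xg.mean(axis=0))
        covs.append(np.cov(Xg, rowvar=False, bias=True))
    pooled = sum(pi * S for pi, S in zip(props, covs))

    # Variation in group means and in group covariances.
    M_I = sum(pi * np.outer(m - mu, m - mu) for pi, m in zip(props, means))
    M_II = sum(pi * (S - pooled) @ Sigma_inv @ (S - pooled).T
               for pi, S in zip(props, covs))
    M = M_I @ Sigma_inv @ M_I + M_II

    # Solve M v = lambda Sigma v by whitening with the Cholesky factor of Sigma.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    A = L_inv @ M @ L_inv.T
    A = (A + A.T) / 2  # symmetrize against rounding error
    evals, U = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    V = L_inv.T @ U[:, order]        # columns satisfy V' Sigma V = I
    return evals[order], V
```

Projecting the centred data with `(X - X.mean(axis=0)) @ V[:, :d]` gives the first `d` reduced variables, ordered by eigenvalue. How many directions to retain is left open in this sketch.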

Acknowledgments

The authors thank Dr. Jeffrey Andrews for running the MM$t$FA models and providing the results reported herein. The authors are grateful to a guest editor and two anonymous reviewers for their very helpful comments and suggestions.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Guelph, Ontario, Canada
Katherine Morris & Paul D. McNicholas
Dipartimento di Economia, Finanza e Statistica, Università degli Studi di Perugia, Perugia, Italy
Luca Scrucca

Corresponding author

Correspondence to Paul D. McNicholas.

Additional information

This work was supported by a Queen Elizabeth II Scholarship in Science and Technology (Morris), as well as an Early Researcher Award from the government of Ontario and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (McNicholas).

About this article

Cite this article

Morris, K., McNicholas, P.D. & Scrucca, L. Dimension reduction for model-based clustering via mixtures of multivariate $t$-distributions. Adv Data Anal Classif 7, 321–338 (2013). https://doi.org/10.1007/s11634-013-0137-3

Received: 28 November 2012
Revised: 05 May 2013
Accepted: 27 May 2013
Published: 14 June 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s11634-013-0137-3

Mathematics Subject Classification

62H30
