Abstract
A major challenge in model-based clustering is the rapid growth in the number of free parameters as data dimensionality increases. To combat this issue, parsimonious methods allow component covariance matrices to share parameters by exploiting geometric redundancies. The present work considers an additional level of intracluster structure that also captures hybridisation of mean and covariance parameters between components for the multivariate normal distribution. We posit components with heterogeneous parameterisation: a subset are factor components with explicit mean and covariance parameters, and the remainder are hybrid components whose means and covariances are implied by a set of factor loadings that weight the factor component parameters. An estimation procedure based on the Expectation-Maximization algorithm is provided, and the model is compared with Gaussian mixture models with parsimonious covariances on a collection of datasets.
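The two ideas in the abstract can be illustrated with a minimal sketch. The first function counts the free parameters of an unconstrained Gaussian mixture, which is a standard result. The second shows one possible reading of "implied" hybrid parameters, assuming a hybrid component is a weighted sum of independent factor components; this linear form, and the names `full_gmm_free_parameters` and `hybrid_moments`, are illustrative assumptions and not taken from the paper.

```python
def full_gmm_free_parameters(n_components, dim):
    """Free parameters of a Gaussian mixture with unconstrained covariances."""
    weights = n_components - 1                           # mixing proportions sum to 1
    means = n_components * dim                           # one mean vector per component
    covariances = n_components * dim * (dim + 1) // 2    # symmetric matrices
    return weights + means + covariances


def hybrid_moments(loadings, factor_means, factor_covs):
    """Mean and covariance of sum_j a_j * F_j for independent F_j ~ N(mu_j, Sigma_j).

    ASSUMPTION: the hybrid component is modelled as this linear combination.
    Under that assumption the mean is sum_j a_j * mu_j and the covariance is
    sum_j a_j**2 * Sigma_j.
    """
    dim = len(factor_means[0])
    mean = [
        sum(a * mu[i] for a, mu in zip(loadings, factor_means))
        for i in range(dim)
    ]
    cov = [
        [
            sum(a * a * sigma[i][k] for a, sigma in zip(loadings, factor_covs))
            for k in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


if __name__ == "__main__":
    # Parameter growth with dimension for a 3-component mixture.
    for p in (2, 4, 10, 50):
        print(p, full_gmm_free_parameters(3, p))

    # A hypothetical hybrid sitting halfway between two factor components.
    mean, cov = hybrid_moments(
        loadings=[0.5, 0.5],
        factor_means=[[0.0, 0.0], [2.0, 4.0]],
        factor_covs=[[[1.0, 0.0], [0.0, 1.0]], [[3.0, 1.0], [1.0, 3.0]]],
    )
    print(mean, cov)
```

Under this assumed form, a hybrid component contributes only its loadings as free parameters rather than a full mean vector and covariance matrix, which is where the parsimony would come from.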
Cite this article
Hou-Liu, J., Browne, R.P. Factor and hybrid components for model-based clustering. Adv Data Anal Classif 16, 373–398 (2022). https://doi.org/10.1007/s11634-021-00483-2
DOI: https://doi.org/10.1007/s11634-021-00483-2
Keywords
- Model-based clustering
- Intracluster structure
- Factor loadings
- Multivariate normal
- Mixture model
- Expectation-maximization