Abstract
Cluster analysis is a central problem in unsupervised machine learning, and model-based clustering, which relies on finite mixture models, is one of the most popular clustering techniques. Once a mixture model has been fitted, a natural question is how many misclassifications the resulting partition contains. Yet the literature devoted to diagnostic tools for an obtained clustering solution is rather limited. In this paper, an algorithm is developed for efficiently estimating the misclassification probability. The confusion probability map and the classification confidence region are proposed for predicting the confusion matrix, identifying which cluster causes the most confusion, and understanding the distribution of misclassifications. Applications to real-life datasets illustrate the developed technique with promising results.
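To make the notion of a predicted confusion matrix concrete, the following is a minimal sketch of how misclassification probabilities can be approximated for a fitted Gaussian mixture. It is not the algorithm developed in the paper: it uses plain Monte Carlo simulation, and the function names (`gaussian_logpdf`, `confusion_probabilities`, `overall_misclassification`) and the two-component example parameters are assumptions made for illustration only.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of x."""
    d = mean.shape[0]
    diff = x - mean
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, diff.T)          # shape (d, n)
    maha = np.sum(sol ** 2, axis=0)              # squared Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def confusion_probabilities(weights, means, covs, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(assigned to h | generated by j).

    Hypothetical helper: points are simulated from each component and
    assigned by the maximum-posterior (Bayes) rule of the mixture.
    """
    rng = np.random.default_rng(seed)
    k = len(weights)
    conf = np.zeros((k, k))
    for j in range(k):
        x = rng.multivariate_normal(means[j], covs[j], size=n_draws)
        scores = np.column_stack([
            np.log(weights[h]) + gaussian_logpdf(x, means[h], covs[h])
            for h in range(k)
        ])
        assigned = scores.argmax(axis=1)
        conf[j] = np.bincount(assigned, minlength=k) / n_draws
    return conf


def overall_misclassification(weights, conf):
    """Mixture-weighted probability that a point is assigned to the wrong component."""
    return float(1.0 - np.sum(np.asarray(weights) * np.diag(conf)))


# Illustrative (made-up) parameters: two equally weighted bivariate components.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])

conf = confusion_probabilities(weights, means, covs)
overall = overall_misclassification(weights, conf)
print(conf)
print(overall)
```

Row j of `conf` estimates how points generated by component j are distributed over the assigned clusters, so large off-diagonal entries indicate which pairs of clusters are confused most often. In this symmetric example the exact misclassification probability is Φ(−1) ≈ 0.159, which gives a reference value for the simulated estimate.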
Acknowledgements
The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.
Cite this article
Zhu, X. Probability of misclassification in model-based clustering. Comput Stat 34, 1427–1442 (2019). https://doi.org/10.1007/s00180-019-00868-0