Probability of misclassification in model-based clustering

Zhu, Xuwen

doi:10.1007/s00180-019-00868-0

Short Note
Published: 24 January 2019

Volume 34, pages 1427–1442, (2019)
Computational Statistics

Xuwen Zhu ORCID: orcid.org/0000-0002-7644-2695¹

Abstract

Cluster analysis is an important problem in unsupervised machine learning. Model-based clustering, one of the most popular clustering techniques, is based on finite mixture models. Once a mixture model has been fitted, a natural question is how many misclassifications the resulting partition contains. At the same time, rather limited literature is devoted to developing diagnostic tools for an obtained clustering solution. In this paper, an algorithm is developed for efficiently estimating the misclassification probability. The confusion probability map and the classification confidence region are proposed for predicting the confusion matrix, identifying which cluster causes the most confusion, and understanding the distribution of misclassifications. Applications to real-life datasets illustrate the developed technique with promising results.
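To make the quantity in question concrete, the following is a minimal sketch of a plain Monte Carlo estimate of the confusion probabilities for a one-dimensional Gaussian mixture with known parameters. It is not the algorithm developed in the paper, whose details are not given in this abstract. The function name `confusion_probabilities`, the example parameter values, and the restriction to univariate Gaussian components are all assumptions made for illustration.

```python
import numpy as np


def confusion_probabilities(weights, means, sds, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(assigned to j | generated by k).

    Hypothetical helper for illustration only: assumes a 1-D Gaussian
    mixture with known weights, means and standard deviations.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    n_comp = len(weights)

    conf = np.zeros((n_comp, n_comp))
    for k in range(n_comp):
        # Draw observations from component k.
        x = rng.normal(means[k], sds[k], size=n_draws)
        # Log of (weight * normal density) for every component, up to a
        # constant shared by all components.
        z = (x[:, None] - means[None, :]) / sds[None, :]
        log_post = np.log(weights)[None, :] - np.log(sds)[None, :] - 0.5 * z**2
        # Assign each draw to the component with the largest posterior.
        assigned = np.argmax(log_post, axis=1)
        conf[k] = np.bincount(assigned, minlength=n_comp) / n_draws

    # Overall misclassification probability: weighted off-diagonal mass.
    overall = float(np.sum(weights * (1.0 - np.diag(conf))))
    return conf, overall


# Assumed example: two equally weighted unit-variance components.
conf, overall = confusion_probabilities([0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
print(conf)
print(overall)
```

Row k of `conf` estimates how observations generated by component k are distributed over the assigned clusters, so large off-diagonal entries point to the pairs of clusters that are most often confused.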

Acknowledgements

The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.

Author information

Authors and Affiliations

Department of Mathematics, The University of Louisville, Louisville, KY, 40208, USA
Xuwen Zhu

Corresponding author

Correspondence to Xuwen Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, X. Probability of misclassification in model-based clustering. Comput Stat 34, 1427–1442 (2019). https://doi.org/10.1007/s00180-019-00868-0

Received: 12 August 2017
Accepted: 16 January 2019
Published: 24 January 2019
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s00180-019-00868-0
