Probability of misclassification in model-based clustering

  • Short Note
  • Journal: Computational Statistics

Abstract

Cluster analysis is an important problem in unsupervised machine learning. Model-based clustering, which relies on finite mixture models, is one of the most popular clustering techniques. Once a mixture model has been fitted, a natural question is how many misclassifications the resulting partition contains. However, the literature devoted to diagnostic tools for an obtained clustering solution is rather limited. In this paper, an algorithm is developed for efficiently estimating the misclassification probability. The confusion probability map and the classification confidence region are proposed for predicting the confusion matrix, identifying which cluster causes the most confusion, and understanding the distribution of misclassifications. Applications to real-life datasets illustrate the developed technique with promising results.
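
As a rough illustration of the quantity discussed in the abstract, the sketch below estimates a matrix of confusion probabilities for a mixture by plain Monte Carlo simulation: points are drawn from each component and assigned to the component with the largest posterior probability. This is not the algorithm developed in the paper, whose details are not available in this preview. The one-dimensional Gaussian mixture, its parameter values, and all function names (`normal_pdf`, `map_label`, `estimate_confusion`, `overall_misclassification`) are assumptions made purely for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def map_label(x, weights, means, sds):
    """Assign x to the component with the largest posterior probability."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(scores)), key=scores.__getitem__)


def estimate_confusion(weights, means, sds, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(assigned to j | generated by i)."""
    rng = random.Random(seed)
    k = len(weights)
    counts = [[0] * k for _ in range(k)]
    for i in range(k):
        for _ in range(n_draws):
            x = rng.gauss(means[i], sds[i])
            counts[i][map_label(x, weights, means, sds)] += 1
    return [[c / n_draws for c in row] for row in counts]


def overall_misclassification(weights, confusion):
    """Weighted off-diagonal mass: probability that a random point is misassigned."""
    return sum(w * (1.0 - confusion[i][i]) for i, w in enumerate(weights))


# Hypothetical fitted mixture: two overlapping components and one isolated one.
weights = [0.5, 0.3, 0.2]
means = [0.0, 3.0, 10.0]
sds = [1.0, 1.0, 1.0]

confusion = estimate_confusion(weights, means, sds)
error_rate = overall_misclassification(weights, confusion)

for row in confusion:
    print(" ".join(f"{p:.4f}" for p in row))
print(f"estimated misclassification probability: {error_rate:.4f}")
```

Each row of `confusion` corresponds to the generating component and each column to the assigned one, so the largest off-diagonal entries indicate which pairs of clusters are most easily confused. In this hypothetical setup the first two components overlap, while the third is nearly always recovered correctly.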

Acknowledgements

The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.

Author information

Corresponding author

Correspondence to Xuwen Zhu.

About this article

Cite this article

Zhu, X. Probability of misclassification in model-based clustering. Comput Stat 34, 1427–1442 (2019). https://doi.org/10.1007/s00180-019-00868-0
