Comparison of three hypothesis testing approaches for the selection of the appropriate number of clusters of variables

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Within the scope of cluster analysis of variables, selecting the appropriate number of clusters is of paramount interest. The strategy adopted herein for determining this number is based on a hypothesis testing approach. It consists of testing whether the variation of a partition quality criterion between two consecutive partitions departs significantly from the variation expected under the null hypothesis, which stipulates a lack of structure. Three hypothesis testing strategies are detailed and compared in the scope of clustering of variables: Gap, Weighted Gap and a statistic associated with the CLV methodology. Finally, an illustration is presented based on data from a preference study.
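
To make the testing idea concrete, the following is a minimal sketch of the standard Gap statistic, under several assumptions that are not taken from the article: it clusters observations rather than variables, it uses a plain k-means within-cluster sum of squares as the partition quality criterion, and it simulates the "lack of structure" null hypothesis by drawing points uniformly over the bounding box of the data. The function names `kmeans_wss`, `gap_statistic` and `demo_points` are hypothetical, and neither the Weighted Gap nor the CLV-based statistic is reproduced here.

```python
import math
import random


def kmeans_wss(points, k, rng, n_init=10, n_iter=50):
    """Smallest within-cluster sum of squares over n_init k-means runs."""
    best = math.inf
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random starting centers
        for _ in range(n_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
                groups[nearest].append(p)
            # Recompute each center as the mean of its group (keep it if empty).
            updated = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if updated == centers:
                break
            centers = updated
        wss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wss)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (selected k, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, sds = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Null hypothesis (assumption): uniform data, no cluster structure.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        sds.append(sd * math.sqrt(1 + 1 / n_ref))
    # Smallest k whose gap is not clearly exceeded by the next partition's gap.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sds[k]:
            return k, gaps
    return k_max, gaps


def demo_points(seed=1, per_cluster=30):
    """Hypothetical toy data: three tight, well-separated 2-D groups."""
    rng = random.Random(seed)
    means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (rng.gauss(mx, 0.5), rng.gauss(my, 0.5))
        for mx, my in means
        for _ in range(per_cluster)
    ]
```

A call such as `gap_statistic(demo_points(), k_max=4, n_ref=8)` returns the selected number of clusters together with the gap values for each candidate partition. The stopping rule compares two consecutive partitions, which mirrors the "variation between two consecutive partitions" described in the abstract, although the criterion and the null model used here are only illustrative choices.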

Author information

Corresponding author

Correspondence to Véronique Cariou.

About this article

Cite this article

Cariou, V., Verdun, S., Diaz, E. et al. Comparison of three hypothesis testing approaches for the selection of the appropriate number of clusters of variables. Adv Data Anal Classif 3, 227–241 (2009). https://doi.org/10.1007/s11634-009-0047-6

  • DOI: https://doi.org/10.1007/s11634-009-0047-6
