Abstract
In cluster analysis of variables, selecting the appropriate number of clusters is a central concern. The strategy adopted here is based on hypothesis testing: it consists of testing whether the variation of a partition quality criterion between two consecutive partitions departs markedly from the variation expected under a null hypothesis of no cluster structure. Three hypothesis testing strategies are detailed and compared in the context of clustering of variables: the Gap statistic, the Weighted Gap statistic, and a statistic associated with the CLV methodology. Finally, an illustration based on data from a preference study is presented.
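To make the testing idea concrete, the following is a minimal sketch of the usual Gap-statistic decision rule, not the authors' implementation. It assumes a dispersion criterion W_K (smaller is better) has already been computed for K = 1..K_max on the observed data and on B reference data sets simulated under the no-structure null. How W_K is obtained for clusters of variables is not specified here, and the names `select_k_by_gap`, `w_obs` and `w_null` are hypothetical.

```python
import numpy as np


def select_k_by_gap(w_obs, w_null):
    """Pick the number of clusters with the usual Gap-statistic rule.

    w_obs  : array of shape (K_max,), dispersion criterion for K = 1..K_max
             on the observed data (assumed precomputed).
    w_null : array of shape (B, K_max), the same criterion on B reference
             data sets simulated under the no-structure null hypothesis.
    Returns (selected K, gap values, adjusted standard errors).
    """
    w_obs = np.asarray(w_obs, dtype=float)
    w_null = np.asarray(w_null, dtype=float)
    n_ref = w_null.shape[0]

    log_null = np.log(w_null)
    # Gap(K): expected log-dispersion under the null minus the observed one.
    gap = log_null.mean(axis=0) - np.log(w_obs)
    # Standard error, inflated to account for the simulation error.
    s = log_null.std(axis=0) * np.sqrt(1.0 + 1.0 / n_ref)

    # Smallest K such that Gap(K) >= Gap(K + 1) - s(K + 1).
    for k in range(len(gap) - 1):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k + 1, gap, s
    # No K satisfied the rule: fall back to the largest K examined.
    return len(gap), gap, s
```

Calling `select_k_by_gap(w_obs, w_null)` with a vector of observed criterion values and a B-by-K_max array of simulated values returns the selected K together with the gap curve, which can be plotted to inspect how sharply the observed criterion departs from the null expectation.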
Cite this article
Cariou, V., Verdun, S., Diaz, E. et al. Comparison of three hypothesis testing approaches for the selection of the appropriate number of clusters of variables. Adv Data Anal Classif 3, 227–241 (2009). https://doi.org/10.1007/s11634-009-0047-6
DOI: https://doi.org/10.1007/s11634-009-0047-6