Comparison of three hypothesis testing approaches for the selection of the appropriate number of clusters of variables

Cariou, Véronique; Verdun, Stéphane; Diaz, Emmanuelle; Qannari, El Mostafa; Vigneau, Evelyne

doi:10.1007/s11634-009-0047-6

Regular Article
Published: 06 November 2009

Volume 3, pages 227–241, (2009)

Advances in Data Analysis and Classification

Véronique Cariou¹,
Stéphane Verdun¹,
Emmanuelle Diaz²,
El Mostafa Qannari¹ &
…
Evelyne Vigneau¹


Abstract

In cluster analysis of variables, selecting the appropriate number of clusters is of paramount interest. The strategy adopted herein is based on hypothesis testing: it tests whether the variation of a partition quality criterion between two consecutive partitions departs markedly from the variation expected under a null hypothesis of no cluster structure. Three hypothesis testing strategies are detailed and compared for the clustering of variables: the Gap statistic, the Weighted Gap statistic and a statistic associated with the CLV methodology. Finally, an illustration based on data from a preference study is presented.
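
As a rough illustration of the first of these strategies, the sketch below applies the Gap statistic decision rule in its commonly stated form: Gap(k) is the mean of log W*_k over B null datasets minus log W_k for the observed data, and the selected k is the smallest one with Gap(k) >= Gap(k+1) - s(k+1). It is not the article's implementation. The function names and all numbers are hypothetical, and the sketch assumes the criterion values W_k (a within-cluster dispersion, lower is better) have already been computed for the observed data and for datasets simulated under the null hypothesis; how to compute them for clusters of variables is left out.

```python
import math
import statistics


def gap_statistic(observed, reference):
    """Return (gaps, errors), one entry per candidate number of clusters.

    observed[i]  : criterion W_k on the real data (lower is better)
    reference[i] : W_k values obtained on B datasets simulated under the null
    """
    gaps, errors = [], []
    for w_obs, w_null in zip(observed, reference):
        log_null = [math.log(w) for w in w_null]
        b = len(log_null)
        # Gap(k): expected log-criterion under the null minus the observed one.
        gaps.append(statistics.fmean(log_null) - math.log(w_obs))
        # Population standard deviation, inflated to account for simulation error.
        errors.append(statistics.pstdev(log_null) * math.sqrt(1 + 1 / b))
    return gaps, errors


def select_k(ks, gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1), or None if no k qualifies."""
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return ks[i]
    return None


# Invented values for illustration only (assumption, not data from the article).
ks = [1, 2, 3, 4]
observed = [100.0, 40.0, 35.0, 32.0]
reference = [
    [100.0, 110.0, 90.0],
    [80.0, 85.0, 75.0],
    [65.0, 70.0, 60.0],
    [55.0, 58.0, 52.0],
]

gaps, errors = gap_statistic(observed, reference)
print(select_k(ks, gaps, errors))
```

With these invented values, the observed criterion drops sharply between one and two clusters while the null reference values decline smoothly, which is the kind of departure from the expected variation that the test is meant to detect.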

Author information

Authors and Affiliations

ENITIAA/INRA, Sensometrics and Chemometrics Laboratory, rue de la Géraudière, BP 82225, 44322, Nantes Cedex 03, France
Véronique Cariou, Stéphane Verdun, El Mostafa Qannari & Evelyne Vigneau
PSA Peugeot Citroën, Intégration Facteurs Humains, DTI/DRIA/DSTF/IFH, 2, route de Gisy, 78943, Vélizy Villacoublay, France
Emmanuelle Diaz

Corresponding author

Correspondence to Véronique Cariou.

About this article

Cite this article

Cariou, V., Verdun, S., Diaz, E. et al. Comparison of three hypothesis testing approaches for the selection of the appropriate number of clusters of variables. Adv Data Anal Classif 3, 227–241 (2009). https://doi.org/10.1007/s11634-009-0047-6

Received: 30 April 2009
Accepted: 21 October 2009
Published: 06 November 2009
Issue Date: December 2009
DOI: https://doi.org/10.1007/s11634-009-0047-6
