Abstract
This paper, arising from population studies, develops clustering algorithms for identifying patterns in data. Based on the concept of geometric variability, we develop one polythetic-divisive and three agglomerative algorithms. The effectiveness of these procedures is shown by relating them to classical clustering algorithms. They are very general: because they impose no constraints on the type of data, they are applicable to studies in many fields (economics, ecology, genetics, ...). Our major contributions are a rigorous formulation of novel clustering algorithms and the discovery of a new relationship between geometric variability and clustering. Finally, these procedures provide a theoretical framework, with an intuitive interpretation, for some classical clustering methods, which can then be applied to any type of data, including mixed data. The approaches are illustrated with real data on Drosophila chromosomal inversions.
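To make the central quantity concrete, the following is a minimal sketch of the sample geometric variability. It assumes the definition commonly used in the distance-based literature (half the mean squared pairwise distance over all ordered pairs) and is not the authors' implementation. The function name `geometric_variability` and the toy data are hypothetical.

```python
import numpy as np

def geometric_variability(d):
    """Sample geometric variability of a group, given its n x n matrix
    of pairwise distances: sum of squared distances / (2 * n^2).
    Assumed definition; not taken from the paper's code."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    return float((d ** 2).sum() / (2 * n ** 2))

# Toy data (hypothetical): three points on a line, absolute-difference distance.
points = np.array([0.0, 1.0, 3.0])
dist = np.abs(points[:, None] - points[None, :])

v = geometric_variability(dist)
# With Euclidean distance, this quantity coincides with the total variance
# computed with divisor n, which np.var uses by default.
print(v, np.var(points))
```

Because only a distance matrix is required, the same function would apply unchanged to distances computed on categorical or mixed data, which is the sense in which such procedures place no constraint on the data type.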
Cite this article
Irigoien, I., Arenas, C., Fernández, E. et al. GEVA: geometric variability-based approaches for identifying patterns in data. Comput Stat 25, 241–255 (2010). https://doi.org/10.1007/s00180-009-0173-9
DOI: https://doi.org/10.1007/s00180-009-0173-9