A Criterion Based on the Mahalanobis Distance for Cluster Analysis with Subsampling

Journal of Classification

Abstract

A two-level data set consists of entities of a higher level (say populations), each one being composed of several units of the lower level (say individuals). Observations are made at the individual level, whereas population characteristics are aggregated from individual data. Cluster analysis with subsampling of populations is a cluster analysis based on individual data that aims at clustering populations rather than individuals. In this article, we extend existing optimality criteria for cluster analysis with subsampling of populations to deal with situations where population characteristics are not the mean of individual data. A new criterion that depends on the Mahalanobis distance is also defined. The criteria are compared using simulated examples and an ecological data set of tree species in a tropical rain forest.
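For readers unfamiliar with the distance the new criterion depends on, the following is a minimal sketch of the standard Mahalanobis distance, d(x, c) = sqrt((x - c)^T S^{-1} (x - c)). It is not the criterion defined in the article, whose exact form is not given in this abstract. The function names `solve` and `mahalanobis`, and the idea of measuring a population characteristic `x` against a cluster center `center` with covariance `cov`, are illustrative assumptions.

```python
import math


def solve(a, b):
    """Solve the linear system a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the caller's inputs are not modified.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def mahalanobis(x, center, cov):
    """Standard Mahalanobis distance between x and center under covariance cov."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    # Solve cov @ z = diff instead of inverting cov explicitly.
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With an identity covariance the function reduces to the Euclidean distance; a non-identity covariance rescales each direction by its variability, so a deviation along a high-variance direction counts for less.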

Author information

Corresponding authors

Correspondence to Nicolas Picard or Avner Bar-Hen.

About this article

Cite this article

Picard, N., Bar-Hen, A. A Criterion Based on the Mahalanobis Distance for Cluster Analysis with Subsampling. J Classif 29, 23–49 (2012). https://doi.org/10.1007/s00357-012-9100-9

  • DOI: https://doi.org/10.1007/s00357-012-9100-9
