Detecting Clusters in the Data from Variance Decompositions of Its Projections

Journal of Classification

Abstract

A new projection-pursuit index is used to identify clusters and other structures in multivariate data. It is obtained from the variance decompositions of the data’s one-dimensional projections, without assuming a model for the data or that the number of clusters is known. The index is affine invariant and performs well on real and simulated data. A general result is obtained indicating that the separation between clusters increases with the data’s dimension. Simulations confirm, as expected, that the performance of the index either improves or does not deteriorate as the data’s dimension increases, making it especially useful for “large dimension-small sample size” data. The computational efficiency of the index will increase as computer technology continues to improve. Several applications are presented.
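To make the idea of an index built from the variance decomposition of a one-dimensional projection concrete, the following is a minimal sketch under stated assumptions. It is not the index defined in the article, whose exact form is not given in this abstract. It uses the standard ANOVA identity (total sum of squares = within-group + between-group) for every two-group split of the sorted projection, and searches random unit directions rather than optimizing. The names `split_index` and `best_direction`, the parameter `n_dirs`, and the random-search strategy are all hypothetical choices made for illustration.

```python
import numpy as np


def split_index(z):
    """Largest between-group share of the total sum of squares over all
    two-group splits of the sorted values (illustrative stand-in only)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    if n < 2:
        return 0.0
    total = np.sum((z - z.mean()) ** 2)
    if total == 0.0:
        return 0.0
    csum = np.cumsum(z)
    k = np.arange(1, n)                      # size of the lower group
    mean_lo = csum[:-1] / k                  # mean of the k smallest values
    mean_hi = (csum[-1] - csum[:-1]) / (n - k)
    # Between-group sum of squares for two groups of sizes k and n - k.
    between = k * (n - k) / n * (mean_hi - mean_lo) ** 2
    return float(np.max(between / total))


def best_direction(X, n_dirs=500, seed=0):
    """Random search over unit directions (an assumption, not an optimizer):
    return the direction whose projection has the largest split_index."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    scores = np.array([split_index(X @ a) for a in dirs])
    j = int(np.argmax(scores))
    return dirs[j], float(scores[j])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic clusters in 5 dimensions, separated along the first axis.
    a = rng.normal(size=(50, 5))
    b = rng.normal(size=(50, 5))
    b[:, 0] += 20.0
    direction, score = best_direction(np.vstack([a, b]))
    print(direction, score)
```

This ratio lies between 0 and 1 and is also fairly large for unimodal data, so in practice it would have to be compared against a reference distribution rather than read as an absolute measure. The sketch does not standardize the data, so unlike the index described above it is not affine invariant.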

Author information

Corresponding author

Correspondence to Yannis G. Yatracos.

Additional information

Many thanks are due to the two anonymous referees and Professor Willem J. Heiser, the Editor, for many useful comments and suggestions that greatly helped to improve the presentation and the quality of this work. Thanks are also due to Professor Rudy Beran for his comments and suggestions. Last but not least, I would like to thank Dr. Michalis Kolossiatis for his invaluable help with the simulations. Part of this work was done while the author was affiliated with the Department of Statistics and Applied Probability, National University of Singapore. This research was partially supported by the National University of Singapore and the Cyprus University of Technology.

About this article

Cite this article

Yatracos, Y.G. Detecting Clusters in the Data from Variance Decompositions of Its Projections. J Classif 30, 30–55 (2013). https://doi.org/10.1007/s00357-013-9124-9

  • DOI: https://doi.org/10.1007/s00357-013-9124-9
