Abstract
A new projection-pursuit index is used to identify clusters and other structures in multivariate data. It is obtained from the variance decompositions of the data’s one-dimensional projections, without assuming a model for the data or that the number of clusters is known. The index is affine invariant and performs well on real and simulated data. A general result is obtained indicating that the separation between clusters increases with the data’s dimension. Simulations confirm, as expected, that the performance of the index either improves or does not deteriorate as the data’s dimension increases, making it especially useful for “large dimension-small sample size” data. The efficiency of this index will increase as computer technology continues to improve. Several applications are presented.
Additional information
Many thanks are due to the two anonymous referees and Professor Willem J. Heiser, the Editor, for many useful comments and suggestions that greatly helped to improve the presentation and the quality of this work. Thanks are also due to Professor Rudy Beran for his comments and suggestions. Last but not least, I would like to thank Dr. Michalis Kolossiatis for his invaluable help with the simulations. Part of this work was done while the author was affiliated with the Department of Statistics and Applied Probability, National University of Singapore. This research was partially supported by the National University of Singapore and the Cyprus University of Technology.
Cite this article
Yatracos, Y.G. Detecting Clusters in the Data from Variance Decompositions of Its Projections. J Classif 30, 30–55 (2013). https://doi.org/10.1007/s00357-013-9124-9
DOI: https://doi.org/10.1007/s00357-013-9124-9