A cluster stability criteria based on the two-sample test concept

Volkovich, Z.; Barzily, Z.; Morozensky, L.

doi:10.1007/3-540-33880-2_33


Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Abstract

This paper presents a method for assessing cluster stability. We hypothesize that if a “consistent” clustering algorithm is used to partition several independent samples, then the clustered samples should be identically distributed. We test this hypothesis with the two-sample energy test. A single application of such a test is not very efficient in clustering problems, because outliers in the samples and limitations of the clustering algorithms contribute heavily to the noise level. We therefore compute the test statistic many times to obtain its empirical distribution, and we take as the “true” number of clusters the value that yields the most concentrated distribution. Results of numerical experiments are reported.
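
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the statistic is the Euclidean form of the energy distance (the chapter may use a different distance function), two disjoint subsamples are pooled and clustered together with a plain k-means, the per-cluster statistics are summed, and "most concentrated" is measured by the interquartile range. All function names (`energy_statistic`, `kmeans_labels`, `statistic_for_one_draw`, `spread_by_k`) are hypothetical.

```python
import math
import random
import statistics


def mean_dist(a, b):
    """Average Euclidean distance over all pairs (x, y), x in a, y in b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_statistic(a, b):
    """Euclidean energy distance 2*E|X-Y| - E|X-X'| - E|Y-Y'| (assumed form)."""
    return 2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b)


def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = rng.sample(points, k)

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centers[c]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [nearest(p) for p in points]


def statistic_for_one_draw(data, k, m, rng):
    """Draw two disjoint subsamples of size m, cluster their union, and sum
    the per-cluster energy statistic between the two subsamples (assumption)."""
    drawn = rng.sample(data, 2 * m)
    labels = kmeans_labels(drawn, k, rng)
    total = 0.0
    for c in range(k):
        first = [p for p, lab in zip(drawn[:m], labels[:m]) if lab == c]
        second = [p for p, lab in zip(drawn[m:], labels[m:]) if lab == c]
        if first and second:  # skip clusters seen by only one subsample
            total += energy_statistic(first, second)
    return total


def spread_by_k(data, ks, m=40, repeats=30, seed=0):
    """Interquartile range of the repeated statistic for each candidate k."""
    rng = random.Random(seed)
    spreads = {}
    for k in ks:
        values = [statistic_for_one_draw(data, k, m, rng) for _ in range(repeats)]
        q1, _, q3 = statistics.quantiles(values, n=4)
        spreads[k] = q3 - q1
    return spreads


if __name__ == "__main__":
    # Synthetic data: three Gaussian blobs in the plane (illustrative only).
    rng = random.Random(1)
    blobs = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in blobs for _ in range(100)]
    spreads = spread_by_k(data, ks=range(2, 6))
    print(spreads, "-> k with the smallest spread:", min(spreads, key=spreads.get))
```

The sketch returns a spread per candidate number of clusters and picks the smallest. Whether that choice recovers the intended number of clusters depends on the data, the clustering algorithm, and the concentration measure, all of which are stand-ins here.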

Author information

Authors and Affiliations

Software Engineering Department, ORT Braude Academic College, Karmiel, 21982, Israel
Z. Volkovich (Affiliate Professor), Z. Barzily & L. Morozensky
Department of Mathematics and Statistics, The University of Maryland, Baltimore County, USA
Z. Volkovich (Affiliate Professor)

Editor information

Editors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Institute of Computer Sciences, Technical University of Lodz, ul. Wolczanska 215, 93-1005, Lodz, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Department of Software Engineering, ORT Braude College, POB. 78, 21982, Karmiel, Israel
Zeev Volkovich
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL, 33620, USA
Abraham Kandel

About this chapter

Cite this chapter

Volkovich, Z., Barzily, Z., Morozensky, L. (2006). A cluster stability criteria based on the two-sample test concept. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_33

DOI: https://doi.org/10.1007/3-540-33880-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)
