A cluster stability criteria based on the two-sample test concept

Advances in Web Intelligence and Data Mining

Part of the book series: Studies in Computational Intelligence ((SCI,volume 23))

1 Abstract

This paper presents a method for assessing cluster stability. We hypothesize that if a “consistent” clustering algorithm is used to partition several independent samples, then the clustered samples should be identically distributed. We test this hypothesis with the two-sample energy test. Applied once, such a test is not very efficient in clustering problems, because outliers in the samples and limitations of the clustering algorithms contribute heavily to the noise level. We therefore compute the test statistic many times and obtain its empirical distribution. The “true” number of clusters is chosen as the one that yields the most concentrated distribution. Results of numerical experiments are reported.
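The procedure outlined in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the energy statistic is taken in its Euclidean-distance form, plain Lloyd k-means stands in for the "consistent" clustering algorithm, clusters from the two samples are paired greedily by nearest centroid, and "most concentrated" is read as the smallest standard deviation of the repeated statistic. All function names (`energy_statistic`, `statistic_values`, `choose_k`, `make_blobs`) are hypothetical.

```python
import math
import random
import statistics


def energy_statistic(xs, ys):
    """Two-sample energy statistic with Euclidean distance (assumed kernel)."""
    def mean_dist(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))
    return 2 * mean_dist(xs, ys) - mean_dist(xs, xs) - mean_dist(ys, ys)


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def kmeans(points, k, rng, iters=15):
    """Plain Lloyd k-means; returns k clusters (some may be empty)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster receives no points.
        centers = [centroid(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters


def statistic_values(data, k, rng, repeats=20, sample_size=40):
    """Repeat: draw two samples, cluster each, compare paired clusters."""
    values = []
    for _ in range(repeats):
        first = [cl for cl in kmeans(rng.sample(data, sample_size), k, rng) if cl]
        second = [cl for cl in kmeans(rng.sample(data, sample_size), k, rng) if cl]
        total = 0.0
        for cl in first:
            if not second:
                break
            # Greedy pairing by nearest centroid (an assumption of this sketch).
            match = min(second,
                        key=lambda other: math.dist(centroid(cl), centroid(other)))
            second.remove(match)
            total += energy_statistic(cl, match)
        values.append(total)
    return values


def choose_k(data, candidates, seed=0):
    """Pick the k whose repeated statistic has the smallest spread."""
    rng = random.Random(seed)
    spread = {k: statistics.pstdev(statistic_values(data, k, rng))
              for k in candidates}
    return min(spread, key=spread.get), spread


def make_blobs(rng, centers, n_per_blob=60, scale=0.3):
    """Synthetic 2-D Gaussian blobs, used only to exercise the sketch."""
    return [(rng.gauss(cx, scale), rng.gauss(cy, scale))
            for cx, cy in centers for _ in range(n_per_blob)]


data = make_blobs(random.Random(1), [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)])
best_k, spread = choose_k(data, [2, 3, 4, 5])
```

Here `best_k` is the candidate with the smallest spread and `spread` maps each candidate to that spread. The sketch is not expected to reproduce the chapter's numerical results; the kernel, the pairing rule and the concentration measure are all interchangeable, and the chosen value depends on them.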

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Volkovich, Z., Barzily, Z., Morozensky, L. (2006). A cluster stability criteria based on the two-sample test concept. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_33

  • DOI: https://doi.org/10.1007/3-540-33880-2_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33879-6

  • Online ISBN: 978-3-540-33880-2

  • eBook Packages: Engineering, Engineering (R0)
