
Self-learning K-means clustering: a global optimization approach

Published in: Journal of Global Optimization

Abstract

An appropriate distance is an essential ingredient in many real-world learning tasks. Distance metric learning aims to find a metric that reflects the configuration of the data better than commonly used distances. We propose an algorithm that simultaneously learns a Mahalanobis-like distance and performs K-means clustering, coupling data rescaling with clustering so that the separability of the data grows iteratively in the rescaled space as the data is sequentially reclustered. At each step of the algorithm, a global optimization problem is solved to minimize the cluster distortion with respect to the current cluster configuration. The learned weight matrix can also serve as a cluster validation characteristic: closeness of the matrices learned across a sampling process indicates that the clustering has stabilized, and thus estimates the true number of clusters. Numerical experiments on synthetic and real datasets confirm the high reliability of the proposed method.
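The alternating scheme described above — recluster the data, then re-learn a Mahalanobis-like weight matrix so that separability grows in the rescaled space — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact method: here the metric update is taken to be the regularized inverse of the pooled within-cluster scatter (a common heuristic), whereas the paper solves a global optimization problem for the matrix at each step; the function name and parameters are hypothetical.

```python
import numpy as np

def self_learning_kmeans(X, k, n_iter=15, seed=0):
    """Alternate K-means assignments with re-estimation of a
    Mahalanobis-like weight matrix W, so that distances are
    measured as d(x, c) = (x - c)^T W (x - c)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)                                  # start with the Euclidean metric
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step under the current metric
        diff = X[:, None, :] - centers[None, :, :]          # shape (n, k, d)
        dist = np.einsum('nkd,de,nke->nk', diff, W, diff)   # squared W-distances
        labels = dist.argmin(axis=1)
        # center update step
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        # metric update: the inverse pooled within-cluster scatter down-weights
        # directions of large within-cluster spread, improving separability
        S = sum((labels == j).sum() * np.cov(X[labels == j].T, bias=True)
                for j in range(k) if (labels == j).sum() > 1) / n
        W = np.linalg.inv(S + 1e-6 * np.eye(d))    # regularized inversion
    return labels, centers, W
```

Tracking how little the learned matrix W changes between runs on resampled data would then play the role of the cluster-validity indicator described in the abstract: stable matrices suggest the chosen number of clusters is correct.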



Author information

Corresponding author

Correspondence to Z. Volkovich.


About this article

Cite this article

Volkovich, Z., Toledano-Kitai, D. & Weber, GW. Self-learning K-means clustering: a global optimization approach. J Glob Optim 56, 219–232 (2013). https://doi.org/10.1007/s10898-012-9854-y
