Abstract
An appropriate distance is an essential ingredient in many real-world learning tasks. Distance metric learning seeks a metric that reflects the configuration of the data better than the commonly used ones. We propose an algorithm that simultaneously learns a Mahalanobis-like distance and a K-means clustering, coupling data rescaling with clustering so that the separability of the data grows iteratively as the rescaled space is repeatedly clustered. At each step of the algorithm, a global optimization problem is solved to minimize the cluster distortions given the current cluster configuration. The learned weight matrix can also serve as a cluster validation characteristic: the closeness of the matrices learned in a sampling procedure indicates how well-formed the clusters are, and thus provides an estimate of the true number of clusters. Numerical experiments on synthetic and real datasets confirm the high reliability of the proposed method.
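To make the alternating scheme concrete, the sketch below (NumPy) iterates K-means assignments under the current Mahalanobis-like metric and then re-learns the weight matrix from the within-cluster scatter. The closed-form inverse-scatter update and the determinant normalization are assumptions chosen for illustration; they stand in for, and are not identical to, the paper's global optimization step.

```python
import numpy as np

def self_learning_kmeans(X, k, n_iter=50, seed=0, eps=1e-8):
    """Illustrative sketch: alternate K-means assignments under the distance
    d_W(x, c) = (x - c)^T W (x - c) with updates of the weight matrix W.

    W is taken here as the regularized inverse of the pooled within-cluster
    covariance, normalized to det(W) = 1 to exclude the trivial solution
    W -> 0. This is an assumed stand-in for the paper's optimization step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    W = np.eye(d)

    for _ in range(n_iter):
        # 1) Assign each point to the nearest center under the current metric W.
        diffs = X[:, None, :] - centers[None, :, :]            # shape (n, k, d)
        dists = np.einsum('nkd,de,nke->nk', diffs, W, diffs)   # squared distances
        labels = dists.argmin(axis=1)

        # 2) Recompute centers as cluster means (empty clusters keep their center).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        # 3) Re-learn W from the pooled within-cluster scatter.
        S = sum((X[labels == j] - centers[j]).T @ (X[labels == j] - centers[j])
                for j in range(k)) / n
        S += eps * np.eye(d)                                   # regularization
        W = np.linalg.inv(S)
        W /= np.linalg.det(W) ** (1.0 / d)                     # enforce det(W) = 1

    return labels, centers, W
```

In the same spirit as the validation idea in the abstract, one could run this sketch on repeated subsamples and compare the resulting matrices W (e.g., by the norm of their difference): small discrepancies across subsamples suggest a stable partition and hence a plausible choice of the number of clusters.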
Cite this article
Volkovich, Z., Toledano-Kitai, D. & Weber, GW. Self-learning K-means clustering: a global optimization approach. J Glob Optim 56, 219–232 (2013). https://doi.org/10.1007/s10898-012-9854-y