Abstract
An appropriate distance is an essential ingredient in many real-world learning tasks. Distance metric learning seeks a metric that reflects the configuration of the data better than the commonly used ones. We propose an algorithm that simultaneously learns a Mahalanobis-like distance and a K-means clustering, coupling data rescaling with clustering so that the separability of the data grows iteratively as the rescaled space is repeatedly clustered. At each step of the algorithm, a global optimization problem is solved to minimize the cluster distortions given the current cluster configuration. The learned weight matrix can also serve as a cluster validation characteristic: the closeness of the matrices learned in a sampling procedure indicates how well-formed the clusters are, and thus provides an estimate of the true number of clusters. Numerical experiments on synthetic and real datasets confirm the high reliability of the proposed method.
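To make the alternating scheme concrete, the sketch below (NumPy) iterates K-means assignments under the current Mahalanobis-like metric and then re-learns the weight matrix from the within-cluster scatter. The closed-form inverse-scatter update and the determinant normalization are assumptions chosen for illustration; they stand in for, and are not identical to, the paper's global optimization step.

```python
import numpy as np

def self_learning_kmeans(X, k, n_iter=50, seed=0, eps=1e-8):
    """Illustrative sketch: alternate K-means assignments under the distance
    d_W(x, c) = (x - c)^T W (x - c) with updates of the weight matrix W.

    W is taken here as the regularized inverse of the pooled within-cluster
    covariance, normalized to det(W) = 1 to exclude the trivial solution
    W -> 0. This is an assumed stand-in for the paper's optimization step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    W = np.eye(d)

    for _ in range(n_iter):
        # 1) Assign each point to the nearest center under the current metric W.
        diffs = X[:, None, :] - centers[None, :, :]            # shape (n, k, d)
        dists = np.einsum('nkd,de,nke->nk', diffs, W, diffs)   # squared distances
        labels = dists.argmin(axis=1)

        # 2) Recompute centers as cluster means (empty clusters keep their center).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        # 3) Re-learn W from the pooled within-cluster scatter.
        S = sum((X[labels == j] - centers[j]).T @ (X[labels == j] - centers[j])
                for j in range(k)) / n
        S += eps * np.eye(d)                                   # regularization
        W = np.linalg.inv(S)
        W /= np.linalg.det(W) ** (1.0 / d)                     # enforce det(W) = 1

    return labels, centers, W
```

In the same spirit as the validation idea in the abstract, one could run this sketch on repeated subsamples and compare the resulting matrices W (e.g., by the norm of their difference): small discrepancies across subsamples suggest a stable partition and hence a plausible choice of the number of clusters.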
Cite this article
Volkovich, Z., Toledano-Kitai, D. & Weber, GW. Self-learning K-means clustering: a global optimization approach. J Glob Optim 56, 219–232 (2013). https://doi.org/10.1007/s10898-012-9854-y