Data Driven Similarity Measures for k-Means Like Clustering Algorithms

  • Published: April 2005
  • Volume 8, pages 331–349, (2005)
  • Jacob Kogan,
  • Marc Teboulle &
  • Charles Nicholas

Abstract

We present an optimization approach that generates k-means like clustering algorithms. The batch k-means and the incremental k-means are two well-known versions of the classical k-means clustering algorithm (Duda et al. 2000). To benefit from the speed of the batch version and the accuracy of the incremental version we combine the two in a “ping–pong” fashion. We use a distance-like function that combines the squared Euclidean distance with relative entropy. In the extreme cases our algorithm recovers the classical k-means clustering algorithm and generalizes the Divisive Information Theoretic clustering algorithm recently reported independently by Berkhin and Becher (2002) and Dhillon et al. (2002). Results of numerical experiments that demonstrate the viability of our approach are reported.
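The abstract combines two ingredients: a distance-like function blending the squared Euclidean distance with relative entropy, and a batch/incremental “ping–pong” alternation. The sketch below is a minimal illustration of that idea, not the authors' implementation: the names (`dist`, `nu`, `mu`, `ping_pong_kmeans`) are assumptions, the incremental pass recomputes the objective by brute force rather than via closed-form update deltas, and strictly positive data is assumed so the entropic term is well defined.

```python
import numpy as np

def kl(x, a, eps=1e-12):
    """Generalized relative entropy KL(x||a) = sum_j x_j log(x_j/a_j) - x_j + a_j,
    defined for nonnegative vectors (eps guards the logarithm and division)."""
    return float(np.sum(x * np.log((x + eps) / (a + eps)) - x + a))

def dist(x, a, nu=1.0, mu=0.0):
    """Distance-like function nu*||x - a||^2 + mu*KL(x||a).
    With mu = 0 this recovers the classical squared-Euclidean k-means distance."""
    return nu * float(np.sum((x - a) ** 2)) + mu * kl(x, a)

def quality(X, centers, labels, nu, mu):
    """Clustering objective: total distance from each point to its centroid."""
    return sum(dist(X[i], centers[labels[i]], nu, mu) for i in range(len(X)))

def batch_step(X, centers, nu, mu):
    """One batch k-means iteration: nearest-centroid assignment, then a mean
    update (for a fixed assignment the arithmetic mean minimizes both the
    quadratic and the entropic term of dist)."""
    k = len(centers)
    labels = np.array([min(range(k), key=lambda j: dist(x, centers[j], nu, mu))
                       for x in X])
    centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
    return labels, centers

def incremental_step(X, centers, labels, nu, mu):
    """Try single-point reassignments, accepting the first one that strictly
    lowers the objective. Brute force for clarity: a serious implementation
    would use closed-form formulas for the change in the objective."""
    k = len(centers)
    best = quality(X, centers, labels, nu, mu)
    for i in range(len(X)):
        for j in range(k):
            if j == labels[i]:
                continue
            trial = labels.copy()
            trial[i] = j
            trial_centers = np.array([X[trial == c].mean(axis=0) if (trial == c).any()
                                      else centers[c] for c in range(k)])
            if quality(X, trial_centers, trial, nu, mu) < best - 1e-12:
                return trial, trial_centers, True
    return labels, centers, False

def ping_pong_kmeans(X, centers, nu=1.0, mu=0.0, max_rounds=50):
    """Alternate batch passes (fast) with incremental passes (accurate)
    until neither changes the partition."""
    labels, centers = batch_step(X, centers, nu, mu)
    for _ in range(max_rounds):
        new_labels, centers = batch_step(X, centers, nu, mu)
        if np.array_equal(new_labels, labels):
            labels, centers, moved = incremental_step(X, centers, labels, nu, mu)
            if not moved:
                break
        else:
            labels = new_labels
    return labels, centers
```

Setting mu = 0 reduces the sketch to classical k-means, while nu = 0 on normalized nonnegative data corresponds to the entropic extreme the abstract mentions.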




References

  • Auslender A, Teboulle M and Ben-Tiba S (1999) Interior proximal and multiplier methods based on second order homogeneous kernels. Mathematics of Operations Research, 24:645–668.

  • Berkhin P and Becher JD (2002) Learning simple relations: Theory and applications. In: Proceedings of the Second SIAM International Conference on Data Mining.

  • Berry M and Browne M (1999) Understanding Search Engines. SIAM.

  • Bertsekas DP and Tsitsiklis JN (1989) Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, New Jersey.


  • Dempster A, Laird N and Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39.

  • Dhillon IS, Guan Y and Kogan J (2002) Refining clusters in high-dimensional text data. In: Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the Second SIAM International Conference on Data Mining).

  • Dhillon, IS, Kogan J and Nicholas C (2003) Feature selection and document clustering, In Berry MW Ed. A Comprehensive Survey of Text Mining, pp. 73–100.

  • Dhillon IS and Modha DS (2001) Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175.


  • Duda RO, Hart PE and Stork DG (2000) Pattern Classification. John Wiley & Sons.

  • Forgy E (1965) Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768.


  • Dhillon IS, Mallela S and Kumar R (2002) Enhanced word clustering for hierarchical text classification. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), pp. 191–200.

  • Kogan J (2001) Clustering large unstructured document sets. In: Berry MW Ed. Computational Information Retrieval, pp. 107–117.

  • Kogan J (2001) Means clustering for text data. In: Proceedings of the Workshop on Text Mining at the First SIAM International Conference on Data Mining, pp. 47–54.

  • Kogan J, Teboulle M and Nicholas C (2003) Optimization approach to generating families of k-means like algorithms. In: Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the Third SIAM International Conference on Data Mining).

  • Kogan J, Teboulle M and Nicholas C (2003) The entropic geometric means algorithm: An approach for building small clusters for large text datasets. In: Proceedings of the Workshop on Clustering Large Data Sets (held in conjunction with the Third IEEE International Conference on Data Mining), pp. 63–71.

  • Xu L and Jordan MI (1995) On convergence properties of the EM algorithm for Gaussian mixtures. MIT A.I. Memo No. 1520, C.B.C.L. Paper No. 111.

  • Rockafellar RT (1970) Convex Analysis. Princeton University Press, Princeton, NJ.


  • Teboulle M (1992) On ϕ-divergence and its applications. In: Phillips FY and Rousseau J Eds. Systems and Management Science by Extremal Methods–Research Honoring Abraham Charnes at Age 70, Kluwer Academic Publishers, pp. 255–273.

  • Teboulle M (1997) Convergence of proximal-like algorithms. SIAM J. of Optimization, 7:1069–1083.



Author information

Authors and Affiliations

  1. Department of Mathematics and Statistics, UMBC, Baltimore, MD, 21250

    Jacob Kogan

  2. School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel

    Marc Teboulle

  3. Department of Computer Science and Electrical Engineering, UMBC, Baltimore, MD, 21250

    Charles Nicholas


Corresponding author

Correspondence to Jacob Kogan.

Additional information

This research was supported in part by the US Department of Defense, the United States–Israel Binational Science Foundation (BSF), and Northrop Grumman Mission Systems (NG/MS).


About this article

Cite this article

Kogan, J., Teboulle, M. & Nicholas, C. Data Driven Similarity Measures for k-Means Like Clustering Algorithms. Inf Retrieval 8, 331–349 (2005). https://doi.org/10.1007/s10791-005-5666-8



Keywords

  • clustering algorithms
  • optimization
  • entropy