Parallel bisecting k-means with prediction clustering algorithm

Li, Yanjun; Chung, Soon M.

doi:10.1007/s11227-006-0002-7

Published: January 2007

Volume 39, pages 19–37, (2007)

The Journal of Supercomputing

Abstract

In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters.
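The abstract relies on two ideas: bisecting k-means, and entropy as a measure of cluster purity. As a rough illustration only, the following is a minimal sequential sketch of the commonly described bisecting procedure (repeatedly split the largest cluster in two with basic 2-means, keeping the best of several trial splits), together with a size-weighted entropy over class labels. It is not the PBKP algorithm: it has no message passing and no prediction step, and the function names, the "split the largest cluster" rule, the `trials` and `iters` parameters, and the use of squared Euclidean distance are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Split points into two groups with basic k-means (k = 2).

    Returns (left, right), or None if no non-empty split was found.
    """
    c0, c1 = rng.sample(points, 2)  # two random points as initial centroids
    split = None
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break  # degenerate split (e.g. identical centroids)
        split = (left, right)
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return split


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Sequential bisecting k-means sketch (assumed variant, not PBKP)."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # assumption: always split the largest cluster
        if len(target) < 2:
            raise ValueError("not enough points to form k clusters")
        best = None
        for _ in range(trials):
            split = two_means(target, rng)
            if split is None:
                continue
            cost = sse(split[0]) + sse(split[1])
            if best is None or cost < best[0]:
                best = (cost, split)  # keep the trial with the lowest SSE
        if best is None:
            raise ValueError("could not split cluster")
        clusters.extend(best[1])
    return clusters


def total_entropy(label_clusters):
    """Size-weighted average entropy (in bits) of class labels per cluster."""
    n = sum(len(labels) for labels in label_clusters)
    total = 0.0
    for labels in label_clusters:
        if not labels:
            continue
        counts = Counter(labels)
        h = -sum((m / len(labels)) * math.log2(m / len(labels))
                 for m in counts.values())
        total += len(labels) / n * h
    return total


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (30, 2), (31, 2)]
    for cluster in bisecting_kmeans(pts, 3):
        print(cluster)
```

A lower `total_entropy` value means purer clusters; a value of zero means every cluster contains a single class. In a data-parallel setting, each processor would typically hold a slice of the points and the centroid updates would be combined across processors, but how PBKP does this, and how its prediction step balances the load, is not described in the abstract and is not modeled here.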

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, Ohio, 45435, USA
Yanjun Li & Soon M. Chung

Corresponding author

Correspondence to Soon M. Chung.

About this article

Cite this article

Li, Y., Chung, S.M. Parallel bisecting k-means with prediction clustering algorithm. J Supercomput 39, 19–37 (2007). https://doi.org/10.1007/s11227-006-0002-7

Issue Date: January 2007
DOI: https://doi.org/10.1007/s11227-006-0002-7
