Abstract
In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data parallelism of the bisecting k-means algorithm and adopts a prediction step to balance the workloads of multiple processors, thereby achieving a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than parallel k-means with respect to the dimension of the data and the desired number of clusters.
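For readers unfamiliar with the sequential baseline, the following is a minimal sketch of plain bisecting k-means together with a cluster-entropy measure. It is not the PBKP algorithm: it has no message passing and no prediction step. The function names, the rule of always splitting the largest cluster, the fallback for degenerate splits, and the size-weighted Shannon entropy definition are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def two_means(points, rng, iters=20):
    """Split points into two groups with ordinary 2-means (Lloyd iterations)."""
    c0, c1 = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for p in points:
            (left if sq_dist(p, c0) <= sq_dist(p, c1) else right).append(p)
        if not left or not right:
            # Degenerate split (e.g. duplicate points): cut the list in half.
            half = len(points) // 2
            return points[:half], points[half:]
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect one cluster until k clusters exist.

    Assumption: the largest cluster is chosen for each split; other
    selection rules are possible.
    """
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        if len(target) < 2:
            clusters.append(target)  # nothing left to split
            break
        clusters.extend(two_means(target, rng))
    return clusters


def clustering_entropy(label_clusters):
    """Size-weighted average Shannon entropy (in bits) of class labels per cluster.

    Lower values mean purer clusters; 0 means every cluster holds one class.
    """
    total = sum(len(labels) for labels in label_clusters)
    result = 0.0
    for labels in label_clusters:
        counts = Counter(labels)
        h = -sum((n / len(labels)) * math.log2(n / len(labels))
                 for n in counts.values())
        result += len(labels) / total * h
    return result
```

Calling `bisecting_kmeans(points, k)` on a list of equal-length tuples returns a list of `k` point lists (fewer only if the largest remaining cluster has a single point). `clustering_entropy` takes, for each cluster, the list of true class labels of its members, which is how purity can be compared between two clusterings of the same labeled data.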
Cite this article
Li, Y., Chung, S.M. Parallel bisecting k-means with prediction clustering algorithm. J Supercomput 39, 19–37 (2007). https://doi.org/10.1007/s11227-006-0002-7
DOI: https://doi.org/10.1007/s11227-006-0002-7