Parallel bisecting k-means with prediction clustering algorithm

Published in: The Journal of Supercomputing

Abstract

In this paper, we propose a new parallel clustering algorithm, Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes and, according to our experiments, clusters with smaller entropy (i.e., purer clusters) than k-means does. PBKP fully exploits the data parallelism of the bisecting k-means algorithm and adopts a prediction step to balance the workloads of the processors, thereby achieving a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear in the number of processors and the number of data points. Moreover, PBKP scales up better than parallel k-means with respect to the dimension of the data and the desired number of clusters.
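
For readers unfamiliar with the sequential method that PBKP builds on, the following is a minimal Python sketch of plain bisecting k-means together with an entropy measure of cluster purity. It is not the PBKP algorithm: the prediction step and the message-passing parallelization are not described in the abstract and are not reproduced here. The function names, the choice to always split the largest cluster, the number of trial splits, and the weighted-average form of the entropy are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Split points into two groups with basic 2-means (Lloyd iterations)."""
    c0, c1 = rng.sample(points, 2)  # two random points as initial centroids
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for p in points:
            (left if sq_dist(p, c0) <= sq_dist(p, c1) else right).append(p)
        if not left or not right:
            break  # degenerate split; the caller discards it
        new0, new1 = centroid(left), centroid(right)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect one cluster until k clusters exist.

    Assumptions (not taken from the paper): the largest cluster is chosen
    for splitting, and the best of `trials` 2-means splits (lowest SSE) is kept.
    """
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        target = clusters.pop(i)
        best = None
        if len(target) >= 2:
            for _ in range(trials):
                left, right = two_means(target, rng)
                if left and right:
                    cost = sse(left) + sse(right)
                    if best is None or cost < best[0]:
                        best = (cost, left, right)
        if best is None:
            clusters.append(target)  # nothing left to split; stop early
            break
        clusters.extend(best[1:])
    return clusters


def cluster_entropy(labels_per_cluster):
    """Size-weighted average of per-cluster class entropy, in bits.

    Lower values mean purer clusters. Assumed form; the abstract gives no formula.
    """
    total = sum(len(labels) for labels in labels_per_cluster)
    result = 0.0
    for labels in labels_per_cluster:
        counts = Counter(labels)
        h = -sum((c / len(labels)) * math.log2(c / len(labels))
                 for c in counts.values())
        result += (len(labels) / total) * h
    return result
```

Calling `bisecting_kmeans(points, k)` on a list of equal-length numeric tuples returns a list of `k` point lists (fewer if no further split is possible). `cluster_entropy` takes, for each cluster, the list of true class labels of its members, so it applies only when ground-truth labels are available.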

Author information

Corresponding author

Correspondence to Soon M. Chung.

About this article

Cite this article

Li, Y., Chung, S.M. Parallel bisecting k-means with prediction clustering algorithm. J Supercomput 39, 19–37 (2007). https://doi.org/10.1007/s11227-006-0002-7

  • DOI: https://doi.org/10.1007/s11227-006-0002-7