RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets

Distributed and Parallel Databases

Abstract

This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of data points and d is the number of dimensions. For large datasets, this is prohibitively expensive. In contrast, RACHET runs with at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic, in which the merges at each stage are chosen so as to minimize the volume of a covering hypersphere. For each cluster centroid, RACHET maintains descriptive statistics of constant complexity to enable these choices. RACHET's framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.
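
The encircling tactic can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not say which descriptive statistics RACHET keeps, so the `Summary` fields (centroid, radius, count) and the helper names `cover_radius`, `merge`, and `agglomerate` are assumptions made for illustration. Each cluster is treated as a hypersphere, and the pair whose covering hypersphere has the smallest radius is merged first; for a fixed number of dimensions, the smallest radius also gives the smallest volume. The exhaustive pair search below is quadratic in the number of summaries and does not reproduce the O(n) bound stated in the abstract.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Summary:
    """Constant-size cluster description (hypothetical fields, not from the paper)."""
    centroid: tuple  # d coordinates of the cluster centroid
    radius: float    # radius of a sphere around the centroid that covers the cluster
    count: int       # number of data points in the cluster


def cover_radius(a, b):
    """Radius of the smallest sphere that encloses the spheres of a and b."""
    dist = math.dist(a.centroid, b.centroid)
    if dist + min(a.radius, b.radius) <= max(a.radius, b.radius):
        return max(a.radius, b.radius)  # one sphere already contains the other
    return (dist + a.radius + b.radius) / 2.0


def merge(a, b):
    """Combine two summaries using only their statistics, never the raw points."""
    n = a.count + b.count
    centroid = tuple((a.count * x + b.count * y) / n
                     for x, y in zip(a.centroid, b.centroid))
    # Radius large enough to cover both child spheres from the new centroid.
    radius = max(math.dist(centroid, a.centroid) + a.radius,
                 math.dist(centroid, b.centroid) + b.radius)
    return Summary(centroid, radius, n)


def agglomerate(summaries):
    """Greedily merge the pair with the smallest covering sphere until one root remains."""
    active = list(summaries)
    history = []  # (left, right, merged) triples in merge order
    while len(active) > 1:
        a, b = min(combinations(active, 2), key=lambda p: cover_radius(*p))
        active.remove(a)
        active.remove(b)
        merged = merge(a, b)
        history.append((a, b, merged))
        active.append(merged)
    return active[0], history
```

In this sketch only the fixed-size summaries would need to travel between sites, rather than the n points with d coordinates each that a centralized warehouse requires.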

Cite this article

Samatova, N.F., Ostrouchov, G., Geist, A. et al. RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases 11, 157–180 (2002). https://doi.org/10.1023/A:1013988102576

  • DOI: https://doi.org/10.1023/A:1013988102576
