Abstract
This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse, which incurs O(nd) transmission cost, where n is the number of data points and d is the number of dimensions. For large datasets, this is prohibitively expensive. In contrast, RACHET incurs at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic, in which the merges at each stage are chosen so as to minimize the volume of a covering hypersphere. To enable these choices, RACHET maintains descriptive statistics of constant complexity for each cluster centroid. RACHET's framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.
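The encircling tactic can be illustrated with a minimal sketch. The abstract does not say which descriptive statistics RACHET keeps, so the summary used here (point count, centroid, covering radius) is an assumption, and the names Summary, cover_radius, merge, and best_pair are hypothetical. For a fixed number of dimensions the volume of a hypersphere grows monotonically with its radius, so the sketch compares radii instead of volumes.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Summary:
    """Constant-size description of one cluster (assumed, not from the paper)."""
    n: int            # number of points in the cluster
    centroid: tuple   # mean of the points
    radius: float     # radius of a sphere around the centroid covering the points


def cover_radius(a: Summary, b: Summary) -> float:
    """Radius of the smallest hypersphere enclosing both cluster spheres."""
    d = math.dist(a.centroid, b.centroid)
    if d + b.radius <= a.radius:   # b's sphere lies inside a's
        return a.radius
    if d + a.radius <= b.radius:   # a's sphere lies inside b's
        return b.radius
    return (d + a.radius + b.radius) / 2


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries without touching the underlying points."""
    n = a.n + b.n
    centroid = tuple((a.n * x + b.n * y) / n
                     for x, y in zip(a.centroid, b.centroid))
    # Upper bound on the radius needed around the new centroid.
    radius = max(math.dist(centroid, a.centroid) + a.radius,
                 math.dist(centroid, b.centroid) + b.radius)
    return Summary(n, centroid, radius)


def best_pair(summaries):
    """Indices of the pair whose covering hypersphere is smallest."""
    return min(combinations(range(len(summaries)), 2),
               key=lambda ij: cover_radius(summaries[ij[0]], summaries[ij[1]]))


clusters = [
    Summary(2, (0.0, 0.0), 1.0),
    Summary(2, (3.0, 0.0), 1.0),
    Summary(1, (10.0, 0.0), 0.5),
]
i, j = best_pair(clusters)
merged = merge(clusters[i], clusters[j])
```

Each summary has a fixed size regardless of how many points it represents, which is the property that keeps the cost of exchanging cluster descriptions independent of the cluster sizes. The exhaustive pair search in best_pair is for illustration only and says nothing about how RACHET orders or prunes its candidate merges.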
Cite this article
Samatova, N.F., Ostrouchov, G., Geist, A. et al. RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases 11, 157–180 (2002). https://doi.org/10.1023/A:1013988102576
DOI: https://doi.org/10.1023/A:1013988102576