Abstract
This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(|S|n 2) time, with a O(|S|n) space requirement and O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This approach shows significant improvement over naive methods with O(n 2) communication costs in the case that the entire distance matrix is transmitted and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on the single link clustering and results comparing its performance with that of a centralized clustering algorithm are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is presented.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Dubes, R., Jain, A.: Clustering methodologies in exploratory data analysis. Advances In Computers 19 (1980) 113–228
Sibson, R.: Slink: An optimally efficient algorithm for the single-link cluster method. The Computer Journal 16 (1973) 30–34
Bradley, P., Fayyad, U., Reina, C.: Scaling clustering algorithms to large databases. In: Proceeding of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press (1998) 9–15
Zhang, T., Ramakrishnan, R., Livny, M.: Birch: An efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, ACM Press (1996) 103–114
Ng, R., Han, J.: Efficient and effective clustering methods for spatial data mining. In: Proceedings of 20th International Conference on Very Large Data Bases, Morgan Kaufmann (1994) 144–155
Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large databases. In: Proceedings ACM SIGMOD International Conference on Management of Data, ACM Press (1998) 73–84
Olson, C.: Parallel algorithms for hierarchical clustering. Parallel Computing 8 (1995) 1313–1325
Dhillon, I., Modha, D.: A data clustering algorithm on distributed memory multi-processors. In: Workshop on Large-Scale Parallel KDD Systems. (1999)
Kargupta, H., Hamzaoglu, I., Stafford, B., Hanagandi, V., Buescher, K.: PADMA: Parallel data mining agent for scalable text classification. In: Proceedings Conference on High Performance Computing’ 97, The Society for Computer Simulation International (1996) 290–295
Kargupta, H., Hamzaoglu, I., Stafford, B.: Scalable, distributed data mining using an agent based architecture. In Heckerman, D., Mannila, H., Pregibon, D., Uthurusamy, R., eds.: Proceedings of Knowledge Discovery And Data Mining, Menlo Park, CA, AAAI Press (1997) 211–214
Provost, F.J., Buchanan, B.: Inductive policy: The pragmatics of bias selection. Machine Learning 20 (1995) 35–61
Aronis, J.M., Kolluri, V., Provost, F.J., Buchanan, B.G.: The world: Knowledge discovery from multiple distributed data bases. Technical Report ISL-96-6, Intelligent Systems Laboratory, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA (1996)
Kargupta, H., Park, B., Hershbereger, D., Johnson, E.: Collective data mining: A new perspective toward distributed data mining. Accepted in the Advances in Distributed Data Mining, Eds: Hillol Kargupta and Philip Chan, AAAI/MIT Press (1999)
Hershberger, D., Kargupta, H.: Distributed multivariate regression using wavelet-based collective data mining. Technical Report EECS-99-02, School of EECS, Washington State University (1999)
Murtagh, F.: Multidimensional Clustering Algorithms. Physica-Verlag (1985)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Johnson, E.L., Kargupta, H. (2000). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In: Zaki, M.J., Ho, CT. (eds) Large-Scale Parallel Data Mining. Lecture Notes in Computer Science(), vol 1759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46502-2_12
Download citation
DOI: https://doi.org/10.1007/3-540-46502-2_12
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67194-7
Online ISBN: 978-3-540-46502-7
eBook Packages: Springer Book Archive