Collective, Hierarchical Clustering from Distributed, Heterogeneous Data

Johnson, Erik L.; Kargupta, Hillol

doi:10.1007/3-540-46502-2_12

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1759)

Abstract

This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. The algorithm first generates local cluster models and then combines them into a global cluster model of the data. It runs in O(|S|n²) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This is a significant improvement over naive methods, which incur O(n²) communication costs when the entire distance matrix is transmitted and O(nm) communication costs when the data are centralized, where m is the total number of features. A specific implementation based on single link clustering is presented, together with results comparing its performance with that of a centralized clustering algorithm and an analysis of the algorithm's complexity in terms of overall computation time and communication requirements.
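To make the naive baseline mentioned in the abstract concrete, the following is a minimal sketch of the "transmit the entire distance matrix" approach. It is not the CHC algorithm, whose details are not given here. It assumes that "heterogeneous" means each site holds a different subset of features for the same n records, and it assumes squared Euclidean distance, which adds up across disjoint feature subsets. All function and variable names are hypothetical.

```python
from itertools import combinations


def sq_dist_matrix(rows):
    """Pairwise squared Euclidean distances over one site's feature columns."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        s = float(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
        dist[i][j] = dist[j][i] = s
    return dist


def combine(site_matrices):
    """Sum the per-site matrices element-wise (valid for squared Euclidean)."""
    n = len(site_matrices[0])
    return [[sum(m[i][j] for m in site_matrices) for j in range(n)]
            for i in range(n)]


def single_link_merge_heights(dist):
    """Merge heights of the single link dendrogram, found Kruskal-style:
    scan pairs by increasing distance and merge distinct clusters."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    heights = []
    pairs = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            heights.append(d)
    return heights


# Toy data (assumed): 4 records, 3 features, split across two sites.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 3.0), (5.0, 5.0, 5.0)]
site_a = [row[:2] for row in data]   # site A holds features 0 and 1
site_b = [row[2:] for row in data]   # site B holds feature 2

# Each site ships its full local matrix: n(n-1)/2 values, i.e. O(n^2).
values_sent_per_site = len(data) * (len(data) - 1) // 2

global_dist = combine([sq_dist_matrix(site_a), sq_dist_matrix(site_b)])
central_dist = sq_dist_matrix(data)  # what centralizing all features gives

heights = single_link_merge_heights(global_dist)
```

On this toy input the combined matrix equals the centralized one, so both yield the same single link dendrogram. The cost is that every site sends a quadratic number of distances, which is the O(n²) communication the abstract contrasts with CHC's O(n).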

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, Washington State University, USA
Erik L. Johnson & Hillol Kargupta

Editor information

Editors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
Mohammed J. Zaki
K55/B1, IBM Almaden Research Center, 650 Harry Road, San Jose, CA, 95120, USA
Ching-Tien Ho

Cite this paper

Johnson, E.L., Kargupta, H. (2000). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In: Zaki, M.J., Ho, CT. (eds) Large-Scale Parallel Data Mining. Lecture Notes in Computer Science, vol 1759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46502-2_12

DOI: https://doi.org/10.1007/3-540-46502-2_12
Published: 17 May 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67194-7
Online ISBN: 978-3-540-46502-7
eBook Packages: Springer Book Archive
