
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1759)

Abstract

This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. The algorithm first generates local cluster models and then combines them into a global cluster model of the data. It runs in O(|S|n^2) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This is a significant improvement over naive methods, which incur O(n^2) communication costs when the entire distance matrix is transmitted and O(nm) communication costs when the data are centralized, where m is the total number of features. A specific implementation based on single-link clustering is presented, along with results comparing its performance with that of a centralized clustering algorithm. An analysis of the algorithm's complexity, in terms of overall computation time and communication requirements, is also presented.
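To make the communication bounds in the abstract concrete, the following minimal sketch compares the three asymptotic costs for a given n and m. It is an illustration only, not the paper's implementation: the function name `communication_costs` is hypothetical, all constant factors are assumed to be 1, and any dependence of the O(n) term on the number of sites is ignored.

```python
def communication_costs(n: int, m: int) -> dict:
    """Return rough counts of transmitted values for each strategy.

    Assumptions (not from the paper): constant factors are 1, and the
    O(n) bound is taken as exactly n values.
    """
    return {
        "collective": n,            # O(n): bound stated for the CHC approach
        "distance_matrix": n * n,   # O(n^2): transmit the entire distance matrix
        "centralized": n * m,       # O(nm): ship every feature of every element
    }


if __name__ == "__main__":
    # Example sizes chosen purely for illustration.
    for name, cost in communication_costs(n=10_000, m=50).items():
        print(f"{name:>16}: {cost:,} values")
```

Under these assumptions the O(n) bound is smaller than the centralization cost by a factor of m and smaller than the distance-matrix cost by a factor of n, which is the gap the abstract describes.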

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Johnson, E.L., Kargupta, H. (2000). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In: Zaki, M.J., Ho, CT. (eds) Large-Scale Parallel Data Mining. Lecture Notes in Computer Science, vol 1759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46502-2_12

  • DOI: https://doi.org/10.1007/3-540-46502-2_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67194-7

  • Online ISBN: 978-3-540-46502-7

  • eBook Packages: Springer Book Archive
