Ensemble Learning Based Distributed Clustering

  • Conference paper
Emerging Technologies in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4819)

Abstract

Data mining techniques such as clustering are usually applied to centralized data sets. Increasingly, however, data is generated and stored at local sites. Transmitting an entire local data set to a server is often unacceptable because of performance considerations, privacy and security concerns, and bandwidth constraints. In this paper, we propose a distributed clustering model based on ensemble learning that can analyze and mine distributed data sources to find global clustering patterns. A typical distributed clustering scenario is a ‘two-stage’ process: clustering is performed first at the local sites and then at the global site. The local clustering results transmitted to the server site form an ensemble, and the combining schemes of ensemble learning use this ensemble to generate the global clustering results. In the model, generating global patterns from the ensemble is formulated mathematically as a combinatorial optimization problem. As an implementation of the model, a novel distributed clustering algorithm called DK-means is presented. Experimental results show that DK-means achieves results similar to those of K-means applied to the centralized data set all at once, that it scales to data distributions that vary across local sites, and that the model is valid.
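The ‘two-stage’ scenario described in the abstract can be illustrated with a small sketch. The abstract does not specify how DK-means combines the ensemble, so the code below is not the authors' algorithm: it assumes each site runs plain k-means and sends only its centroids and cluster sizes to the server, which then runs a size-weighted k-means over those centroids. All function names (`kmeans`, `local_stage`, `global_stage`, `make_demo_sites`) and the synthetic data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, weights=None, iters=100, seed=0):
    """Lloyd's k-means on a list of float tuples, with optional point weights."""
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(points)
    centers = rng.sample(points, k)  # initialise from k distinct data points
    dim = len(points[0])

    def assign():
        return [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        new_centers = []
        for j in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels) if lab == j]
            if not members:  # keep the old center if a cluster becomes empty
                new_centers.append(centers[j])
                continue
            total = sum(w for _, w in members)
            new_centers.append(
                tuple(sum(p[d] * w for p, w in members) / total for d in range(dim))
            )
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign()


def local_stage(site_points, k):
    """Stage 1 (local site): cluster locally, return (centroid, size) summaries."""
    centers, labels = kmeans(site_points, k)
    return [(c, labels.count(j)) for j, c in enumerate(centers) if labels.count(j) > 0]


def global_stage(ensemble, k):
    """Stage 2 (server): combine all local summaries with size-weighted k-means.

    Assumption: this weighted re-clustering stands in for the paper's
    combining scheme, which the abstract does not describe.
    """
    points = [c for c, _ in ensemble]
    weights = [float(n) for _, n in ensemble]
    centers, _ = kmeans(points, k, weights=weights)
    return centers


def make_demo_sites(seed=42):
    """Hypothetical data: two 2-D blobs split unevenly across three sites."""
    rng = random.Random(seed)

    def blob(center, n):
        return [tuple(rng.gauss(c, 0.5) for c in center) for _ in range(n)]

    mix = [(30, 10), (10, 30), (20, 20)]  # per-site counts for blob A and blob B
    return [blob((0.0, 0.0), a) + blob((10.0, 10.0), b) for a, b in mix]


if __name__ == "__main__":
    sites = make_demo_sites()
    # Only centroids and counts leave each site, never the raw points.
    ensemble = [s for site in sites for s in local_stage(site, k=2)]
    print(sorted(global_stage(ensemble, k=2)))
```

In this sketch only a handful of (centroid, size) pairs cross the network per site, which is the point of the two-stage design: the raw local data stays where it was generated.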

Editor information

Takashi Washio, Zhi-Hua Zhou, Joshua Zhexue Huang, Xiaohua Hu, Jinyan Li, Chao Xie, Jieyue He, Deqing Zou, Kuan-Ching Li, Mário M. Freire

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ji, G., Ling, X. (2007). Ensemble Learning Based Distributed Clustering. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_32

  • DOI: https://doi.org/10.1007/978-3-540-77018-3_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77016-9

  • Online ISBN: 978-3-540-77018-3

  • eBook Packages: Computer Science, Computer Science (R0)
