Ensemble Learning Based Distributed Clustering

Ji, Genlin; Ling, Xiaohan

doi:10.1007/978-3-540-77018-3_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4819)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Data mining techniques such as clustering are usually applied to centralized data sets. Increasingly, however, data is generated and stored at local sites, and transmitting an entire local data set to a server is often unacceptable because of performance considerations, privacy and security concerns, and bandwidth constraints. In this paper, we propose a distributed clustering model based on ensemble learning that analyzes and mines distributed data sources to find global clustering patterns. Distributed clustering typically follows a two-stage process: clustering is performed first at the local sites and then at the global site. The local clustering results transmitted to the server site form an ensemble, and the combining schemes of ensemble learning use this ensemble to generate the global clustering results. In the model, generating global patterns from the ensemble is formulated mathematically as a combinatorial optimization problem. As an implementation of the model, we present a novel distributed clustering algorithm called DK-means. Experimental results show that DK-means achieves results similar to those of K-means applied to the whole centralized data set at once, that it scales to data distributions that vary across local sites, and that the model is valid.
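
The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the DK-means algorithm itself, whose details are not given on this page. It assumes that each local site runs plain k-means and sends only its centroids and cluster sizes to the server, and that the server combines them with a size-weighted k-means. The function names (`weighted_kmeans`, `two_stage_cluster`, `farthest_point_init`) and the farthest-point seeding are hypothetical choices made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from the seeds."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def weighted_kmeans(points, weights, k, iters=100):
    """Lloyd's iterations with per-point weights.
    Returns (centroids, total weight assigned to each centroid)."""
    dim = len(points[0])
    centroids = farthest_point_init(points, k)
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        # Assignment step: each point goes to its nearest centroid.
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Update step: weighted mean; an empty cluster keeps its centroid.
        updated = [
            [s / totals[j] for s in sums[j]] if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, totals


def two_stage_cluster(sites, k_local, k_global):
    """Stage 1: cluster each site locally. Stage 2: cluster the ensemble of
    local centroids at the server, weighting each by its cluster size."""
    ensemble_points, ensemble_weights = [], []
    for local_data in sites:
        cents, sizes = weighted_kmeans(local_data, [1.0] * len(local_data), k_local)
        for c, n in zip(cents, sizes):
            if n > 0:  # only non-empty local clusters are transmitted
                ensemble_points.append(c)
                ensemble_weights.append(n)
    return weighted_kmeans(ensemble_points, ensemble_weights, k_global)


if __name__ == "__main__":
    site_a = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    site_b = [(1, 1), (11, 10), (11, 11), (9, 10)]
    centroids, sizes = two_stage_cluster([site_a, site_b], k_local=2, k_global=2)
    print(centroids, sizes)
```

Only centroids and cluster sizes leave each site in this sketch, so the raw records are never transmitted. Weighting each local centroid by its cluster size makes a global centroid the mean of all records behind the local clusters merged into it, which is why such a scheme can approximate centralized K-means when the local clusters are well separated.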

Author information

Authors and Affiliations

Department of Computer Science, Nanjing Normal University, Nanjing 210097, P.R. China
Genlin Ji & Xiaohan Ling

Editor information

Takashi Washio, Zhi-Hua Zhou, Joshua Zhexue Huang, Xiaohua Hu, Jinyan Li, Chao Xie, Jieyue He, Deqing Zou, Kuan-Ching Li, Mário M. Freire

About this paper

Cite this paper

Ji, G., Ling, X. (2007). Ensemble Learning Based Distributed Clustering. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_32

DOI: https://doi.org/10.1007/978-3-540-77018-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science, Computer Science (R0)
