Communication-Efficient Exact Clustering of Distributed Streaming Data

Tran, Dang-Hoan; Sattler, Kai-Uwe

doi:10.1007/978-3-642-39640-3_31


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7975)

Abstract

A widely used approach to clustering a single data stream is the two-phase approach, in which an online phase creates and maintains micro-clusters and an offline phase generates the macro-clustering from those micro-clusters. We build on this approach to propose a distributed framework for clustering streaming data. Each remote-site process generates and maintains micro-clusters that summarize the cluster information of its local data stream. The remote sites either send their local micro-clusterings to the coordinator, or the coordinator invokes remote methods to retrieve the local micro-clusterings from the remote sites. Once it has received the local micro-clusterings from all remote sites, the coordinator generates the global clustering with the macro-clustering method. Our theoretical and empirical results show that the global clustering generated by our distributed framework is similar to the clustering generated by the underlying centralized algorithm on the same data set. By using the local micro-clustering approach, our framework achieves high scalability and communication efficiency.
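
The two-phase, coordinator-based flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that micro-clusters are additive summaries (count, per-dimension linear sum, per-dimension squared sum) in the style of BIRCH cluster features, that the online phase uses a simple distance-threshold rule, and that the macro-clustering method is a weighted k-means over micro-cluster centroids. All names (`MicroCluster`, `local_micro_clustering`, `macro_clustering`, `radius`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence
import math


@dataclass
class MicroCluster:
    """Additive summary of a group of points (assumed representation)."""
    n: int              # number of absorbed points
    ls: List[float]     # per-dimension linear sum
    ss: List[float]     # per-dimension sum of squares

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "MicroCluster":
        return cls(1, list(p), [x * x for x in p])

    def absorb(self, p: Sequence[float]) -> None:
        self.n += 1
        for i, x in enumerate(p):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_micro_clustering(stream, radius: float) -> List[MicroCluster]:
    """Online phase at one remote site (hypothetical threshold rule)."""
    mcs: List[MicroCluster] = []
    for p in stream:
        nearest = min(mcs, key=lambda m: dist(m.centroid(), p), default=None)
        if nearest is not None and dist(nearest.centroid(), p) <= radius:
            nearest.absorb(p)
        else:
            mcs.append(MicroCluster.from_point(p))
    return mcs


def macro_clustering(mcs: List[MicroCluster], k: int, iters: int = 20):
    """Offline phase at the coordinator: weighted k-means over centroids."""
    centers = [m.centroid() for m in mcs[:k]]  # naive seeding, for brevity
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        weights = [0] * k
        for m in mcs:
            j = min(range(k), key=lambda c: dist(centers[c], m.centroid()))
            weights[j] += m.n
            for i, s in enumerate(m.ls):
                sums[j][i] += s  # linear sums add up across micro-clusters
        centers = [[s / w for s in sums[j]] if w else centers[j]
                   for j, w in enumerate(weights)]
    return centers


# Two remote sites summarize their local streams; only the summaries
# are shipped to the coordinator, never the raw points.
site_a = local_micro_clustering([(0.0, 0.0), (0.1, 0.0), (10.0, 10.0)], radius=1.0)
site_b = local_micro_clustering([(10.1, 10.0), (0.0, 0.1)], radius=1.0)
centers = macro_clustering(site_a + site_b, k=2)
```

Because the counts and linear sums are additive, each centre computed by the coordinator in this sketch equals the mean of the raw points behind the micro-clusters assigned to it, even though those points never leave the remote sites. The communication cost per site then depends on the number of micro-clusters rather than on the length of the stream.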

Author information

Authors and Affiliations

Department of Computer Science & Automation, Ilmenau University of Technology, Germany
Dang-Hoan Tran & Kai-Uwe Sattler

Editor information

Editors and Affiliations

L-I.S.U.T. - D.A.P.I.t. Facoltà Ingegneria, Università degli Studi della Basilicata, Viale dell’Ateneo Lucano, 10, 85100, Potenza, Italy
Beniamino Murgante
Covenant University, Canaanland OTA, Nigeria
Sanjay Misra
Dipartimento di Scienze e Tecnologie per l’Agricoltura, le Foreste, la Natura e l’Energia, Università degli Studi della Tuscia, Via S. Camillo de Lellis, snc, 01100, Viterbo, Italy
Maurizio Carlini
Dipartimento di Scienze dell’Ingegneria Civile e dell’Architettura, Politecnico di Bari, Via Orabona, 4, 70125, Bari, Italy
Carmelo M. Torre
International University VNU-HCM, Quarter 6, Linh Trung, Thu Duc, Ho Chi Minh City, Vietnam
Hong-Quang Nguyen
School of Business Systems, Monash University, 3800, Clayton, VIC, Australia
David Taniar
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, 813-8503, Higashi-ku, Fukuoka, Japan
Bernady O. Apduhan
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli, 1, 06123, Perugia, Italy
Osvaldo Gervasi

About this paper

Cite this paper

Tran, DH., Sattler, KU. (2013). Communication-Efficient Exact Clustering of Distributed Streaming Data. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39640-3_31

DOI: https://doi.org/10.1007/978-3-642-39640-3_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39639-7
Online ISBN: 978-3-642-39640-3
eBook Packages: Computer Science, Computer Science (R0)
