Abstract
A widely used approach to clustering a single data stream is the two-phase approach, in which an online phase creates and maintains micro-clusters and an offline phase generates the macro-clustering from those micro-clusters. We use this approach to propose a distributed framework for clustering streaming data. Each remote-site process generates and maintains micro-clusters that summarize the cluster information in its local data stream. The remote sites either send their local micro-clusterings to the coordinator, or the coordinator invokes remote methods to retrieve them. Having received the local micro-clusterings from all remote sites, the coordinator generates the global clustering with the macro-clustering method. Our theoretical and empirical results show that the global clustering generated by our distributed framework is similar to the clustering generated by the underlying centralized algorithm on the same data set. By using local micro-clustering, our framework achieves high scalability and communication efficiency.
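The flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the micro-cluster representation or the macro-clustering method, so the sketch assumes additive summaries of the form (count, linear sum, squared sum) in the style of BIRCH cluster features, a fixed absorption radius at each site, and weighted k-means at the coordinator. The names `MicroCluster`, `RemoteSite`, and `macro_cluster` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Assumed summary: point count, per-dimension linear sum and squared sum."""
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "MicroCluster":
        return cls(1, list(x), [v * v for v in x])

    def absorb(self, x: Sequence[float]) -> None:
        # The summary is additive, so absorbing a point is a constant-time update.
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


class RemoteSite:
    """Online phase at one site (assumed rule: absorb within a fixed radius)."""

    def __init__(self, radius: float) -> None:
        self.radius = radius
        self.micro_clusters: List[MicroCluster] = []

    def process(self, x: Sequence[float]) -> None:
        best = min(self.micro_clusters,
                   key=lambda m: sq_dist(m.centroid(), x), default=None)
        if best is not None and sq_dist(best.centroid(), x) <= self.radius ** 2:
            best.absorb(x)
        else:
            self.micro_clusters.append(MicroCluster.from_point(x))


def macro_cluster(micro: List[MicroCluster], k: int, iters: int = 20) -> List[List[float]]:
    """Offline phase at the coordinator: weighted k-means over the summaries."""
    # Assumed deterministic seeding: the k heaviest micro-clusters.
    centers = [m.centroid() for m in sorted(micro, key=lambda m: -m.n)[:k]]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centers]
        weights = [0] * len(centers)
        for m in micro:
            c = m.centroid()
            j = min(range(len(centers)), key=lambda idx: sq_dist(centers[idx], c))
            weights[j] += m.n
            for i, v in enumerate(m.ls):
                sums[j][i] += v  # linear sums add up to the weighted mean
        centers = [[s / w for s in sm] if w else old
                   for sm, w, old in zip(sums, weights, centers)]
    return centers


# Two sites summarize their local streams; only the summaries reach the coordinator.
site_a, site_b = RemoteSite(radius=1.0), RemoteSite(radius=1.0)
for x in [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0)]:
    site_a.process(x)
for x in [(0.0, 0.2), (10.2, 10.0), (10.0, 10.2)]:
    site_b.process(x)
collected = site_a.micro_clusters + site_b.micro_clusters
centers = macro_cluster(collected, k=2)
```

In this toy run, six points are reduced to four micro-clusters before any communication takes place, which is the kind of saving the local micro-clustering approach aims for. The sketch makes no claim about how closely its result matches a centralized clustering.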
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tran, DH., Sattler, KU. (2013). Communication-Efficient Exact Clustering of Distributed Streaming Data. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39640-3_31
DOI: https://doi.org/10.1007/978-3-642-39640-3_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39639-7
Online ISBN: 978-3-642-39640-3
eBook Packages: Computer Science, Computer Science (R0)