Communication-Efficient Exact Clustering of Distributed Streaming Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7975)

Abstract

A widely used approach to clustering a single data stream is the two-phase approach, in which an online phase creates and maintains micro-clusters and an offline phase generates the macro-clustering from those micro-clusters. We use this approach to propose a distributed framework for clustering streaming data. Every remote-site process generates and maintains micro-clusters that summarize the cluster information of its local data stream. Either the remote sites send their local micro-clusterings to the coordinator, or the coordinator invokes remote methods to retrieve them. Once it has received all local micro-clusterings, the coordinator generates the global clustering with the macro-clustering method. Our theoretical and empirical results show that the global clustering generated by our distributed framework is similar to the clustering generated by the underlying centralized algorithm on the same data set. By using local micro-clustering, our framework achieves high scalability and communication efficiency.
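
The preview does not include the paper's data structures, so the following is only a minimal sketch of the flow described above, not the authors' implementation. It assumes that each micro-cluster is a BIRCH-style cluster feature vector (count, linear sum, squared sum) and that a point joins the nearest micro-cluster when it lies within a fixed radius. The names `MicroCluster`, `RemoteSite`, `coordinator_collect`, and the `radius` parameter are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Assumed summary: point count, per-dimension linear sum and squared sum."""
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "MicroCluster":
        return cls(1, list(x), [v * v for v in x])

    def absorb(self, x: Sequence[float]) -> None:
        # Update the summary in place; the raw point is not stored.
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Summaries are additive, so merging needs no access to raw data.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]


class RemoteSite:
    """Online phase at one site (hypothetical radius-based assignment rule)."""

    def __init__(self, radius: float) -> None:
        self.radius = radius
        self.micro_clusters: List[MicroCluster] = []

    def process(self, x: Sequence[float]) -> None:
        nearest = min(
            self.micro_clusters,
            key=lambda m: math.dist(m.centroid(), x),
            default=None,
        )
        if nearest is not None and math.dist(nearest.centroid(), x) <= self.radius:
            nearest.absorb(x)
        else:
            self.micro_clusters.append(MicroCluster.from_point(x))


def coordinator_collect(sites: Sequence[RemoteSite]) -> List[MicroCluster]:
    """Coordinator side: gather every site's local micro-clustering."""
    return [m for site in sites for m in site.micro_clusters]


# Usage: two sites summarize their local streams, the coordinator gathers them.
site_a, site_b = RemoteSite(radius=1.0), RemoteSite(radius=1.0)
for point in [(0.0, 0.0), (0.2, 0.0)]:
    site_a.process(point)
site_b.process((10.0, 10.0))
global_summaries = coordinator_collect([site_a, site_b])
```

Under these assumptions only the summaries cross the network, never the raw points, which is where the communication saving would come from. The coordinator could then run any macro-clustering method over `global_summaries`, for example a weighted k-means on the centroids.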

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tran, DH., Sattler, KU. (2013). Communication-Efficient Exact Clustering of Distributed Streaming Data. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39640-3_31

  • DOI: https://doi.org/10.1007/978-3-642-39640-3_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39639-7

  • Online ISBN: 978-3-642-39640-3

  • eBook Packages: Computer Science, Computer Science (R0)
