StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality

  • Conference paper
Web Information Systems Engineering – WISE 2018 (WISE 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11234)

Abstract

Data stream clustering is an important problem in data mining. Because a data stream grows without bound, the sheer volume of data makes storing it difficult. A number of algorithms have been proposed for data stream clustering, such as CluStream, DenStream, DStream and StrAP. With the arrival of the Big Data era, the amount of data arriving per timestamp is growing rapidly, so the time efficiency of data stream clustering algorithms is drawing considerable attention from researchers; some state-of-the-art algorithms achieve excellent cluster purity but are unacceptably slow. In this paper, we propose StrDip, a fast data stream clustering algorithm that combines the Dip Test of Unimodality with the online/offline two-stage stream clustering framework. StrDip also adopts a novel clustering feature vector and several microcluster pruning methods. Experiments on synthetic and real-world datasets show that, compared with other algorithms, StrDip gains a large advantage in time efficiency while maintaining good clustering purity and quality.
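To make the online/offline two-stage framework concrete, the following is a minimal sketch of the generic, CluStream-style pattern it refers to: an online stage that folds each arriving point into an additive micro-cluster summary (count, linear sum, squared sum), and an offline stage that works only on those summaries. It is not StrDip itself: the novel clustering feature vector, the pruning methods and the use of the dip test are not described in the abstract and are not reproduced here. The names `MicroCluster`, `online_update`, `offline_snapshot` and the `max_dist` threshold are hypothetical, chosen for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum.

    This is the classic clustering-feature triple, NOT the novel feature
    vector proposed for StrDip (which the abstract does not specify).
    """
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "MicroCluster":
        return cls(1, list(x), [v * v for v in x])

    def insert(self, x: Sequence[float]) -> None:
        # The summary is additive, so absorbing a point is O(d).
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        # Root-mean-square deviation of absorbed points from the centroid.
        var = sum(
            self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
            for i in range(len(self.ls))
        )
        return math.sqrt(max(var, 0.0))


def online_update(clusters: List[MicroCluster],
                  x: Sequence[float],
                  max_dist: float) -> None:
    """Online stage: absorb x into the nearest micro-cluster, or open a new one.

    max_dist is an assumed, illustrative absorption threshold.
    """
    best, best_d = None, float("inf")
    for mc in clusters:
        d = math.dist(mc.centroid(), x)
        if d < best_d:
            best, best_d = mc, d
    if best is not None and best_d <= max_dist:
        best.insert(x)
    else:
        clusters.append(MicroCluster.from_point(x))


def offline_snapshot(clusters: List[MicroCluster]) -> List[Tuple[List[float], int]]:
    """Offline stage input: (centroid, weight) pairs.

    A macro-clustering method would run on these summaries on demand;
    that step is deliberately left out of this sketch.
    """
    return [(mc.centroid(), mc.n) for mc in clusters]
```

Because each point is processed once and only the summaries are stored, memory use depends on the number of micro-clusters rather than on the length of the stream, which is the storage concern the abstract raises.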

Notes

  1. Available at the following website: github.com/samhelmholtz/skinny-dip.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61772289), the Natural Science Foundation of Tianjin (No. 17JCQNJC00200) and the Fundamental Research Funds for the Central Universities.

Corresponding author

Correspondence to Ying Zhang.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Luo, Y., Zhang, Y., Ding, X., Cai, X., Song, C., Yuan, X. (2018). StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11234. Springer, Cham. https://doi.org/10.1007/978-3-030-02925-8_14

  • DOI: https://doi.org/10.1007/978-3-030-02925-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02924-1

  • Online ISBN: 978-3-030-02925-8

  • eBook Packages: Computer Science, Computer Science (R0)
