Abstract
Data stream clustering is an important problem in data mining. Because a data stream grows without bound, the sheer volume of data makes storage difficult. A number of algorithms have been proposed for data stream clustering, such as CluStream, DenStream, DStream and StrAP. With the arrival of the Big Data era, the amount of data arriving per timestamp is growing rapidly, so the time efficiency of data stream clustering algorithms is attracting considerable attention from researchers; some state-of-the-art algorithms achieve excellent cluster purity but are unacceptably slow. In this paper, we propose StrDip, a fast data stream clustering algorithm that combines the Dip Test of Unimodality with the online/offline two-stage stream clustering framework. StrDip also adopts a novel clustering feature vector and several micro-cluster pruning methods. Experiments on synthetic and real-world datasets show that, compared with other algorithms, StrDip gains a large advantage in time efficiency while maintaining good clustering purity and quality.
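The abstract reports cluster purity without defining it. As a rough illustration only, the sketch below computes one definition that is common in the stream clustering literature: the mean, over clusters, of the fraction of points that carry the cluster's dominant ground-truth label. It is an assumption that the paper uses this variant; the function name `average_purity` and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter


def average_purity(assignments, labels):
    """Mean per-cluster purity (assumed definition, not taken from the paper).

    assignments: cluster id for each point
    labels:      ground-truth class label for each point
    """
    if len(assignments) != len(labels):
        raise ValueError("assignments and labels must have the same length")

    # Count how often each ground-truth label occurs inside each cluster.
    label_counts = {}
    for cluster_id, label in zip(assignments, labels):
        label_counts.setdefault(cluster_id, Counter())[label] += 1

    if not label_counts:
        raise ValueError("at least one point is required")

    # Purity of one cluster = share of its points carrying the dominant label.
    per_cluster = [
        max(counts.values()) / sum(counts.values())
        for counts in label_counts.values()
    ]
    return sum(per_cluster) / len(per_cluster)


# Toy example: cluster 0 has purity 2/3, cluster 1 has purity 3/4.
toy_assignments = [0, 0, 0, 1, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b", "a"]
print(average_purity(toy_assignments, toy_labels))
```

Note that this averages over clusters, so small and large clusters weigh equally; the alternative definition that sums dominant-label counts and divides by the total number of points weighs clusters by size and can give a different value.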
Notes
1. Available at the following website: github.com/samhelmholtz/skinny-dip.
Acknowledgements
This research is supported by the National Natural Science Foundation of China (No. 61772289), the Natural Science Foundation of Tianjin (No. 17JCQNJC00200) and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Luo, Y., Zhang, Y., Ding, X., Cai, X., Song, C., Yuan, X. (2018). StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11234. Springer, Cham. https://doi.org/10.1007/978-3-030-02925-8_14
DOI: https://doi.org/10.1007/978-3-030-02925-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02924-1
Online ISBN: 978-3-030-02925-8
eBook Packages: Computer Science, Computer Science (R0)