StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality

Luo, Yonghong; Zhang, Ying; Ding, Xiaoke; Cai, Xiangrui; Song, Chunyao; Yuan, Xiaojie

doi:10.1007/978-3-030-02925-8_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11234)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Data stream clustering is an important problem in data mining. Because a data stream grows without bound, the volume of data places a heavy burden on storage. A number of algorithms have been proposed for data stream clustering, such as CluStream, DenStream, DStream and StrAP. With the arrival of the Big Data era, the amount of data arriving at each timestamp is growing rapidly, so the time efficiency of data stream clustering algorithms is attracting considerable attention from researchers; some state-of-the-art algorithms achieve excellent cluster purity but are unacceptably slow. In this paper, we propose StrDip, a fast data stream clustering algorithm that combines the Dip Test of Unimodality with the online/offline two-stage stream clustering framework. StrDip also adopts a novel clustering feature vector and several micro-cluster pruning methods. Experiments on synthetic and real-world datasets show that, compared with other algorithms, StrDip gains a large advantage in time efficiency while maintaining good clustering purity and quality.
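
As a rough illustration of the online half of the online/offline two-stage framework mentioned in the abstract, the sketch below maintains micro-clusters over a stream and prunes stale ones. It is not the StrDip implementation. The abstract does not specify the novel clustering feature vector, the pruning rules, or how the dip test is applied, so the code assumes the classic additive feature vector (count, linear sum, squared sum) and a simple age-based pruning rule. The names `MicroCluster`, `online_update`, `max_dist` and `max_age` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Additive cluster feature (n, ls, ss). Assumed here; not the paper's vector."""
    n: int          # number of absorbed points
    ls: list        # per-dimension linear sum
    ss: list        # per-dimension sum of squares
    last_seen: int  # timestamp of the most recent absorbed point

    @classmethod
    def from_point(cls, x, t):
        return cls(1, list(x), [v * v for v in x], t)

    def absorb(self, x, t):
        # The feature vector is additive, so an update costs O(d).
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.last_seen = t

    def center(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root of the summed per-dimension variance around the centroid.
        var = sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                  for i in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


def online_update(clusters, x, t, max_dist=1.0, max_age=100):
    """Online stage: prune stale micro-clusters, then absorb x or open a new one."""
    # Hypothetical pruning rule: drop micro-clusters not updated for max_age ticks.
    clusters[:] = [c for c in clusters if t - c.last_seen <= max_age]
    nearest = min(clusters, key=lambda c: math.dist(c.center(), x), default=None)
    if nearest is not None and math.dist(nearest.center(), x) <= max_dist:
        nearest.absorb(x, t)
    else:
        clusters.append(MicroCluster.from_point(x, t))
    return clusters


def offline_snapshot(clusters):
    """Input for an offline stage: weighted micro-cluster centers."""
    return [(c.center(), c.n) for c in clusters]
```

In a two-stage design of this kind, an offline step would periodically cluster the weighted centers returned by `offline_snapshot`. In StrDip that step is where the dip test is combined with the framework, and it is not reproduced here.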

Notes

1. Available at the following website: github.com/samhelmholtz/skinny-dip.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61772289), the Natural Science Foundation of Tianjin (No. 17JCQNJC00200) and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

College of Computer Science, Nankai University, Tianjin, China
Yonghong Luo, Ying Zhang, Xiaoke Ding, Xiangrui Cai, Chunyao Song & Xiaojie Yuan

Corresponding author

Correspondence to Ying Zhang.

Editor information

Editors and Affiliations

Zayed University, Dubai, United Arab Emirates
Hakim Hacid
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Victoria University, Footscray, VIC, Australia
Hua Wang
University of New South Wales, Sydney, NSW, Australia
Hye-Young Paik
Swinburne University of Technology, Hawthorn, VIC, Australia
Rui Zhou

About this paper

Cite this paper

Luo, Y., Zhang, Y., Ding, X., Cai, X., Song, C., Yuan, X. (2018). StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11234. Springer, Cham. https://doi.org/10.1007/978-3-030-02925-8_14

DOI: https://doi.org/10.1007/978-3-030-02925-8_14
Published: 21 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02924-1
Online ISBN: 978-3-030-02925-8
eBook Packages: Computer Science, Computer Science (R0)
