Abstract
A data stream is a potentially massive, continuous, and rapid sequence of data. Data streams have attracted considerable attention and research activity in the field of data mining. Because clustering is an effective data mining tool, data stream clustering is a natural focus of data stream mining research. Many effective data stream clustering algorithms have been proposed to handle the high-dimensional, dynamic, and real-time characteristics of data streams. In addition, stream data are often uncertain and contain outliers and noise, so developing clustering algorithms that remain effective under these conditions is crucial. This paper reviews the development and trends of data stream clustering and analyzes typical data stream clustering algorithms proposed in recent years, such as the Birch algorithm, the Local Search algorithm, the Stream algorithm, and the CluStream algorithm. We also summarize the latest research achievements in this field and introduce some new strategies for dealing with outliers and noisy data. Finally, we put forward the focal points and difficulties of future research on data stream clustering.
Acknowledgments
This work is supported by the National Basic Research Program of China (No. 2013CB329502), the National Natural Science Foundation of China (No. 41074003), and the Opening Foundation of Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (No. IIP2010-1).
About this article
Cite this article
Ding, S., Wu, F., Qian, J. et al. Research on data stream clustering algorithms. Artif Intell Rev 43, 593–600 (2015). https://doi.org/10.1007/s10462-013-9398-7
DOI: https://doi.org/10.1007/s10462-013-9398-7