Research on data stream clustering algorithms

Artificial Intelligence Review

Abstract

A data stream is a potentially massive, continuous, and rapid sequence of data. It has attracted considerable attention and research effort in the field of data mining. Because clustering is an effective data mining tool, data stream clustering is bound to become a focus of research in data stream mining. Many effective data stream clustering algorithms have been proposed to cope with the high-dimensional, dynamic, and real-time characteristics of data streams. In addition, data streams are uncertain and often contain outliers and noise, so developing effective data stream clustering algorithms is crucial. This paper reviews the development and trends of data stream clustering and analyzes typical data stream clustering algorithms proposed in recent years, such as the BIRCH algorithm, the Local Search algorithm, the Stream algorithm, and the CluStream algorithm. We also summarize the latest research achievements in this field and introduce some new strategies for dealing with outliers and noisy data. Finally, we put forward the focal points and difficulties of future research on data stream clustering.



Acknowledgments

This work is supported by the National Basic Research Program of China (No. 2013CB329502), the National Natural Science Foundation of China (No. 41074003), and the Opening Foundation of Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (No. IIP2010-1).

Author information

Corresponding author

Correspondence to Jun Qian.

About this article

Cite this article

Ding, S., Wu, F., Qian, J. et al. Research on data stream clustering algorithms. Artif Intell Rev 43, 593–600 (2015). https://doi.org/10.1007/s10462-013-9398-7

  • DOI: https://doi.org/10.1007/s10462-013-9398-7
