Abstract
Clustering is a fundamental operation that plays an essential role in data management and analysis. Clustering algorithms have been well studied over the past two decades, but real-time clustering has yet to mature in practice. For applications built on clustering, capturing the dynamic changes of clusters and the trends of moving objects in real time can maximize the value of the data. Although a DSPE (Distributed Stream Processing Engine) is capable of such workloads, it still suffers from fixed window sizes and wasted computational resources. In this paper, we introduce a new Cost-effective and Adaptive Clustering method (CeAC), which improves computational efficiency while preserving the accuracy of the clustering result. Specifically, we design a composite window model that contains the latest data records and maintains historical states. To achieve lightweight clustering, we propose a fully online clustering algorithm based on grid density, which can capture clusters of arbitrary shape and effectively handle outliers in parallel. We further introduce an adaptive calculation model that accelerates the clustering operation by shedding workload according to the characteristics of the incoming data. Experimental results show that the proposed method is accurate and efficient in real-time data stream clustering.
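
To make the grid-density idea concrete, the following is a minimal batch sketch in Python. It is not the CeAC algorithm, whose details the abstract does not give; it only illustrates the general technique of mapping points to grid cells, keeping cells whose point count reaches a threshold, and merging adjacent dense cells into clusters. The function name `grid_density_clusters` and the parameters `cell_size` and `min_points` are hypothetical, chosen for illustration only.

```python
from collections import Counter, deque
from itertools import product


def grid_density_clusters(points, cell_size=1.0, min_points=3):
    """Label 2-D points by connecting adjacent dense grid cells.

    Illustrative sketch only: `cell_size` and `min_points` are assumed
    parameters, not values taken from the paper.
    """
    # Map every point to its integer grid cell and count points per cell.
    cell_of = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    density = Counter(cell_of)

    # A cell is dense when it holds at least `min_points` points.
    dense = {cell for cell, count in density.items() if count >= min_points}

    # Breadth-first flood fill over the 8-neighbourhood of dense cells,
    # so that touching dense cells end up in the same cluster.
    label = {}
    next_id = 0
    for start in sorted(dense):
        if start in label:
            continue
        label[start] = next_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx, dy in product((-1, 0, 1), repeat=2):
                neighbour = (cx + dx, cy + dy)
                if neighbour in dense and neighbour not in label:
                    label[neighbour] = next_id
                    queue.append(neighbour)
        next_id += 1

    # Points that fall in sparse cells are reported as outliers (-1).
    return [label.get(cell, -1) for cell in cell_of]
```

Because clusters are formed by chaining neighbouring dense cells rather than by distance to a centroid, a sketch of this kind can follow arbitrarily shaped regions, and isolated points in sparse cells fall out naturally as outliers. An online variant would additionally need to update and decay cell counts as records arrive and expire, which this batch version does not attempt.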










Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61802273), the Postdoctoral Science Foundation of China (No. 2020M681529), and the Natural Science Foundation for Colleges and Universities in Jiangsu Province (No. 18KJB520044).
About this article
Cite this article
Xia, Y., Fang, J., Chao, P. et al. Cost-effective and adaptive clustering algorithm for stream processing on cloud system. Geoinformatica 27, 1–21 (2023). https://doi.org/10.1007/s10707-021-00442-1
DOI: https://doi.org/10.1007/s10707-021-00442-1