Abstract
Clustering is an important data mining method for data streams. Data stream clustering is a branch of clustering in which patterns are processed in an ordered sequence. It faces several challenges because stream data are high-speed, high-volume, evolving, unstable and unbounded. Over time, the input space of the data changes. This phenomenon is called concept drift, and investigating it is of high importance. In this paper, a new method called dynamic clustering of data streams considering concept drift is developed, which is an incremental supervised clustering algorithm. In the proposed algorithm, the data stream is automatically clustered in a supervised manner, and clusters whose values decrease over time are identified and then eliminated. Moreover, the generated clusters can be used to classify unlabeled data. Experimental results on 15 UCI datasets show that the proposed method outperforms existing techniques.
References
Aggarwal CC, Yu Philip S, Han J, Wang J (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference. Elsevier, pp 81–92
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications (ACM)
Amini A, Wah TY, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29:116–141
Amini A, Saboohi H, Herawan T, Wah TY (2016) MuDi-Stream: a multi density clustering algorithm for evolving data stream. J Netw Comput Appl 59:370–385
Anvaripour M, Soltanpour S, Razavi-Far R, Saif M, Jonathan Wu QM (2016) A supervised cooperative clustering scheme for diagnosing process faults in an industrial plant. In: Evolutionary computation (CEC), 2016 IEEE congress on. IEEE, pp 160–67
Asadi S, Ehsan Roshan S (2021) A bi-objective optimization method to produce a near-optimal number of classifiers and increase diversity in Bagging. Knowl-Based Syst 213:106656
Asadi S, Shahrabi J (2017) Complexity-based parallel rule induction for multiclass classification. Inf Sci 380:53–73
Asadi S, Ehsan Roshan S, Kattan MW (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform 115:103690
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 1–16
Barddal JP, Gomes HM, Enembreck F, Barthès J-P (2016) SNCStream+: extending a high quality true anytime data stream clustering algorithm. Inf Syst 62:60–73
Baruah RD, Angelov P (2013) DEC: dynamically evolving clustering and its application to structure identification of evolving fuzzy models. IEEE Trans Cybern 44:1619–1631
Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58:180–204
Berkhin P (2006) A survey of clustering data mining techniques. Grouping multidimensional data. Springer, Berlin
Bezerra CG, Costa BSJ, Guedes LA, Angelov PP (2020) An evolving approach to data streams clustering based on typicality and eccentricity data analytics. Inf Sci 518:13–28
Bi X, Zhang C, Zhao X, Li D, Sun Y, Ma Y (2020) CODES: efficient incremental semi-supervised classification over drifting and evolving social streams. IEEE Access 8:14024–14035
Bones CC, Romani LAS, de Sousa EPM (2016) Improving multivariate data streams clustering. Procedia Comput Sci 80:461–471
Bouguelia M-R, Belaïd Y, Belaïd A (2013) An adaptive incremental clustering method based on the growing neural gas algorithm. In: 2nd international conference on pattern recognition applications and methods-ICPRAM 2013. SciTePress, pp 42–49
Bungkomkhun P, Auwatanamongkol S (2009) Grid-based supervised clustering-GBSC. World Acad Sci Eng Technol 60:536–543
Cai H, Liu B, Xiao Y, Yue Lin L (2020) Semi-supervised multi-view clustering based on orthonormality-constrained nonnegative matrix factorization. Inf Sci 536:171–184
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 133–42
Chen D, Yang Q, Liu J, Zeng Z (2020) Selective prototype-based learning on concept-drifting data streams. Inf Sci 516:20–32
De Andrade Silva J, Raul Hruschka E, Gama J (2017) An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Syst Appl 67:228–238
Din SU, Shao J, Kumar J, Ali W, Liu J, Ye Y (2020) Online reliable semi-supervised learning on evolving data streams. Inf Sci
Donyavi Z, Asadi S (2020) Using decomposition-based multi-objective evolutionary algorithm as synthetic example optimization for self-labeling. Swarm Evolut Comput 58:100736
Eick CF, Zeidat N, Zhao Z (2004a) Supervised clustering-algorithms and benefits. In: 16Th IEEE international conference on tools with artificial intelligence. IEEE, pp 774–776
Eick CF, Zeidat N, Zhao Z (2004b) Supervised clustering-algorithms and benefits. In: Tools with artificial intelligence, 2004. ICTAI 2004. 16th IEEE international conference on. IEEE, pp 774–76
Erra U, Senatore S, Minnella F, Caggianese G (2015) Approximate TF–IDF based on topic extraction from massive message stream using the GPU. Inf Sci 292:143–161
Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Disc 26:1–26
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv (CSUR) 46:1–37
Georgieva O, Klawonn F (2008) Dynamic data assigning assessment clustering of streaming data. Appl Soft Comput 8:1305–1313
Ghesmoune M, Lebbah M, Azzag H (2016) A new growing neural gas for clustering data streams. Neural Netw 78:36–50
Guha S, Mishra N (2016) Clustering data streams. Data stream management. Springer, Berlin
Guo K, Zhang Q (2013) Fast clustering-based anonymization approaches with time constraints for data streams. Knowl-Based Syst 46:95–108
Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: Proceedings of the 24th international conference on machine learning, pp 345–352
Hamza H, Belaïd Y, Belaïd A, Baran Chaudhuri B (2008) Incremental classification of invoice documents. In: 19th international conference on pattern recognition-ICPR 2008. IEEE, p 4
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11:56–76
Hassani M, Spaus P, Medhat Gaber M, Seidl T (2012) Density-based projected clustering of data streams. In: International conference on scalable uncertainty management. Springer, pp 311–324
Jirayusakul A (2007) Supervised growing neural gas algorithm in clustering analysis
Islam MK, Ahmed MM, Zamli KZ (2019) A buffer-based online clustering for evolving data stream. Inf Sci 489:113–135
Kavitha M, Baby R (2017) Survey on micro clustering data streams using agglomerative approach. Int J Eng Sci 7:1–4
Khan I, Huang JZ, Ivanov K (2016) Incremental density-based ensemble clustering over evolving data streams. Neurocomputing 191:34–43
Kranen P, Assent I, Baldauf C, Seidl T (2009) Self-adaptive anytime stream clustering. In: Data mining, 2009. ICDM'09. Ninth IEEE international conference on. IEEE, pp 249–258
Li Y, Li D, Wang S, Zhai Y (2014) Incremental entropy-based clustering on categorical data streams with concept drift. Knowl-Based Syst 59:33–47
Li Z, Huang W, Xiong Y, Ren S, Zhu T (2020) Incremental learning imbalanced data streams with concept drift: the dynamic updated ensemble algorithm. Knowl-Based Syst 195:105694
Liu H, Fu Y (2015) Clustering with partition level side information. In: Data mining (ICDM), 2015 IEEE international conference on. IEEE, pp 877–882
Liu H, Shao M, Li S, Fu Y (2016) Infinite ensemble for image clustering. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1745–54
Lughofer E, Weigl E, Heidl W, Eitzinger C, Radauer T (2016) Recognizing input space and target concept drifts in data streams with scarcely labeled and unlabelled instances. Inf Sci 355:127–151
Mann AK, Kaur N (2013) Review paper on clustering techniques. Glob J Comput Sci Technol 13:1–7
Masmoudi N, Azzag H, Lebbah M, Bertelle C, Jemaa MB (2016) Cl-AntInc algorithm for clustering binary data streams using the ants behavior. Procedia Comput Sci 96:187–196
Michel V, Gramfort A, Varoquaux G, Eger E, Keribin C, Thirion B (2012) A supervised clustering approach for fMRI-based inference of brain states. Pattern Recogn 45:2041–2049
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, Burlington
Mohamad S, Bouchachia A (2020) Deep online hierarchical dynamic unsupervised learning for pattern mining from utility usage data. Neurocomputing 390:359–373
Mythily R, Banu A, Raghunathan S (2015) Clustering models for data stream mining. Procedia Comput Sci 46:619–626
O'Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings 18th international conference on data engineering. IEEE, pp 685–694
Otgonbayar A, Pervez Z, Dahal K, Eager S (2018) K-VARP: K-anonymity for varied data streams via partitioning. Inf Sci 467:238–255
Pan F, Wang W, Tung AKH, Yang J (2005a) Finding representative set from massive data. In: Fifth IEEE international conference on data mining (ICDM'05). IEEE, p 8
Pan F, Wang W, Tung AKH, Yang J (2005b) Finding representative set from massive data. In: Data mining, fifth IEEE international conference on. IEEE, p 8
Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63:528–549
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6:90–105
Pavithra M, Parvathi RMS (2017) A survey on clustering high dimensional data techniques. Int J Appl Eng Res 12:2893–2899
Peralta B, Caro A, Soto A (2016) A proposal for supervised clustering with Dirichlet process using labels. Pattern Recogn Lett 80:52–57
Ramírez-Gallego S, Krawczyk B, García S, Woźniak M, Herrera F (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239:39–57
Rehman M-Z, Li T, Yang Y, Wang H (2014) Hyper-ellipsoidal clustering technique for evolving data stream. Knowl-Based Syst 70:3–14
Ren Y, Kangrong Hu, Dai X, Pan L, Hoi SCH, Zenglin Xu (2019) Semi-supervised deep embedded clustering. Neurocomputing 325:121–130
Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20:615–627
Roshan SE, Asadi S (2021) Development of ensemble learning classification with density peak decomposition-based evolutionary multi-objective optimization. Int J Mach Learn Cybern 12:1737–1751
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Sakamoto Y, Fukui K-I, Gama J, Nicklas D, Moriyama K, Numao M (2015) Concept drift detection with clustering via statistical change detection methods. In: 2015 seventh international conference on knowledge and systems engineering (KSE). IEEE, pp 37–42
Shao M, Li S, Ding Z, Fu Y (2015) Deep linear coding for fast graph clustering. In: Twenty-fourth international joint conference on artificial intelligence, pp 3798–3804
Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems
Shindler M, Wong A, Meyerson AW (2011b) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems, pp 2375–2383
Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46:1–31
Śmieja M, Geiger BC (2017) Semi-supervised cross-entropy clustering with information bottleneck constraint. Inf Sci 421:254–271
Song G, Ye Y, Zhang H, Xiaofei X, Lau RYK, Liu F (2016) Dynamic clustering forest: an ensemble framework to efficiently classify textual data stream with concept drift. Inf Sci 357:125–143
Su Q, Chen L (2015) A method for discovering clusters of e-commerce interest patterns using click-stream data. Electron Commerce Res Appl 14:1–13
Sun J, Fujita H, Chen P, Li H (2017) Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble. Knowl-Based Syst 120:4–14
Tasoulis DK, Adams NM, Hand DJ (2006) Unsupervised clustering in streaming data. In: Data mining workshops, 2006. ICDM workshops 2006. Sixth IEEE international conference on. IEEE, pp 638–642
Toshniwal D (2013) Clustering techniques for streaming data-a survey. In: Advance computing conference (IACC), 2013 IEEE 3rd international. IEEE, pp 951–956
Treechalong K, Rakthanmanon T, Waiyamai K (2015) Semi-supervised stream clustering using labeled data points. In: International workshop on machine learning and data mining in pattern recognition. Springer, pp 281–295
Tu Q, Lu JF, Yuan B, Tang JB, Yang J-Y (2012) Density-based hierarchical clustering for streaming data. Pattern Recogn Lett 33:641–645
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: International conference on advanced data mining and applications, pp 605–615. Springer
Webb GI, Kuan Lee L, Petitjean F, Goethals B (2017) Understanding concept drift. arXiv preprint http://arxiv.org/abs/1704.00362
Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 3:408–421
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678
Xu S, Feng L, Liu S, Qiao H (2020) Self-adaption neighborhood density clustering method for mixed data stream with concept drift. Eng Appl Artif Intell 89:103451
Yan M, Wai M (2020) Accurate detecting concept drift in evolving data streams. ICT Express 6:332–338
Ye N, Li X (2001a) A scalable clustering technique for intrusion signature recognition. In: Proceedings of the 2001 IEEE workshop on information assurance and security, pp 5–6. Citeseer
Ye N, Li X (2001b) A scalable clustering technique for intrusion signature recognition. In: Proceedings of 2001 IEEE workshop on information assurance and security, pp 1–4. Citeseer
Zeidat N, Eick CF, Zhao Z (2005) Supervised clustering: algorithms and applications. University of Houston, Houston
Zheng L, Huo H, Guo Y, Fang T (2017) Supervised adaptive incremental clustering for data stream of chunks. Neurocomputing 219:502–517
Appendix A Silhouette Index
Kaufman and Rousseeuw (Rousseeuw 1987) introduced the Silhouette index, which is constructed to show graphically how well each object is classified in a given clustering output. For a dataset of n objects, the index is the mean silhouette width over all objects, \(S=\frac{1}{n}\sum_{i=1}^{n}s_{(i)}\), where
- \(s_{(i)}=\frac{b_{(i)}-a_{(i)}}{\max\left\{a_{(i)},\,b_{(i)}\right\}}\) is the silhouette width of the ith object,
- \(a_{(i)}=\frac{\sum_{j\in C_{r}\setminus\{i\}}d_{ij}}{n_{r}-1}\) is the average dissimilarity of the ith object to all other objects of its own cluster \(C_{r}\),
- \(b_{(i)}=\min_{s\ne r}\left\{d_{iC_{s}}\right\}\) is the smallest average dissimilarity of the ith object to any other cluster,
- \(d_{iC_{s}}=\frac{\sum_{j\in C_{s}}d_{ij}}{n_{s}}\) is the average dissimilarity of the ith object to all objects of cluster \(C_{s}\).
The number of clusters that maximizes the index is taken as the optimal number of clusters in the data. \(s_{(i)}\) is not defined for k = 1 (only one cluster).
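The definitions above translate directly into code. The following is a minimal sketch, not the implementation used in the paper. It assumes Euclidean distance for the dissimilarity d_ij, takes the overall index as the mean of s(i) over all objects (the usual convention), and assigns s(i) = 0 to an object that is alone in its cluster. The function name `silhouette_index` is hypothetical.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width s(i) over all objects.

    Assumptions (not taken from the paper): Euclidean dissimilarity,
    and s(i) = 0 for an object that is the only member of its cluster.
    """
    # Group object indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette index is undefined for a single cluster")

    def d(i, j):
        return math.dist(points[i], points[j])

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # singleton convention (assumption)
            continue
        # a(i): average dissimilarity to the other objects of the own cluster.
        a = sum(d(i, j) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(d(i, j) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    # Well-separated labelling scores close to 1; a mixed labelling scores below 0.
    print(silhouette_index(pts, [0, 0, 1, 1]))
    print(silhouette_index(pts, [0, 1, 0, 1]))
```

To choose the number of clusters, one would run the clustering for several candidate values of k (k ≥ 2) and keep the k whose labelling gives the largest index.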
Cite this article
Nikpour, S., Asadi, S. A dynamic hierarchical incremental learning-based supervised clustering for data stream with considering concept drift. J Ambient Intell Human Comput 13, 2983–3003 (2022). https://doi.org/10.1007/s12652-021-03673-0
DOI: https://doi.org/10.1007/s12652-021-03673-0