A dynamic hierarchical incremental learning-based supervised clustering for data stream with considering concept drift

Nikpour, Soheila; Asadi, Shahrokh

doi:10.1007/s12652-021-03673-0

Original Research
Published: 04 January 2022

Journal of Ambient Intelligence and Humanized Computing, Volume 13, pages 2983–3003 (2022)

Soheila Nikpour¹ &
Shahrokh Asadi²

Abstract

Clustering analysis is an important data mining method for data streams. Data stream clustering is a branch of clustering in which patterns are processed in an ordered sequence. It faces various challenges because the data arrive at high speed and in high volume, and are evolving, unstable, and unbounded. Over time, the input space of the data changes. This phenomenon is called concept drift, and investigating it is of high importance. In this paper, a new method called dynamic clustering of data streams with consideration of concept drift is developed, which is an incremental supervised clustering algorithm. In the proposed algorithm, the data stream is clustered automatically in a supervised manner, and clusters whose values decrease over time are identified and then eliminated. Moreover, the generated clusters can be used to classify unlabeled data. Experimental results on 15 UCI datasets show that the proposed method outperforms existing techniques.

References

Aggarwal CC, Yu Philip S, Han J, Wang J (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference. Elsevier, pp 81–92
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications (ACM)
Amini A, Wah TY, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29:116–141
Amini A, Saboohi H, Herawan T, Wah TY (2016) MuDi-Stream: a multi density clustering algorithm for evolving data stream. J Netw Comput Appl 59:370–385
Anvaripour M, Soltanpour S, Razavi-Far R, Saif M, Jonathan Wu QM (2016) A supervised cooperative clustering scheme for diagnosing process faults in an industrial plant. In: Evolutionary computation (CEC), 2016 IEEE congress on. IEEE, pp 160–67
Asadi S, Ehsan Roshan S (2021) A bi-objective optimization method to produce a near-optimal number of classifiers and increase diversity in Bagging. Knowl-Based Syst 213:106656
Asadi S, Shahrabi J (2017) Complexity-based parallel rule induction for multiclass classification. Inf Sci 380:53–73
Asadi S, Ehsan Roshan S, Kattan MW (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform 115:103690
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 1–16
Barddal JP, Gomes HM, Enembreck F, Barthès J-P (2016) SNCStream+: extending a high quality true anytime data stream clustering algorithm. Inf Syst 62:60–73
Baruah RD, Angelov P (2013) DEC: dynamically evolving clustering and its application to structure identification of evolving fuzzy models. IEEE Trans Cybern 44:1619–1631
Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58:180–204
Berkhin P (2006) A survey of clustering data mining techniques. Grouping multidimensional data. Springer, Berlin
Bezerra CG, Costa BSJ, Guedes LA, Angelov PP (2020) An evolving approach to data streams clustering based on typicality and eccentricity data analytics. Inf Sci 518:13–28
Bi X, Zhang C, Zhao X, Li D, Sun Y, Ma Y (2020) CODES: efficient incremental semi-supervised classification over drifting and evolving social streams. IEEE Access 8:14024–14035
Bones CC, Romani LAS, de Sousa EPM (2016) Improving multivariate data streams clustering. Procedia Comput Sci 80:461–471
Bouguelia M-R, Belaïd Y, Belaïd A (2013) An adaptive incremental clustering method based on the growing neural gas algorithm. In: 2nd international conference on pattern recognition applications and methods-ICPRAM 2013. SciTePress, pp 42–49
Bungkomkhun P, Auwatanamongkol S (2009) Grid-based supervised clustering-GBSC. World Acad Sci Eng Technol 60:536–543
Cai H, Liu B, Xiao Y, Yue Lin L (2020) Semi-supervised multi-view clustering based on orthonormality-constrained nonnegative matrix factorization. Inf Sci 536:171–184
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 133–42
Chen D, Yang Q, Liu J, Zeng Z (2020) Selective prototype-based learning on concept-drifting data streams. Inf Sci 516:20–32
De Andrade Silva J, Raul Hruschka E, Gama J (2017) An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Syst Appl 67:228–238
Din SU, Shao J, Kumar J, Ali W, Liu J, Ye Y (2020) Online reliable semi-supervised learning on evolving data streams. Inf Sci
Donyavi Z, Asadi S (2020) Using decomposition-based multi-objective evolutionary algorithm as synthetic example optimization for self-labeling. Swarm Evolut Comput 58:100736
Eick CF, Zeidat N, Zhao Z (2004a) Supervised clustering-algorithms and benefits. In: 16th IEEE international conference on tools with artificial intelligence. IEEE, pp 774–776
Eick CF, Zeidat N, Zhao Z (2004b) Supervised clustering-algorithms and benefits. In: Tools with artificial intelligence, 2004. ICTAI 2004. 16th IEEE international conference on. IEEE, pp 774–76
Erra U, Senatore S, Minnella F, Caggianese G (2015) Approximate TF–IDF based on topic extraction from massive message stream using the GPU. Inf Sci 292:143–161
Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Disc 26:1–26
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv (CSUR) 46:1–37
Georgieva O, Klawonn F (2008) Dynamic data assigning assessment clustering of streaming data. Appl Soft Comput 8:1305–1313
Ghesmoune M, Lebbah M, Azzag H (2016) A new growing neural gas for clustering data streams. Neural Netw 78:36–50
Guha S, Mishra N (2016) Clustering data streams. Data stream management. Springer, Berlin
Guo K, Zhang Q (2013) Fast clustering-based anonymization approaches with time constraints for data streams. Knowl-Based Syst 46:95–108
Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: Proceedings of the 24th international conference on machine learning, pp 345–352
Hamza H, Belaïd Y, Belaïd A, Baran Chaudhuri B (2008) Incremental classification of invoice documents. In: 19th international conference on pattern recognition-ICPR 2008. IEEE, p 4
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11:56–76
Hassani M, Spaus P, Medhat Gaber M, Seidl T (2012) Density-based projected clustering of data streams. In: International conference on scalable uncertainty management. Springer, pp 311–324
Jirayusakul A (2007) Supervised growing neural gas algorithm in clustering analysis
Islam MK, Ahmed MM, Zamli KZ (2019) A buffer-based online clustering for evolving data stream. Inf Sci 489:113–135
Kavitha M, Baby R (2017) Survey on micro clustering data streams using agglomerative approach. Int J Eng Sci 7:1–4
Khan I, Huang JZ, Ivanov K (2016) Incremental density-based ensemble clustering over evolving data streams. Neurocomputing 191:34–43
Kranen P, Assent I, Baldauf C, Seidl T (2009) Self-adaptive anytime stream clustering. In: Data mining, 2009. ICDM'09. Ninth IEEE international conference on. IEEE, pp 249–258
Li Y, Li D, Wang S, Zhai Y (2014) Incremental entropy-based clustering on categorical data streams with concept drift. Knowl-Based Syst 59:33–47
Li Z, Huang W, Xiong Y, Ren S, Zhu T (2020) Incremental learning imbalanced data streams with concept drift: the dynamic updated ensemble algorithm. Knowl-Based Syst 195:105694
Liu H, Fu Y (2015) Clustering with partition level side information. In: Data mining (ICDM), 2015 IEEE international conference on. IEEE, pp 877–882
Liu H, Shao M, Li S, Fu Y (2016) Infinite ensemble for image clustering. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1745–54
Lughofer E, Weigl E, Heidl W, Eitzinger C, Radauer T (2016) Recognizing input space and target concept drifts in data streams with scarcely labeled and unlabelled instances. Inf Sci 355:127–151
Mann AK, Kaur N (2013) Review paper on clustering techniques. Glob J Comput Sci Technol 13:1–7
Masmoudi N, Azzag H, Lebbah M, Bertelle C, Jemaa MB (2016) Cl-AntInc algorithm for clustering binary data streams using the ants behavior. Procedia Comput Sci 96:187–196
Michel V, Gramfort A, Varoquaux G, Eger E, Keribin C, Thirion B (2012) A supervised clustering approach for fMRI-based inference of brain states. Pattern Recogn 45:2041–2049
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, Burlington
Mohamad S, Bouchachia A (2020) Deep online hierarchical dynamic unsupervised learning for pattern mining from utility usage data. Neurocomputing 390:359–373
Mythily R, Banu A, Raghunathan S (2015) Clustering models for data stream mining. Procedia Comput Sci 46:619–626
O'Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings 18th international conference on data engineering. IEEE, pp 685–694
Otgonbayar A, Pervez Z, Dahal K, Eager S (2018) K-VARP: K-anonymity for varied data streams via partitioning. Inf Sci 467:238–255
Pan F, Wang W, Tung AKH, Yang J (2005a) Finding representative set from massive data. In: Fifth IEEE international conference on data mining (ICDM'05). IEEE, p 8
Pan F, Wang W, Tung AKH, Yang J (2005b) Finding representative set from massive data. In: Data mining, fifth IEEE international conference on. IEEE, p 8
Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63:528–549
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6:90–105
Pavithra M, Parvathi RMS (2017) A survey on clustering high dimensional data techniques. Int J Appl Eng Res 12:2893–2899
Peralta B, Caro A, Soto A (2016) A proposal for supervised clustering with Dirichlet process using labels. Pattern Recogn Lett 80:52–57
Ramírez-Gallego S, Krawczyk B, García S, Woźniak M, Herrera F (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239:39–57
Rehman M-Z, Li T, Yang Y, Wang H (2014) Hyper-ellipsoidal clustering technique for evolving data stream. Knowl-Based Syst 70:3–14
Ren Y, Kangrong Hu, Dai X, Pan L, Hoi SCH, Zenglin Xu (2019) Semi-supervised deep embedded clustering. Neurocomputing 325:121–130
Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20:615–627
Roshan SE, Asadi S (2021) Development of ensemble learning classification with density peak decomposition-based evolutionary multi-objective optimization. Int J Mach Learn Cybern 12:1737–1751
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Sakamoto Y, Fukui K-I, Gama J, Nicklas D, Moriyama K, Numao M (2015) Concept drift detection with clustering via statistical change detection methods. In: 2015 seventh international conference on knowledge and systems engineering (KSE). IEEE, pp 37–42
Shao M, Li S, Ding Z, Fu Y (2015) Deep linear coding for fast graph clustering. In: Twenty-fourth international joint conference on artificial intelligence, pp 3798–3804
Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems
Shindler M, Wong A, Meyerson AW (2011b) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems, pp 2375–2383
Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46:1–31
Śmieja M, Geiger BC (2017) Semi-supervised cross-entropy clustering with information bottleneck constraint. Inf Sci 421:254–271
Song G, Ye Y, Zhang H, Xiaofei X, Lau RYK, Liu F (2016) Dynamic clustering forest: an ensemble framework to efficiently classify textual data stream with concept drift. Inf Sci 357:125–143
Su Q, Chen L (2015) A method for discovering clusters of e-commerce interest patterns using click-stream data. Electron Commerce Res Appl 14:1–13
Sun J, Fujita H, Chen P, Li H (2017) Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble. Knowl-Based Syst 120:4–14
Tasoulis DK, Adams NM, Hand DJ (2006) Unsupervised clustering in streaming data. In: Data mining workshops, 2006. ICDM workshops 2006. Sixth IEEE international conference on. IEEE, pp 638–642
Toshniwal D (2013) Clustering techniques for streaming data-a survey. In: Advance computing conference (IACC), 2013 IEEE 3rd international. IEEE, pp 951–956
Treechalong K, Rakthanmanon T, Waiyamai K (2015) Semi-supervised stream clustering using labeled data points. In: International workshop on machine learning and data mining in pattern recognition. Springer, pp 281–295
Tu Q, Lu JF, Yuan B, Tang JB, Yang J-Y (2012) Density-based hierarchical clustering for streaming data. Pattern Recogn Lett 33:641–645
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: International conference on advanced data mining and applications. Springer, pp 605–615
Webb GI, Kuan Lee L, Petitjean F, Goethals B (2017) Understanding concept drift. arXiv preprint http://arxiv.org/abs/1704.00362
Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 3:408–421
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678
Xu S, Feng L, Liu S, Qiao H (2020) Self-adaption neighborhood density clustering method for mixed data stream with concept drift. Eng Appl Artif Intell 89:103451
Yan M, Wai M (2020) Accurate detecting concept drift in evolving data streams. ICT Express 6:332–338
Ye N, Li X (2001a) A scalable clustering technique for intrusion signature recognition. In: Proceedings of the 2001 IEEE workshop on information assurance and security, pp 5–6. Citeseer
Ye N, Li X (2001b) A scalable clustering technique for intrusion signature recognition. In: Proceedings of 2001 IEEE workshop on information assurance and security, pp 1–4. Citeseer
Zeidat N, Eick CF, Zhao Z (2005) Supervised clustering: algorithms and applications. University of Houston, Houston
Zheng L, Huo H, Guo Y, Fang T (2017) Supervised adaptive incremental clustering for data stream of chunks. Neurocomputing 219:502–517

Author information

Authors and Affiliations

Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran
Soheila Nikpour
Data Mining Laboratory, Department of Engineering, College of Farabi, University of Tehran, Tehran, Iran
Shahrokh Asadi

Corresponding author

Correspondence to Shahrokh Asadi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Silhouette Index

Kaufman and Rousseeuw (Rousseeuw 1987) introduced the Silhouette index, which is constructed to show graphically how well each object is classified in a given clustering output.

$$\mathrm{silhouette}=\frac{\sum_{i=1}^{n}s_{(i)}}{n},\qquad \mathrm{silhouette}\in[-1,1],$$

where

$s_{(i)}=\frac{b_{(i)}-a_{(i)}}{\max\{a_{(i)},\,b_{(i)}\}}$,
$a_{(i)}=\frac{\sum_{j\in C_r\setminus\{i\}}d_{ij}}{n_r-1}$ is the average dissimilarity of the $i$th object to all other objects of cluster $C_r$,
$b_{(i)}=\min_{s\ne r}\{d_{iC_s}\}$,
$d_{iC_s}=\frac{\sum_{j\in C_s}d_{ij}}{n_s}$ is the average dissimilarity of the $i$th object to all objects of cluster $C_s$.
The number of clusters that maximizes the index is taken as the optimal number of clusters in the data. $s_{(i)}$ is not defined for $k = 1$ (only one cluster).
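The definitions above can be turned into a short computation. The following is a minimal sketch, not the authors' implementation: it assumes Euclidean distance for $d_{ij}$, and the function name `silhouette_index` and its argument names are hypothetical. Setting $s_{(i)}$ to 0 for an object in a singleton cluster (where $n_r - 1 = 0$) or when both averages are 0 is a common convention assumed here rather than taken from the appendix.

```python
from math import dist


def silhouette_index(points, labels):
    """Mean of s(i) over all objects, with Euclidean distance as d_ij (assumption)."""
    # Group object indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette is not defined for a single cluster (k = 1)")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Assumed convention: a singleton cluster contributes s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): average dissimilarity to the other objects of the same cluster C_r.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster C_s, s != r.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # Assumed convention: identical points (a = b = 0) give s(i) = 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_index(pts, [0, 0, 1, 1]))  # well-separated labeling
    print(silhouette_index(pts, [0, 1, 0, 1]))  # labeling that mixes the two groups
```

Under these assumptions, a labeling that matches the two well-separated groups scores close to 1, while a labeling that mixes them scores below 0, which is the behavior used when comparing candidate numbers of clusters.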

About this article

Cite this article

Nikpour, S., Asadi, S. A dynamic hierarchical incremental learning-based supervised clustering for data stream with considering concept drift. J Ambient Intell Human Comput 13, 2983–3003 (2022). https://doi.org/10.1007/s12652-021-03673-0

Received: 20 January 2021
Accepted: 16 December 2021
Published: 04 January 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s12652-021-03673-0
