A dynamic hierarchical incremental learning-based supervised clustering for data stream with considering concept drift

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

Clustering analysis is an important data mining method for data streams. Data stream clustering is a branch of clustering in which patterns are processed in an ordered sequence. It faces various challenges because the data are high-speed, high-volume, evolving, unstable, and unbounded. Over time, the distribution of the data in the input space changes. This phenomenon is called concept drift, and handling it is of high importance. In this paper, a new method called dynamic clustering of data stream with considering concept drift is developed, which is an incremental supervised clustering algorithm. In the proposed algorithm, the data stream is automatically clustered in a supervised manner, and clusters whose values decrease over time are identified and then eliminated. Moreover, the generated clusters can be used to classify unlabeled data. Experimental results on 15 UCI datasets show that the proposed method outperforms the existing techniques.
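The abstract does not spell out the algorithm, so the following is only a generic illustration of the ideas it names (incremental supervised clustering, removal of clusters whose value fades, and classification of unlabeled data). It is not the authors' method: the class name, the `radius`, `decay`, and `min_weight` parameters, the exponential decay rule, and the Euclidean distance are all assumptions made for this sketch, and the hierarchical part of the method is not modeled.

```python
from math import dist  # Euclidean distance (Python 3.8+)


class DecayingSupervisedClusters:
    """Toy labelled-centroid model with exponentially decaying cluster weights."""

    def __init__(self, radius=1.0, decay=0.9, min_weight=0.1):
        self.radius = radius          # hypothetical: max distance to join a cluster
        self.decay = decay            # hypothetical: weight decay applied per sample
        self.min_weight = min_weight  # hypothetical: pruning threshold
        self.clusters = []            # dicts with center, label, weight, count

    def learn_one(self, x, label):
        # Age every cluster, then drop those whose weight has faded away.
        for c in self.clusters:
            c["weight"] *= self.decay
        self.clusters = [c for c in self.clusters if c["weight"] >= self.min_weight]

        # Only clusters carrying the same label may absorb the new sample.
        same = [c for c in self.clusters if c["label"] == label]
        nearest = min(same, key=lambda c: dist(c["center"], x), default=None)
        if nearest is not None and dist(nearest["center"], x) <= self.radius:
            nearest["count"] += 1
            n = nearest["count"]
            # Incremental mean update of the centroid.
            nearest["center"] = tuple(
                m + (v - m) / n for m, v in zip(nearest["center"], x)
            )
            nearest["weight"] += 1.0
        else:
            self.clusters.append(
                {"center": tuple(x), "label": label, "weight": 1.0, "count": 1}
            )

    def predict_one(self, x):
        # Classify an unlabeled point by the label of the nearest surviving cluster.
        if not self.clusters:
            return None
        return min(self.clusters, key=lambda c: dist(c["center"], x))["label"]
```

In this sketch a cluster that stops receiving samples loses weight geometrically and is eventually pruned, which is one simple way a model could follow a drifting input distribution.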

References

  • Aggarwal CC, Yu PS, Han J, Wang J (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference. Elsevier, pp 81–92

  • Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications (ACM)

  • Amini A, Wah TY, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29:116–141

  • Amini A, Saboohi H, Herawan T, Wah TY (2016) MuDi-Stream: a multi density clustering algorithm for evolving data stream. J Netw Comput Appl 59:370–385

  • Anvaripour M, Soltanpour S, Razavi-Far R, Saif M, Jonathan Wu QM (2016) A supervised cooperative clustering scheme for diagnosing process faults in an industrial plant. In: Evolutionary computation (CEC), 2016 IEEE congress on. IEEE, pp 160–67

  • Asadi S, Ehsan Roshan S (2021) A bi-objective optimization method to produce a near-optimal number of classifiers and increase diversity in Bagging. Knowl-Based Syst 213:106656

  • Asadi S, Shahrabi J (2017) Complexity-based parallel rule induction for multiclass classification. Inf Sci 380:53–73

  • Asadi S, Ehsan Roshan S, Kattan MW (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform 115:103690

  • Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 1–16

  • Barddal JP, Gomes HM, Enembreck F, Barthès J-P (2016) SNCStream+: extending a high quality true anytime data stream clustering algorithm. Inf Syst 62:60–73

  • Baruah RD, Angelov P (2013) DEC: dynamically evolving clustering and its application to structure identification of evolving fuzzy models. IEEE Trans Cybern 44:1619–1631

  • Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58:180–204

  • Berkhin P (2006) A survey of clustering data mining techniques. Grouping multidimensional data. Springer, Berlin

  • Bezerra CG, Costa BSJ, Guedes LA, Angelov PP (2020) An evolving approach to data streams clustering based on typicality and eccentricity data analytics. Inf Sci 518:13–28

  • Bi X, Zhang C, Zhao X, Li D, Sun Y, Ma Y (2020) CODES: efficient incremental semi-supervised classification over drifting and evolving social streams. IEEE Access 8:14024–14035

  • Bones CC, Romani LAS, de Sousa EPM (2016) Improving multivariate data streams clustering. Procedia Comput Sci 80:461–471

  • Bouguelia M-R, Belaïd Y, Belaïd A (2013) An adaptive incremental clustering method based on the growing neural gas algorithm. In: 2nd international conference on pattern recognition applications and methods-ICPRAM 2013. SciTePress, pp 42–49

  • Bungkomkhun P, Auwatanamongkol S (2009) Grid-based supervised clustering-GBSC. World Acad Sci Eng Technol 60:536–543

  • Cai H, Liu B, Xiao Y, Yue Lin L (2020) Semi-supervised multi-view clustering based on orthonormality-constrained nonnegative matrix factorization. Inf Sci 536:171–184

  • Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 133–42

  • Chen D, Yang Q, Liu J, Zeng Z (2020) Selective prototype-based learning on concept-drifting data streams. Inf Sci 516:20–32

  • De Andrade Silva J, Raul Hruschka E, Gama J (2017) An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Syst Appl 67:228–238

  • Din SU, Shao J, Kumar J, Ali W, Liu J, Ye Y (2020) Online reliable semi-supervised learning on evolving data streams. Inf Sci

  • Donyavi Z, Asadi S (2020) Using decomposition-based multi-objective evolutionary algorithm as synthetic example optimization for self-labeling. Swarm Evolut Comput 58:100736

  • Eick CF, Zeidat N, Zhao Z (2004a) Supervised clustering-algorithms and benefits. In: 16th IEEE international conference on tools with artificial intelligence. IEEE, pp 774–776

  • Eick CF, Zeidat N, Zhao Z (2004b) Supervised clustering-algorithms and benefits. In: Tools with artificial intelligence, 2004. ICTAI 2004. 16th IEEE international conference on. IEEE, pp 774–76

  • Erra U, Senatore S, Minnella F, Caggianese G (2015) Approximate TF–IDF based on topic extraction from massive message stream using the GPU. Inf Sci 292:143–161

  • Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Disc 26:1–26

  • Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv (CSUR) 46:1–37

  • Georgieva O, Klawonn F (2008) Dynamic data assigning assessment clustering of streaming data. Appl Soft Comput 8:1305–1313

  • Ghesmoune M, Lebbah M, Azzag H (2016) A new growing neural gas for clustering data streams. Neural Netw 78:36–50

  • Guha S, Mishra N (2016) Clustering data streams. Data stream management. Springer, Berlin

  • Guo K, Zhang Q (2013) Fast clustering-based anonymization approaches with time constraints for data streams. Knowl-Based Syst 46:95–108

  • Haider P, Brefeld U, Scheffer T (2007) Supervised clustering of streaming data for email batch detection. In: Proceedings of the 24th international conference on machine learning, pp 345–352

  • Hamza H, Belaïd Y, Belaïd A, Baran Chaudhuri B (2008) Incremental classification of invoice documents. In: 19th international conference on pattern recognition-ICPR 2008. IEEE, p 4

  • Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11:56–76

  • Hassani M, Spaus P, Medhat Gaber M, Seidl T (2012) Density-based projected clustering of data streams. In: International conference on scalable uncertainty management. Springer, pp 311–324

  • Jirayusakul A (2007) Supervised growing neural gas algorithm in clustering analysis

  • Islam MK, Ahmed MM, Zamli KZ (2019) A buffer-based online clustering for evolving data stream. Inf Sci 489:113–135

  • Kavitha M, Baby R (2017) Survey on micro clustering data streams using agglomerative approach. Int J Eng Sci 7:1–4

  • Khan I, Huang JZ, Ivanov K (2016) Incremental density-based ensemble clustering over evolving data streams. Neurocomputing 191:34–43

  • Kranen P, Assent I, Baldauf C, Seidl T (2009) Self-adaptive anytime stream clustering. In: Data mining, 2009. ICDM'09. Ninth IEEE international conference on. IEEE, pp 249–258

  • Li Y, Li D, Wang S, Zhai Y (2014) Incremental entropy-based clustering on categorical data streams with concept drift. Knowl-Based Syst 59:33–47

  • Li Z, Huang W, Xiong Y, Ren S, Zhu T (2020) Incremental learning imbalanced data streams with concept drift: the dynamic updated ensemble algorithm. Knowl-Based Syst 195:105694

  • Liu H, Fu Y (2015) Clustering with partition level side information. In: Data mining (ICDM), 2015 IEEE international conference on. IEEE, pp 877–882

  • Liu H, Shao M, Li S, Fu Y (2016) Infinite ensemble for image clustering. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1745–54

  • Lughofer E, Weigl E, Heidl W, Eitzinger C, Radauer T (2016) Recognizing input space and target concept drifts in data streams with scarcely labeled and unlabelled instances. Inf Sci 355:127–151

  • Mann AK, Kaur N (2013) Review paper on clustering techniques. Glob J Comput Sci Technol 13:1–7

  • Masmoudi N, Azzag H, Lebbah M, Bertelle C, Jemaa MB (2016) Cl-AntInc algorithm for clustering binary data streams using the ants behavior. Procedia Comput Sci 96:187–196

  • Michel V, Gramfort A, Varoquaux G, Eger E, Keribin C, Thirion B (2012) A supervised clustering approach for fMRI-based inference of brain states. Pattern Recogn 45:2041–2049

  • Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, Burlington

  • Mohamad S, Bouchachia A (2020) Deep online hierarchical dynamic unsupervised learning for pattern mining from utility usage data. Neurocomputing 390:359–373

  • Mythily R, Banu A, Raghunathan S (2015) Clustering models for data stream mining. Procedia Comput Sci 46:619–626

  • O'Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings 18th international conference on data engineering. IEEE, pp 685–694

  • Otgonbayar A, Pervez Z, Dahal K, Eager S (2018) K-VARP: K-anonymity for varied data streams via partitioning. Inf Sci 467:238–255

  • Pan F, Wang W, Tung AKH, Yang J (2005a) Finding representative set from massive data. In: Fifth IEEE international conference on data mining (ICDM'05). IEEE, p 8

  • Pan F, Wang W, Tung AKH, Yang J (2005b) Finding representative set from massive data. In: Data mining, fifth IEEE international conference on. IEEE, p 8

  • Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63:528–549

  • Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6:90–105

  • Pavithra M, Parvathi RMS (2017) A survey on clustering high dimensional data techniques. Int J Appl Eng Res 12:2893–2899

  • Peralta B, Caro A, Soto A (2016) A proposal for supervised clustering with Dirichlet process using labels. Pattern Recogn Lett 80:52–57

  • Ramírez-Gallego S, Krawczyk B, García S, Woźniak M, Herrera F (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239:39–57

  • Rehman M-Z, Li T, Yang Y, Wang H (2014) Hyper-ellipsoidal clustering technique for evolving data stream. Knowl-Based Syst 70:3–14

  • Ren Y, Hu K, Dai X, Pan L, Hoi SCH, Xu Z (2019) Semi-supervised deep embedded clustering. Neurocomputing 325:121–130

  • Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20:615–627

  • Roshan SE, Asadi S (2021) Development of ensemble learning classification with density peak decomposition-based evolutionary multi-objective optimization. Int J Mach Learn Cybern 12:1737–1751

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Sakamoto Y, Fukui K-I, Gama J, Nicklas D, Moriyama K, Numao M (2015) Concept drift detection with clustering via statistical change detection methods. In: 2015 seventh international conference on knowledge and systems engineering (KSE). IEEE, pp 37–42

  • Shao M, Li S, Ding Z, Fu Y (2015) Deep linear coding for fast graph clustering. In: Twenty-fourth international joint conference on artificial intelligence, pp 3798–3804

  • Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems

  • Shindler M, Wong A, Meyerson AW (2011b) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems, pp 2375–2383

  • Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46:1–31

  • Śmieja M, Geiger BC (2017) Semi-supervised cross-entropy clustering with information bottleneck constraint. Inf Sci 421:254–271

  • Song G, Ye Y, Zhang H, Xu X, Lau RYK, Liu F (2016) Dynamic clustering forest: an ensemble framework to efficiently classify textual data stream with concept drift. Inf Sci 357:125–143

  • Su Q, Chen L (2015) A method for discovering clusters of e-commerce interest patterns using click-stream data. Electron Commerce Res Appl 14:1–13

  • Sun J, Fujita H, Chen P, Li H (2017) Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble. Knowl-Based Syst 120:4–14

  • Tasoulis DK, Adams NM, Hand DJ (2006) Unsupervised clustering in streaming data. In: Data mining workshops, 2006. ICDM workshops 2006. Sixth IEEE international conference on. IEEE, pp 638–642

  • Toshniwal D (2013) Clustering techniques for streaming data-a survey. In: Advance computing conference (IACC), 2013 IEEE 3rd international. IEEE, pp 951–956

  • Treechalong K, Rakthanmanon T, Waiyamai K (2015) Semi-supervised stream clustering using labeled data points. In: International workshop on machine learning and data mining in pattern recognition. Springer, pp 281–295

  • Tu Q, Lu JF, Yuan B, Tang JB, Yang J-Y (2012) Density-based hierarchical clustering for streaming data. Pattern Recogn Lett 33:641–645

  • Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: International conference on advanced data mining and applications, pp 605–615. Springer

  • Webb GI, Kuan Lee L, Petitjean F, Goethals B (2017) Understanding concept drift. arXiv preprint http://arxiv.org/abs/1704.00362

  • Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 3:408–421

  • Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678

  • Xu S, Feng L, Liu S, Qiao H (2020) Self-adaption neighborhood density clustering method for mixed data stream with concept drift. Eng Appl Artif Intell 89:103451

  • Yan M, Wai M (2020) Accurate detecting concept drift in evolving data streams. ICT Express 6:332–338

  • Ye N, Li X (2001a) A scalable clustering technique for intrusion signature recognition. In: Proceedings of the 2001 IEEE workshop on information assurance and security, pp 5–6. Citeseer

  • Ye N, Li X (2001b) A scalable clustering technique for intrusion signature recognition. In: Proceedings of 2001 IEEE workshop on information assurance and security, pp 1–4. Citeseer

  • Zeidat N, Eick CF, Zhao Z (2005) Supervised clustering: algorithms and applications. University of Houston, Houston

  • Zheng L, Huo H, Guo Y, Fang T (2017) Supervised adaptive incremental clustering for data stream of chunks. Neurocomputing 219:502–517


Author information

Corresponding author

Correspondence to Shahrokh Asadi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Silhouette Index

Kaufman and Rousseeuw (Rousseeuw 1987) introduced the Silhouette index, which shows graphically how well each object fits the cluster it was assigned to in a given clustering output. It is the mean of the per-object silhouette values:

$$\mathrm{silhouette}=\frac{1}{n}\sum_{i=1}^{n}s(i),\qquad \mathrm{silhouette}\in[-1,1],$$

where

  • \(s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}\),

  • \(a(i)=\frac{\sum_{j\in C_r\setminus\{i\}}d_{ij}}{n_r-1}\) is the average dissimilarity of the ith object to all other objects of its own cluster \(C_r\),

  • \(b(i)=\min_{s\ne r}\{d_{iC_s}\}\),

  • \(d_{iC_s}=\frac{\sum_{j\in C_s}d_{ij}}{n_s}\) is the average dissimilarity of the ith object to all objects of cluster \(C_s\).

The number of clusters k that maximizes the index is taken as the optimal number of clusters in the data. The value s(i) is not defined for k = 1 (only one cluster).
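The definitions above can be turned into a short, dependency-free sketch. It assumes Euclidean distance as the dissimilarity d_ij (the appendix does not fix one), and it assigns s(i) = 0 to objects in singleton clusters or when both a(i) and b(i) are zero; the function name `silhouette` and its arguments are chosen here for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+), assumed as d_ij


def silhouette(points, labels):
    """Mean of s(i) over all objects; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette is not defined for k = 1")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumption: singleton clusters contribute s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): average dissimilarity to the other members of the own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette(pts, [0, 0, 1, 1]))  # well-separated grouping, close to 1
    print(silhouette(pts, [0, 1, 0, 1]))  # poor grouping, negative
```

To pick the number of clusters as described, one would evaluate this function on the clustering obtained for each candidate k ≥ 2 and keep the k with the largest value.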

About this article

Cite this article

Nikpour, S., Asadi, S. A dynamic hierarchical incremental learning-based supervised clustering for data stream with considering concept drift. J Ambient Intell Human Comput 13, 2983–3003 (2022). https://doi.org/10.1007/s12652-021-03673-0

  • DOI: https://doi.org/10.1007/s12652-021-03673-0
