A clustering and ensemble based classifier for data stream classification

https://doi.org/10.1016/j.asoc.2020.107076

Highlights

  • This paper presents a hybrid method that uses both supervised and unsupervised learning.

  • An ensemble method is used to handle the huge volume of data in a stream.

  • A grid and density-based clustering method is used as the base learner.

  • A divide and merge method is used to improve accuracy.

  • The proposed method handles both gradual and abrupt concept drift; it focuses mainly on accuracy while requiring comparable time and memory.

Abstract

In the era of data mining, data stream mining has attracted great attention from researchers because it affects a wide range of applications such as networking, telecommunication, education, banking, weather forecasting, and the stock market. Handling concept-drifting data streams is one of the major challenges in the data stream mining field: in the presence of concept drift, the performance of a learning algorithm degrades. This paper proposes a hybrid method that combines an ensemble with grid and density-based clustering. The proposed method is tested on both synthetic and real data. It works well in the presence of concept drift, and its performance is measured in terms of time, accuracy, and memory. Compared with state-of-the-art algorithms, the proposed method performed well and gave better accuracy: 88.29%, 71.34%, and 75.39% on the synthetic datasets Hyperplane, RBF, and LED, respectively, and 86.17%, 86.28%, 95.15%, and 99.83% on the real datasets Adult, Census-Income, KDDCup99-10%, and Covertype, respectively.

Introduction

A huge amount of data is generated rapidly in today’s growing world, and handling and analyzing such data are challenging tasks. Data stream mining has received great attention in the research field because a tremendous amount of data is generated by applications and industries such as networking, finance, the stock market, education, telecommunication, healthcare, weather forecasting, and many more. Researchers have focused on data stream mining issues such as scanning data in a single pass, classifying such huge amounts of data, selecting a suitable algorithm, concept drift, performance in terms of time, accuracy, and memory, and the learning approach. To solve these issues, many authors have been attracted to this field and have proposed a number of methods [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19].

In data stream mining, standard approaches cannot be applied directly because new difficulties keep arising during the evaluation of data streams. Single-scan learning is of great importance when learning from data streams, which are generated continuously, in huge amounts, and at high speed. During classification, each sample must be scanned only once and processed with limited time and memory. If data samples cannot be processed in a single pass, the time and memory cost grows, which results in degradation of accuracy [20].
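The single-scan constraint can be illustrated with a test-then-train loop, in which each sample is used once for prediction, once for updating the model, and is then discarded. The sketch below is only an illustration under stated assumptions: the function name `prequential_accuracy` and the majority-class learner are hypothetical stand-ins chosen for brevity and are not the method proposed in this paper.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def prequential_accuracy(stream: Iterable[Tuple[object, Hashable]]) -> float:
    """Process (features, label) pairs in one pass and return the accuracy.

    The learner is a placeholder (assumption): it always predicts the class
    seen most often so far, so its memory use is one counter per class.
    """
    class_counts: Counter = Counter()
    correct = 0
    seen = 0
    for _features, label in stream:
        # Test first: predict before the true label is used for learning.
        prediction = class_counts.most_common(1)[0][0] if class_counts else None
        correct += int(prediction == label)
        seen += 1
        # Then train: update the model; the sample itself is not stored.
        class_counts[label] += 1
    return correct / seen if seen else 0.0
```

Because the loop keeps only per-class counts, memory stays bounded no matter how long the stream runs; a real base learner would replace the counter while keeping the same test-then-train order.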

Data stream classification is done through two kinds of methods: single classification and ensemble classification. The single classification method is fast and needs less memory for computation, but its performance decreases as unknown patterns or samples increase. Conversely, an ensemble method requires more time and memory but performs well in the presence of unknown patterns. In the ensemble method, the output combines the predictions of the different classifiers. For data stream handling, most researchers use the ensemble method because it is easy to implement, can handle different types of data, and, most importantly, performs well [21], [22], [23]. Bagging and boosting are types of ensemble methods. Most researchers have tried ensemble methods for data stream classification using chunk-based [21], [24], [25] and windowing-based [26], [27] approaches for learning. These approaches play an important role during the evaluation process. Concept drift detection and handling are major issues in the data stream mining field. Concept drift means that the concept underlying the data changes over time. Most researchers have tried to solve concept drift problems with the help of different methods [22], [28], [29], [30], [31].
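As a minimal illustration of how an ensemble turns the predictions of its members into one output, the sketch below uses a weighted vote. The function name `weighted_vote`, the representation of classifiers as plain callables, and the weights are assumptions made for this example; the paper's own combination rule is not given in this section.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Sequence


def weighted_vote(
    classifiers: Sequence[Callable[[object], Hashable]],
    weights: Sequence[float],
    sample: object,
) -> Hashable:
    """Return the label with the largest total weight among the members."""
    if not classifiers or len(classifiers) != len(weights):
        raise ValueError("need one weight per classifier and at least one classifier")
    scores: Dict[Hashable, float] = defaultdict(float)
    for classifier, weight in zip(classifiers, weights):
        # Each member votes for its predicted label with its own weight.
        scores[classifier(sample)] += weight
    return max(scores, key=scores.get)
```

With equal weights this reduces to a simple majority vote; boosting-style ensembles typically give larger weights to members that have been more accurate.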

This paper proposes a hybrid ensemble clustering method that combines ensemble boosting, grid and density-based clustering, and a divide and merge method for improving performance. In the proposed approach, boosting is used for handling a large amount of data, although it requires more processing time than the bagging method. The boosting method works iteratively to improve accuracy. Within the boosting ensemble, grid and density-based clustering is used as the base learner. Grid and density-based clustering can work in the presence of huge data, but it requires more time. The divide and merge method is used to improve accuracy further. The proposed hybrid method extends the work in [32] and is specially designed for the classification of a data stream.

The proposed hybrid algorithm works as follows:

  • a. In the first part, the incoming data is divided into a number of sliding windows, and every window is processed by the proposed hybrid method.

  • b. In the second part, the grid and density-based clustering algorithm is used as the base classifier for the boosting ensemble. Every data sample is mapped onto the grid, and according to the mapping, the grid cells are divided into dense grids and sparse grids. Based on this, the performance of the algorithm is calculated.

  • c. In the last part, the divide and merge method is used for performance improvement. In this method, the formed clusters are merged and divided again according to the density of the data samples.

  • d. The ensemble method plays a vital role in the data stream classification process. It requires more time and memory than other algorithms, but it is focused on accuracy improvement. In addition, grid and density-based clustering can handle a huge amount of data. The results show that the proposed hybrid method improves accuracy while keeping time and memory moderate compared with other algorithms.
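Steps a and b can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the windows are taken as consecutive fixed-size chunks, every grid cell has the same width in each dimension, and a cell counts as dense when it holds at least `density_threshold` samples. The names `windows`, `to_cell`, `split_cells`, `cell_width`, and `density_threshold` are hypothetical.

```python
import math
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Cell = Tuple[int, ...]
Sample = Sequence[float]


def windows(stream: Iterable[Sample], size: int) -> Iterator[List[Sample]]:
    """Yield consecutive windows of at most `size` samples from the stream."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(stream)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return
        yield window


def to_cell(sample: Sample, cell_width: float) -> Cell:
    """Map a sample to the integer coordinates of the grid cell containing it."""
    return tuple(int(math.floor(value / cell_width)) for value in sample)


def split_cells(
    window: Iterable[Sample], cell_width: float, density_threshold: int
) -> Tuple[Dict[Cell, int], Dict[Cell, int]]:
    """Count samples per cell and separate dense cells from sparse ones."""
    counts = Counter(to_cell(sample, cell_width) for sample in window)
    dense = {cell: n for cell, n in counts.items() if n >= density_threshold}
    sparse = {cell: n for cell, n in counts.items() if n < density_threshold}
    return dense, sparse
```

Each window is summarized by per-cell counts only, so the cost of a window depends on the number of occupied cells rather than on the raw samples once they have been mapped.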

The paper is organized as follows. Section 2 describes the literature survey, Section 3 covers the proposed method, Section 4 presents the experimental work and results, Section 5 discusses the results, and Section 6 concludes the paper.

Section snippets

Literature survey

The research community has done a great deal of work to solve the issues of data stream classification using supervised and unsupervised methods. These works are discussed as follows,

Methodology

In this paper, we propose a semi-supervised hybrid method that uses an ensemble method, grid and density-based clustering, and a divide and merge method. In the proposed ensemble, grid and density-based clustering is used as the base learner. The divide and merge method is used to improve accuracy and to handle concept drift.
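A common step in grid-based clustering in general is to join neighbouring dense cells into one cluster. The sketch below shows that generic step only; it is not the authors' divide and merge procedure, whose details are not given in this snippet. The function name `merge_dense_cells` and the choice to treat diagonal cells as neighbours are assumptions.

```python
from collections import deque
from itertools import product
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, ...]


def merge_dense_cells(dense_cells: Iterable[Cell]) -> List[Set[Cell]]:
    """Group dense cells into clusters of mutually adjacent cells.

    All cells are assumed to have the same number of dimensions. Two cells
    are adjacent when every coordinate differs by at most one.
    """
    remaining = set(dense_cells)
    clusters: List[Set[Cell]] = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            # Visit every neighbouring position, including diagonal ones.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    cluster.add(neighbour)
                    queue.append(neighbour)
        clusters.append(cluster)
    return clusters
```

The breadth-first search visits each dense cell once, so the work grows with the number of dense cells and the number of neighbour positions per cell, which itself grows quickly with the dimensionality of the grid.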

Experimental setup and results

The proposed hybrid method and state-of-the-art algorithms are evaluated using both synthetic and real datasets. The experiments were performed on a 1.90 GHz Intel i3 processor with 8 GB of main memory running Ubuntu 14.04, and the proposed algorithm was implemented in Java. The real datasets are taken from the UCI machine learning repository [77]. The performance of the proposed method has been compared with OzaBag [34], OzaBoost [34], OCBoost [35],

Discussion

The proposed hybrid method uses an ensemble classifier with grid and density-based clustering as the base learner, together with a divide and merge method. The divide and merge method is used for performance improvement. The proposed method is designed to improve accuracy while keeping time and memory moderate, and it was implemented accordingly. The results show that the proposed algorithm works well in terms of accuracy but requires moderate

Conclusion

In this paper, we have proposed a hybrid, i.e., semi-supervised, method. We studied a number of supervised and unsupervised methods and, based on this study, tried to overcome issues of data stream mining. The proposed hybrid method uses ensemble boosting with grid and density-based clustering as the base learner, together with a divide and merge method that improves performance. The experimental results show that the proposed hybrid method works well in the presence of both

CRediT authorship contribution statement

Kapil K. Wankhade: Conceptualization, Methodology, Software, Writing - original draft, Data curation. Kalpana C. Jondhale: Writing - reviewing and editing, Supervision. Snehlata S. Dongre: Software, Validation, Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (79)

  • Gama J. et al., Decision trees for mining data streams, Intell. Data Anal. (2006)
  • Gao J. et al., On appropriate assumptions to mine data streams: analysis and practice
  • Han J. et al., Data Mining: Concepts and Techniques (2006)
  • Pfahringer B. et al., New options for Hoeffding trees
  • Gama J., A survey on learning from data streams: Current and future trends, Prog. Artif. Intell. (2012)
  • Grossi V. et al., Stream mining: A novel architecture for ensemble-based classification, Knowl. Inform. Syst. (2012)
  • Brzezinski D. et al., Reacting to different types of concept drift: The accuracy updated ensemble algorithm, IEEE Trans. Neural Netw. Learn. Syst. (2014)
  • Bose R. et al., Dealing with concept drifts in process mining, IEEE Trans. Neural Netw. Learn. Syst. (2014)
  • Kuncheva L. et al., PCA feature extraction for change detection in multidimensional unlabeled data, IEEE Trans. Neural Netw. Learn. Syst. (2014)
  • Pratama M. et al., PANFIS: A novel incremental learning machine, IEEE Trans. Neural Netw. Learn. Syst. (2014)
  • Kasabov N., Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning, IEEE Trans. Syst. Man, Cybern. B (2001)
  • Faisal Mustafa Amir et al., Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study, IEEE Syst. J. (2015)
  • Domingos P. et al., Mining high-speed data streams, Knowl. Discov. Data Mining (2000)
  • H. Wang, W. Fan, V. Yu, J. Han, Mining concept-drifting data streams using ensemble classifiers, in: ACM SIGKDD,...
  • Bifet A. et al., New ensemble methods for evolving data streams
  • Attar V. et al., A fast and light classifier for data streams, Springer’s Evolv. Syst. (2010)
  • Masud M. et al., A multi-partition multi-chunk ensemble technique to classify concept drifting data streams
  • Masud Mohammad M. et al., Classification and novel class detection in concept-drifting data streams under time constraints, IEEE Trans. Knowl. Data Eng. (2011)
  • Widmer G. et al., Learning in the presence of concept drift and hidden contexts, Mach. Learn. (1996)
  • A. Bifet, R. Gavalda, Learning from time-changing data with adaptive windowing, in: Proc. SIAM Int. Conf. Data Mining,...
  • Hulten G. et al., Mining time-changing data streams
  • Fan W. et al., Decision tree evolution using limited number of labeled data items from drifting data streams
  • Vivekanandan P. et al., Mining rules of concept drift using genetic algorithm, J. Artif. Intell. Soft Comput. Res. (2011)
  • Wankhade K. et al., A hybrid approach for classification of rare class data, Springer’s Knowl. Inf. Syst. (2018)
  • Kolter J. et al., Dynamic weighted majority: a new ensemble method for tracking concept drift, J. Mach. Learn. Res. (2007)
  • Oza N. et al., Experimental comparisons of online and batch versions of bagging and boosting
  • Pelossof R. et al., Online coordinate boosting (2008)
  • Zliobaite I., Ensemble learning for concept drift handling - the role of new expert
  • Sun Yu et al., Online ensemble learning of data streams with gradually evolved classes, IEEE Trans. Knowl. Data Eng. (2016)