A clustering and ensemble based classifier for data stream classification
Introduction
Huge amounts of data are being generated rapidly in today’s growing world, and handling and analyzing such volumes are among the most challenging tasks. Data stream mining has received great attention in the research field because a tremendous amount of data is generated by applications and industries such as networking, finance, the stock market, education, telecommunication, healthcare, weather forecasting and many more. The research community has paid particular attention to data stream mining issues such as scanning data in a single pass, classifying such large volumes of data, selecting a suitable algorithm, concept drift, the learning approach, and performance in terms of time, accuracy, and memory. To solve these issues, many authors have been attracted to this field and have proposed a number of methods [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19].
Standard approaches cannot be applied directly in data stream mining because new difficulties keep arising in the evaluation of data streams. Single-scan learning is of great importance when learning from data streams. Data streams are generated continuously, in huge amounts, and at high speed. During classification, each sample must be scanned only once, using limited time and memory. If data samples cannot be processed in a single pass, time and memory costs grow, which results in degraded accuracy [20].
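The single-pass, bounded-memory constraint can be illustrated with a small sketch. The code below is not part of the proposed method; it only shows, under our own assumptions, how a statistic can be maintained while each sample is seen once and then discarded. The names `RunningStats` and `summarize` are hypothetical, and the update rule is Welford's method for the running mean and variance.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass mean and variance (Welford's method) in O(1) memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Each sample is seen exactly once and is not stored afterwards.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the samples seen so far.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume any iterable of numbers once; the samples are never buffered."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because the state is three numbers regardless of stream length, the memory cost does not grow with the amount of data, which is the property the single-scan requirement asks for.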
Data stream classification is done through two methods, namely single classification and ensemble classification. The single classification method is fast and needs less memory for computation, but its performance decreases as unknown patterns or samples increase. Conversely, an ensemble method requires more time and memory but performs well in the presence of unknown patterns. In the ensemble method, the output is produced by combining the predictions of the different classifiers. Most researchers use the ensemble method for data stream handling because it is easy to implement, can handle different types of data, and, most importantly, performs well [21], [22], [23]. Bagging and boosting are types of ensemble methods. Most researchers have applied ensemble methods to data stream classification using chunk-based [21], [24], [25] and windowing-based [26], [27] learning approaches. These approaches play an important role during the evaluation process. Detecting and handling concept drift are major issues in data stream mining. Concept drift means that the concept underlying the data changes over time. Many researchers have tried to solve concept drift problems with different methods [22], [28], [29], [30], [31].
This paper proposes a hybrid ensemble clustering method that combines ensemble boosting, grid- and density-based clustering, and a divide and merge method for improving performance. In the proposed approach, boosting is used for handling a large amount of data, although it requires more processing time than the bagging method. The boosting method works iteratively to improve accuracy. Within the boosting ensemble, grid- and density-based clustering is used as the base learner. Grid- and density-based clustering can work in the presence of huge data, but it requires more time. In addition, the divide and merge method is used to improve accuracy. The proposed hybrid method extends the work in [32] and is specially designed for the classification of a data stream.
The proposed hybrid algorithm works as follows:
- a. In the first part, the incoming data is divided into a number of sliding windows, and every window is processed by the proposed hybrid method.
- b. In the second part, the grid- and density-based clustering algorithm is used as the base classifier for the boosting ensemble. Every data sample is mapped onto the grid and, according to this mapping, the grid is divided into dense grids and sparse grids. Based on this, the performance of the algorithm is calculated.
- c. In the last part, the divide and merge method is used for performance improvement. In this method, the formed clusters are merged and divided again according to the density of the data samples.
- d. The ensemble method plays a vital role in the data stream classification process. It requires more time and memory than other algorithms but is focused on accuracy improvement. In addition, grid- and density-based clustering can handle a huge amount of data. The results show that the proposed hybrid method improves accuracy while keeping time and memory moderate compared with other algorithms.
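The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the windows here are consecutive and non-overlapping, and the cell width `width`, the density threshold `min_pts`, and the function names `windows`, `cell_of` and `split_cells` are all hypothetical choices that the text does not specify.

```python
from collections import Counter
from math import floor
from typing import Iterable, Iterator, List, Sequence, Set, Tuple

Point = Sequence[float]
Cell = Tuple[int, ...]


def windows(stream: Iterable[Point], size: int) -> Iterator[List[Point]]:
    """Cut the stream into consecutive fixed-size windows (last may be shorter)."""
    buf: List[Point] = []
    for p in stream:
        buf.append(p)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def cell_of(p: Point, width: float) -> Cell:
    """Map a sample to the integer coordinates of the grid cell containing it."""
    return tuple(floor(v / width) for v in p)


def split_cells(window: List[Point], width: float,
                min_pts: int) -> Tuple[Set[Cell], Set[Cell]]:
    """Count samples per cell, then split occupied cells into dense and sparse."""
    counts = Counter(cell_of(p, width) for p in window)
    dense = {c for c, n in counts.items() if n >= min_pts}
    sparse = set(counts) - dense
    return dense, sparse
```

Each window is processed on its own, so memory use is bounded by the window size and the number of occupied cells rather than by the length of the stream.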
The rest of the paper is organized as follows. Section 2 presents the literature survey, Section 3 describes the proposed method, Section 4 reports the experimental work and results, Section 5 discusses the results, and Section 6 concludes the paper.
Literature survey
A lot of work has been done by the research community to solve the issues of data stream classification using supervised and unsupervised methods. This work is discussed as follows,
Methodology
In this paper, we propose a semi-supervised hybrid method that uses an ensemble method, grid- and density-based clustering, and a divide and merge method. In the proposed ensemble method, grid- and density-based clustering is used as the base learner. The divide and merge method is used to improve accuracy as well as to handle concept drift.
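As a rough illustration of how dense grid cells can be merged into clusters, the sketch below groups cells that touch each other. It is only one common way of doing this in grid- and density-based clustering and is an assumption on our part; the merging and dividing criteria of the proposed method are not given in this excerpt, and the dividing step is not shown. The function name `merge_dense_cells` and the adjacency rule (coordinates differing by at most one on every axis) are hypothetical.

```python
from itertools import product
from typing import Dict, Set, Tuple

Cell = Tuple[int, ...]


def merge_dense_cells(dense: Set[Cell]) -> Dict[Cell, int]:
    """Label each dense cell with a cluster id; touching cells share an id."""
    labels: Dict[Cell, int] = {}
    next_id = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_id
        stack = [start]
        while stack:
            cell = stack.pop()
            # Visit every cell whose coordinates differ by at most 1 per axis.
            for step in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + s for c, s in zip(cell, step))
                if nb in dense and nb not in labels:
                    labels[nb] = next_id
                    stack.append(nb)
        next_id += 1
    return labels
```

The neighbour scan visits 3^d candidate cells for d-dimensional data, so this simple form is only practical for low-dimensional grids.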
Experimental setup and results
The proposed hybrid method and state-of-the-art algorithms are evaluated using both synthetic and real datasets. The experiments were performed on a 1.90 GHz Intel i3 processor with 8 GB of main memory running Ubuntu 14.04, and the proposed algorithm was implemented in the Java programming language. The real datasets are taken from the UCI machine learning repository archives [77]. The performance of the proposed method has been compared with OzaBag [34], OzaBoost [34], OCBoost [35],
Discussion
The proposed hybrid method uses an ensemble classifier with grid- and density-based clustering as the base learner, together with a divide and merge method. The divide and merge method is used for performance improvement. The proposed method is designed to improve accuracy while keeping time and memory moderate, and we have implemented it accordingly. The results show that the proposed algorithm works well in terms of accuracy but requires moderate
Conclusion
In this paper, we have proposed a hybrid method, i.e., a semi-supervised method. We have studied a number of supervised and unsupervised methods. Based on this study, we have tried to overcome issues of data stream mining. The proposed hybrid method uses the ensemble boosting method with grid- and density-based clustering as the base learner, together with a divide and merge method that improves performance. The experimental results show that the proposed hybrid method works well in the presence of both
CRediT authorship contribution statement
Kapil K. Wankhade: Conceptualization, Methodology, Software, Writing - original draft, Data curation. Kalpana C. Jondhale: Writing - reviewing and editing, Supervision. Snehlata S. Dongre: Software, Validation, Visualization, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (79)
- et al.
Handling drifts and shifts in on-line data streams with evolving fuzzy systems
Appl. Soft Comput.
(2011)
- et al.
Evolving fuzzy classifiers using different model architectures
Fuzzy Sets and Systems
(2008)
- et al.
Ambiguous decision trees for mining concept-drifting data streams
Pattern Recognit. Lett.
(2009)
- et al.
Mining frequent itemsets over data streams using efficient window sliding techniques
J. Expert Syst. Appl.
(2009)
- et al.
A framework for clustering evolving data streams
Data Streams: Models and Algorithms
(2007)
Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams
(2010)
- et al.
Data Stream Mining: A Practical Approach, technical report
(2009)
- W. Fan, Y. Huang, H. Wang, P.S. Yu, Active mining of data streams, in: Proc. SIAM Int’l Conf. Data Mining, SDM ’04,...
- et al.
Mining data streams: A review
ACM SIGMOD Rec.
(2005)
Decision trees for mining data streams
Intell. Data Anal.
On appropriate assumptions to mine data streams: analysis and practice
Data Mining: Concepts and Techniques
New options for hoeffding trees
A survey on learning from data streams: Current and future trends
Prog. Artif. Intell.
Stream mining: A novel architecture for ensemble-based classification
Knowl. Inform. Syst.
Reacting to different types of concept drift: The accuracy updated ensemble algorithm
IEEE Trans. Neural Netw. Learn. Syst.
Dealing with concept drifts in process mining
IEEE Trans. Neural Netw. Learn. Syst.
PCA feature extraction for change detection in multidimensional unlabeled data
IEEE Trans. Neural Netw. Learn. Syst.
PANFIS: A novel incremental learning machine
IEEE Trans. Neural Netw. Learn. Syst.
Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning
IEEE Trans. Syst. Man, Cybern. B
Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study
IEEE Syst. J.
Mining high-speed data streams
Knowl. Discov. Data Mining
New ensemble methods for evolving data streams
A fast and light classifier for data streams
Springer’s Evolv. Syst.
A multi-partition multi-chunk ensemble technique to classify concept drifting data streams
Classification and novel class detection in concept-drifting data streams under time constraints
IEEE Trans. Knowl. Data Eng.
Learning in the presence of concept drift and hidden contexts
Mach. Learn.
Mining time-changing data streams
Decision tree evolution using limited number of labeled data items from drifting data streams
Mining rules of concept drift using genetic algorithm
J. Artif. Intell. Soft Comput. Res.
A hybrid approach for classification of rare class data
Springer’s Knowl. Inf. Syst.
Dynamic weighted majority: a new ensemble method for tracking concept drift
J. Mach. Learn. Res.
Experimental comparisons of online and batch versions of bagging and boosting
Online coordinate boosting
Ensemble learning for concept drift handling- the role of new expert
Online ensemble learning of data streams with gradually evolved classes
IEEE Trans. Knowl. Data Eng.