Self-adaption neighborhood density clustering method for mixed data stream with concept drift☆
Introduction
With the development of the information society, many fields such as online shopping, satellite remote sensing, weather forecasting and traffic flow monitoring continuously generate massive data streams. Unlike traditional static data, a data stream is typically unbounded, arrives rapidly and exhibits concept drift, which makes data stream mining highly challenging (Babcock et al., 2002, Krishnaswamy, 2005, Xu and Wang, 2017, Xu and Wang, 2016). In practical applications, the data to be analyzed is often unlabeled, and obtaining class labels for a data stream is costly. It is therefore valuable to develop data stream clustering algorithms. Data stream clustering has attracted wide attention from researchers and has produced many remarkable achievements (Silva et al., 2013, Ding et al., 2015, Kaur et al., 2015). To date, most existing data stream clustering algorithms fall into two groups according to the type of data they handle: algorithms for categorical data and algorithms for numerical data. For categorical data, the simple matching distance is used to measure the similarity of two data points. Bai et al. (2016) propose an optimization model for clustering categorical data streams; the algorithm continuously updates the clustering parameters so that the error between the current clustering model and the previous one is as small as possible, and the parameters can be solved by the EM algorithm. To detect concept drift, a new measure is defined; concept drift is reported when the value of this measure exceeds a predefined threshold. Jiang and Brice (2009) present a Context-Trees algorithm for categorical data streams; the algorithm extends existing clustering techniques for static categorical data sets to predictive models of data streams based on variable-length Markov models of clusters.
The stored clusters, along with the distributional information, can be used to create and analyze aggregated clusters over user-specified time intervals in order to detect changes in the data stream. Maji and Pal (2007) develop a rough-fuzzy C-Medoids algorithm for amino acid sequence analysis; it uses rough-fuzzy sets to decide which cluster a data point belongs to, and the decision is derived from the upper and lower approximation sets. Cao et al. (2010) propose a framework for clustering categorical time-evolving data; the algorithm employs the uncertainty of rough sets to define the membership function of a fuzzy set, and the distance between a data point and a cluster and the distance between two clusters are both defined from this membership function. In addition, the algorithm can create a graph to visualize concept drift. Li et al. (2014) propose an incremental entropy clustering algorithm for categorical data streams; the dissimilarity between a data point and a cluster is determined by incremental entropy, and the algorithm determines the dissimilarity threshold autonomously, which is one of its advantages; it can detect both concept drift and outliers.
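As an illustration of the simple matching distance mentioned above for categorical data, the following is a minimal sketch. It assumes the normalized form (the fraction of attributes on which two records disagree); some authors use the raw mismatch count instead. The function name and the sample records are hypothetical and are not taken from any of the cited algorithms.

```python
from typing import Hashable, Sequence


def simple_matching_distance(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Fraction of attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if not x:
        return 0.0
    # Count positions where the categorical values differ.
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Hypothetical records with three categorical attributes each.
a = ("cough", "yes", "mild")
b = ("cough", "no", "mild")
d_ab = simple_matching_distance(a, b)  # the records differ on one of three attributes
```

Because every mismatch counts equally, this distance ignores how informative each attribute value is, which is one reason it fits mixed data poorly.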
For clustering numerical data streams, there is also much related work. Shindler et al. (2011) propose a fast and accurate k-means algorithm for large data sets; when handling a data point, the distance from the point to the nearest cluster center is computed to decide whether to generate a new cluster or to assign the point to the nearest cluster; the greater the distance, the higher the probability of generating a new cluster; once the number of clusters exceeds a given value, the parameters are adjusted to reduce the number of clusters. Cao et al. (2006) present a density-based clustering algorithm called DenStream. DenStream builds on the DBSCAN algorithm; whether a data point forms an outlier cluster or belongs to a micro-cluster is decided from the statistical information of the current data; outdated clusters are deleted from the system, and a new cluster is generated if the discriminant condition is satisfied. Chen and Tu (2007) propose a density-based clustering method for real-time data streams called D-Stream. D-Stream integrates density-based and grid-based clustering; grids are divided into three types: dense grids, sparse grids and other grids. A grid can change from one type to another as its density changes; finally, clusters are generated by merging grids and deleting outdated grids. Hahsler and Bolanos (2016) develop a shared-density clustering algorithm called DBStream; for a data point, if the number of neighbors is lower than a threshold, the data point is treated as an outlier cluster; otherwise the micro-clusters of its neighbors and of the shared regions are updated; then, in the offline phase, the shared-density graph is constructed and the data set is clustered again. Zhang et al. (2014a) present a data stream clustering algorithm with affinity propagation called STRAP; for each cluster, the algorithm uses a four-tuple to represent the statistical information of the cluster; first, STRAP uses the AP algorithm (Frey and Dueck, 2007) to produce initial clusters; when a new data point arrives, STRAP selects the cluster closest to it; if the distance is less than a predefined threshold, the data point is assigned to that cluster; otherwise the data point is added to a buffer; concept drift is detected by the PH test; if the PH assumption is violated, concept drift has happened and the clustering model is updated.
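To make the drift check at the end of the STRAP description more concrete, here is a minimal sketch of a change detector, assuming "PH test" refers to the Page-Hinkley test. The class name, the parameters `delta` and `threshold`, their default values and the synthetic stream are illustrative assumptions; the quantity STRAP actually monitors and its settings are not given in this text.

```python
class PageHinkley:
    """Page-Hinkley detector for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0):
        self.delta = delta          # tolerated magnitude of change (assumed value)
        self.threshold = threshold  # alarm threshold, often written lambda (assumed value)
        self.n = 0                  # number of observations seen so far
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_t
        self.min_cum = 0.0          # smallest cumulative deviation seen so far, M_t

    def update(self, x: float) -> bool:
        """Consume one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold


# Synthetic stream: the monitored value jumps from 0.0 to 1.0 at index 50.
stream = [0.0] * 50 + [1.0] * 50
detector = PageHinkley()
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
```

In this sketch no alarm is raised while the stream is stationary, and the first alarm appears a few observations after the change point; a larger `threshold` delays detection but reduces false alarms.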
From the above summary, it is clear that algorithms limited to a single type of data stream restrict the applicability of clustering. Mixed data streams are common in practice. For example, in medical data analysis, biochemical indexes such as ALT, PCT and WBC are measured as numerical values, while clinical symptoms such as cough, headache and palpitation are expressed as categorical values. Moreover, neither the simple matching distance nor the Minkowski distance can effectively measure the similarity of data points in a mixed data stream. Therefore, a self-adaption neighborhood density clustering method for mixed data streams with concept drift (SNDC) is proposed in this paper. SNDC first employs a significance criterion to evaluate categorical attribute values and convert them to numeric values; then a nonlinear dimensionality reduction algorithm based on neighborhood similarity is presented to reduce the complexity of the data set. In the clustering phase, the cluster centers are adjusted automatically according to the clustering result. The weight of each cluster decays with time, and outdated clusters are deleted from the system. In SNDC, concept drift is detected from the similarity of the clusters of adjacent data blocks. The main contributions of this paper are as follows:
- A mapping method that converts categorical attribute values to numeric values is introduced. After the mapping, categorical attribute values are replaced by their significance values, which ensures that the data can be processed further.
- To reduce the complexity of the data set, a nonlinear dimensionality reduction method based on neighborhood similarity is presented. Neighborhood similarity decreases the effect of geometric spatial structure on similarity measurement.
- A self-adaption neighborhood density clustering method is proposed. The method automatically selects the best initial cluster centers, and the centers are adjusted according to the clustering result. By comparing the similarity of the clusters of adjacent data blocks, concept drift can be detected, and a series of measures are taken to adapt to the new concept in the data stream.
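The time-based weight decay and deletion of outdated clusters described in the overview above can be illustrated with a minimal sketch. The exponential fading function `2 ** (-decay_rate * elapsed)` is borrowed from the style used in DenStream-like methods and is an assumption here, as are the `Cluster` fields, the function name `decay_and_prune` and the parameter values; the actual decay rule of SNDC is not specified in this text.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    center: tuple        # cluster center coordinates
    weight: float        # current (decayed) weight of the cluster
    last_update: float   # time at which the weight was last refreshed


def decay_and_prune(clusters, now, decay_rate=0.1, min_weight=0.05):
    """Fade every cluster weight to time `now` and drop clusters that became too light."""
    kept = []
    for c in clusters:
        elapsed = now - c.last_update
        # Assumed fading function: the weight halves every 1 / decay_rate time units.
        w = c.weight * 2.0 ** (-decay_rate * elapsed)
        if w >= min_weight:
            kept.append(Cluster(c.center, w, now))
    return kept


# Two hypothetical clusters last refreshed at time 0; the light one is pruned at time 10.
clusters = [
    Cluster((0.0, 0.0), 1.0, last_update=0.0),
    Cluster((5.0, 5.0), 0.06, last_update=0.0),
]
survivors = decay_and_prune(clusters, now=10.0)
```

With these assumed settings, a cluster that receives no new points loses half of its weight every ten time units, so stale concepts gradually disappear from the model.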
The rest of this paper is organized as follows: Section 2 reviews background knowledge and briefly introduces the data stream model, rough sets and neighborhood rough sets; Section 3 introduces a nonlinear dimensionality reduction algorithm based on neighborhood similarity; Section 4 explains the details of the SNDC method in the clustering phase; Section 5 presents experiments that show the effectiveness of the SNDC algorithm; Section 6 concludes the paper and gives some future research directions.
Section snippets
Backgrounds
In this section, we introduce the basic model of a data stream and then explain some fundamental concepts of rough sets and neighborhood rough sets. The uncertainty measures of rough sets and neighborhood rough sets are also presented.
The nonlinear dimensionality reduction method for mixed data
In this section, we introduce a mapping method that converts categorical attribute values to numeric values; then a nonlinear dimensionality reduction method based on neighborhood similarity is presented to reduce the dimensionality of the data.
The clustering processes of SNDC for mixed data stream
In this section, we explain the principles of SNDC in detail. First, a new distance based on neighborhood entropy is defined to measure the distance between two objects. Then a self-adaption neighborhood density clustering method for mixed data streams is proposed, and a concept drift detection method is also presented.
The experimental results and analyses
To test the performance of SNDC, a series of data sets are chosen as experimental data sets. The waveform and hyperplane data sets are generated by MOA (Bifet et al., 2010), and the other data sets are from the UCI Machine Learning Repository. DenStream (Cao et al., 2006), Streaming k-service (Braverman et al., 2011) and k-means (Jiawei et al., 2012) with a sliding window mechanism, which is also denoted as k-means, are chosen
Conclusions
In this paper, a self-adaption neighborhood density clustering method for mixed data streams (SNDC) is proposed. In SNDC, categorical attributes are mapped to numeric attributes; then a nonlinear dimensionality reduction method based on neighborhood similarity is presented to reduce the dimensionality of the data set. During clustering, a neighborhood distance is defined to measure the similarity of data points. SNDC can automatically adjust center points according to clustering
CRediT authorship contribution statement
Shuliang Xu: Methodology, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Lin Feng: Methodology, Supervision, Formal analysis, Funding acquisition. Shenglan Liu: Formal analysis, Data curation, Supervision, Validation. Hong Qiao: Formal analysis, Data curation, Supervision, Validation.
Acknowledgments
This work was supported by National Key Research and Development Program of China (Nos. 2017YFB1300200, 2017YFB1300203), National Natural Science Fund of China (Nos. 61972064, 61672130, 61602082, 61627808, 91648205), Open Program of State Key Laboratory of Software Architecture, China (No. SKLSAOP1701), Liaoning Revitalization Talents Program, China (No. XLYC1806006), Fundamental Research Funds for the Central Universities, China (Nos. DUT19RC(3)012, DUT17RC(3)071) and the development of
References (37)
- et al. An entropy-based uncertainty measurement approach in neighborhood systems. Inform. Sci. (2014)
- et al. Neighborhood rough set based heterogeneous feature subset selection. Inform. Sci. (2008)
- et al. Incremental entropy-based clustering on categorical data streams with concept drift. Knowl.-Based Syst. (2014)
- et al. A new measure of uncertainty based on knowledge granulation for rough sets. Inform. Sci. (2009)
- et al. Approaches to knowledge reduction based on variable precision rough set model. Inform. Sci. (2004)
- et al. Minimal decision cost reduct in fuzzy decision-theoretic rough set model. Knowl.-Based Syst. (2017)
- et al. Rough set methods in feature selection and recognition. Pattern Recognit. Lett. (2003)
- et al. A fast incremental extreme learning machine algorithm for data streams classification. Expert Syst. Appl. (2016)
- et al. Dynamic extreme learning machine for data stream classification. Neurocomputing (2017)
- et al. Covering based rough set approximations. Inform. Sci. (2012)
- Neighborhood rough sets for dynamic data mining. Inform. Sci.
- Efficient parallel boolean matrix based algorithms for computing composite rough set approximations. Inform. Sci.
- An optimization model for clustering categorical data streams with drifting concepts. IEEE Trans. Knowl. Data Eng.
- Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst.
- MOA: Massive online analysis. J. Mach. Learn. Res.
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103451.