Self-adaption neighborhood density clustering method for mixed data stream with concept drift

https://doi.org/10.1016/j.engappai.2019.103451

Abstract

Clustering analysis is an important data mining method for data streams. In this paper, a self-adaption neighborhood density clustering method for mixed data streams is proposed. The method uses a significance metric criterion to convert categorical attribute values into numeric values, and then reduces the dimension of the data with a nonlinear dimensionality reduction method. In the clustering method, each point is evaluated by its neighborhood density. After k is determined according to rough set theory, the k points with maximum mutual distance are selected from the data set. In addition, a new similarity measure based on neighborhood entropy is presented. Each data point is assigned to the nearest cluster, and the algorithm adaptively adjusts the cluster center points according to the clustering error. The experimental results show that the proposed method obtains better clustering results than the comparison algorithms on most data sets, which indicates that the proposed algorithm is effective for data stream clustering.

Introduction

With the development of the information society, many fields such as online shopping, satellite remote sensing, weather forecasting and traffic flow monitoring continue to generate massive data streams. Unlike traditional static data, a data stream is typically unbounded, arrives rapidly and exhibits concept drift, all of which pose considerable challenges for data stream mining (Babcock et al., 2002; Krishnaswamy, 2005; Xu and Wang, 2017; Xu and Wang, 2016). In practical applications, the data to be analyzed is often unlabeled, and obtaining class labels for a data stream is costly. It is therefore valuable to develop data stream clustering algorithms. Data stream clustering has gained wide attention from researchers and there have been many remarkable achievements (Silva et al., 2013; Ding et al., 2015; Kaur et al., 2015). Up to now, most existing data stream clustering algorithms can be classified into two types according to the type of data set: clustering algorithms for categorical data and clustering algorithms for numerical data. For categorical data sets, the simple matching distance is used to measure the similarity of two data points. Bai et al. (2016) propose an optimization model for clustering categorical data streams; the algorithm continuously updates the clustering parameters to make the error between the current clustering model and the previous clustering model as small as possible, and the parameters can be solved by the EM algorithm. To detect concept drift, a new measure is defined; if its value is greater than a predefined threshold, concept drift is detected. Jiang and Brice (2009) present a Context-Trees algorithm for categorical data streams; the algorithm extends existing clustering techniques for static categorical data sets to predictive models of data streams based on variable length Markov models of clusters. The stored clusters, along with the distributional information, can be used to create and analyze aggregated clusters over user-specified time intervals to detect changes in data streams. Maji and Pal (2007) develop a rough-fuzzy C-Medoids algorithm for amino acid sequence analysis; it uses rough fuzzy sets to decide which cluster a data point belongs to, and the decision is derived from the upper and lower approximation sets. Cao et al. (2010) propose a framework for clustering categorical time-evolving data; the algorithm employs the uncertainty of rough sets to define the membership function of a fuzzy set, and the distance between a data point and a cluster and the distance between two clusters are also defined based on the membership function. In addition, the algorithm can create a graph to visualize concept drift. Li et al. (2014) propose an incremental entropy clustering algorithm for categorical data streams; the dissimilarity between a data point and a cluster is determined by incremental entropy; the algorithm can autonomously determine the threshold of the dissimilarity distance, which is an advantage of this approach; both concept drift and outliers can be detected by the algorithm.
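As a small illustration of the simple matching distance mentioned above, the following sketch computes a normalized variant: the fraction of attributes on which two categorical records disagree. The function name and the example records are hypothetical, and normalizing by the number of attributes is an assumption (some authors use the raw mismatch count instead).

```python
def simple_matching_distance(x, y):
    """Fraction of attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if not x:
        return 0.0
    # Count positions where the categorical values differ.
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Hypothetical records with three categorical attributes each.
p = ("cough", "yes", "mild")
q = ("cough", "no", "severe")
d = simple_matching_distance(p, q)  # two of the three attributes differ
```

Because every mismatch contributes equally, this distance ignores how informative each attribute value is, which is one reason a different measure may be preferred for mixed data.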

For the problem of clustering numeric data streams, there is much related work. Shindler et al. (2011) propose a fast and accurate k-means algorithm for large data sets; when handling a data point, the distance from the current point to the nearest cluster center is computed to determine whether to generate a new cluster or to assign the point to the nearest cluster; the greater the distance, the higher the probability of generating a new cluster; once the number of clusters exceeds a given value, the parameters are adjusted to reduce the number of clusters. Cao et al. (2006) present a density-based clustering algorithm called DenStream. DenStream is a development of the DBSCAN algorithm; whether a data point forms an outlier cluster or belongs to a micro cluster is decided from the statistical information of the current data; outdated clusters are deleted from the system and a new cluster can be generated if the discriminant condition is satisfied. Chen and Tu (2007) propose a density-based clustering method for real-time data streams called D-Stream. D-Stream integrates density-based and grid-based clustering; the grids are divided into three types: dense grids, sparse grids and other grids. The three types of grid can be converted into each other according to density; finally, clusters are generated by merging grids and deleting outdated grids. Hahsler and Bolanos (2016) develop a shared density clustering algorithm called DBStream; for a data point, if the number of its neighbors is lower than a threshold, the data point is treated as an outlier cluster; otherwise it updates the micro clusters of the neighbors and the micro clusters of shared regions; then, in the offline phase, the shared density graph is constructed and the data set is clustered again. Zhang et al. (2014a) present a data stream clustering algorithm with affinity propagation called STRAP; for each cluster, the algorithm uses a four-tuple to represent the statistical information of the cluster; first, STRAP uses the AP algorithm (Frey and Dueck, 2007) to produce initial clusters; after receiving a data point, STRAP selects the cluster with the minimum distance to the data point; if the distance is less than the predefined threshold, the data point is assigned to that cluster; otherwise the data point is added to the buffer; concept drift is detected by the PH test; if the PH assumption is violated, concept drift has happened and the clustering model is updated.
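To make the drift test in the last example more concrete, here is a minimal sketch of a Page-Hinkley change detector, assuming that "PH test" refers to the Page-Hinkley test. It is a generic textbook form, not STRAP's actual implementation: the class name, the parameters `delta` and `threshold`, and the choice of monitored statistic are all assumptions made for illustration.

```python
class PageHinkley:
    """Generic Page-Hinkley detector for an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change (assumed value)
        self.threshold = threshold  # alarm threshold, often written lambda (assumed value)
        self.n = 0                  # number of observations seen so far
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # smallest cumulative deviation seen, M_t

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


# Hypothetical monitored statistic: stable at 0.0, then jumping to 1.0.
detector = PageHinkley()
stable_alarms = [detector.update(0.0) for _ in range(100)]
shifted_alarms = [detector.update(1.0) for _ in range(20)]
```

A larger `threshold` lowers the false alarm rate at the cost of a longer detection delay, so in practice both parameters would need tuning to the stream at hand.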

From the above summary, it is clear that the applicability of these clustering algorithms is restricted because each of them can only deal with a single type of data stream. Mixed data streams are common in practical applications. For example, in medical data analysis, some biochemical indexes such as ALT, PCT and WBC are measured by numerical values, while some clinical symptoms such as cough, headache and palpitation are expressed by categorical values. Moreover, the simple matching distance or the Minkowski distance cannot effectively measure the similarity of data points in a mixed data stream. Therefore, a self-adaption neighborhood density clustering method for mixed data streams with concept drift (SNDC) is proposed in this paper. SNDC first employs a significance criterion to evaluate categorical attribute values and converts them into numeric values; then a nonlinear dimensionality reduction algorithm based on neighborhood similarity is presented to reduce the complexity of the data set. In the clustering phase, the cluster center points are automatically adjusted according to the clustering result. The weight of each cluster decays with time, and outdated clusters are deleted from the system. In SNDC, concept drift is detected by the similarity of the clusters of adjacent data blocks. The main contributions of this paper are as follows:

  • A mapping method which converts categorical attribute values into numeric values is introduced. After executing the method, categorical attribute values are replaced by their significance values, which ensures that the data can be further processed.

  • In order to reduce the complexity of the data set, a nonlinear dimensionality reduction method based on neighborhood similarity is presented. Neighborhood similarity decreases the effect of the geometric spatial structure on similarity measurement.

  • A self-adaption neighborhood density clustering method is proposed. The method can automatically select the best initial cluster centers. The center points can be adjusted according to the clustering result. By comparing the similarity of the clusters of adjacent data blocks, concept drift can be detected and a series of measures are taken to adapt to the new concept in the data stream.
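The idea of choosing k initial centers that are far apart from one another can be sketched with a greedy farthest-first traversal. This is a swapped-in, generic technique and not the paper's procedure: SNDC evaluates points by neighborhood density, uses its own neighborhood-entropy-based distance and determines k from rough set theory, none of which are reproduced here. The function name, the use of Euclidean distance, the arbitrary starting point and the toy data are all assumptions.

```python
import math


def farthest_first_centers(points, k):
    """Greedily pick k points so that each new one is far from those already picked."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: start from the first point; any starting rule could be used.
    centers = [points[0]]
    # Distance from every point to its nearest chosen center so far.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all chosen centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [min(d, math.dist(p, points[idx])) for d, p in zip(min_dist, points)]
    return centers


# Hypothetical two-dimensional data with three visible groups.
data = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 8.0)]
initial_centers = farthest_first_centers(data, 3)
```

Farthest-first selection is sensitive to outliers, which is one plausible motivation for also taking the density around each candidate point into account.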

The rest of this paper is organized as follows: Section 2 reviews background knowledge and briefly introduces the data stream model, rough sets and neighborhood rough sets; Section 3 introduces a nonlinear dimensionality reduction algorithm based on neighborhood similarity; Section 4 explains the details of the SNDC method in the clustering phase; in Section 5, experiments are performed to show the effectiveness of the SNDC algorithm; Section 6 concludes the paper and gives some future research directions.

Backgrounds

In this section, we introduce the basic model of a data stream and then explain some fundamental concepts of rough sets and neighborhood rough sets. The uncertainty measures of rough sets and neighborhood rough sets are also presented.

The nonlinear dimensionality reduction method for mixed data

In this section, we introduce a mapping method which converts categorical attribute values into numeric attribute values; then a nonlinear dimensionality reduction method based on neighborhood similarity is presented to reduce the dimension of the data.

The clustering processes of SNDC for mixed data stream

In this section, we explain the detailed principles of SNDC. First, a new distance based on neighborhood entropy is defined to measure the distance between two objects. Then a self-adaption neighborhood density clustering method for mixed data streams is proposed, and a concept drift detection method is also presented.
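The idea of detecting concept drift by comparing the clusters of adjacent data blocks can be illustrated with the following sketch. It is not the SNDC detection rule, which this snippet does not spell out: the dissimilarity used here (mean Euclidean distance from each current cluster center to its nearest previous center), the function names and the threshold value are all assumptions chosen for illustration.

```python
import math


def block_dissimilarity(prev_centers, curr_centers):
    """Mean distance from each current center to its nearest previous center."""
    if not prev_centers or not curr_centers:
        raise ValueError("both blocks must contain at least one cluster center")
    nearest = [min(math.dist(c, p) for p in prev_centers) for c in curr_centers]
    return sum(nearest) / len(nearest)


def drift_detected(prev_centers, curr_centers, threshold):
    """Signal drift when adjacent blocks have clearly different cluster centers."""
    return block_dissimilarity(prev_centers, curr_centers) > threshold


# Hypothetical cluster centers from three consecutive data blocks.
block_1 = [(0.0, 0.0), (5.0, 5.0)]
block_2 = [(0.1, 0.0), (5.0, 5.1)]   # nearly the same clusters as block_1
block_3 = [(3.0, 0.0), (5.0, 9.0)]   # clusters have moved noticeably
same_concept = drift_detected(block_1, block_2, threshold=1.0)
new_concept = drift_detected(block_2, block_3, threshold=1.0)
```

When drift is signalled, a stream clusterer would typically rebuild or re-initialize its model on the newest block; which measures are taken in SNDC is described in the full paper rather than in this snippet.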

The experimental results and analyses

In order to test the performance of SNDC, a series of data sets are chosen as the experimental data sets. The waveform and hyperplane data sets are generated by MOA (Bifet et al., 2010) and the other data sets are from the UCI Machine Learning Repository. DenStream (Cao et al., 2006), Streaming k-service (Braverman et al., 2011) and k-means (Jiawei et al., 2012) with sliding window mechanism which is also denoted as k-means are chosen

Conclusions

In this paper, a self-adaption neighborhood density clustering method for mixed data streams (SNDC) is proposed. In SNDC, the categorical attributes are mapped to numeric attributes; then a nonlinear dimensionality reduction method based on neighborhood similarity is presented to reduce the dimension of the data set. In the clustering process, a neighborhood distance is defined to measure the similarity of data points. SNDC can automatically adjust center points according to clustering

CRediT authorship contribution statement

Shuliang Xu: Methodology, Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Lin Feng: Methodology, Supervision, Formal analysis, Funding acquisition. Shenglan Liu: Formal analysis, Data curation, Supervision, Validation. Hong Qiao: Formal analysis, Data curation, Supervision, Validation.

Acknowledgments

This work was supported by National Key Research and Development Program of China (Nos. 2017YFB1300200, 2017YFB1300203), National Natural Science Fund of China (Nos. 61972064, 61672130, 61602082, 61627808, 91648205), Open Program of State Key Laboratory of Software Architecture, China (No. SKLSAOP1701), Liaoning Revitalization Talents Program, China (No. XLYC1806006), Fundamental Research Funds for the Central Universities, China (Nos. DUT19RC(3)012, DUT17RC(3)071) and the development of

References (37)

  • Zhang, J., et al., 2014. Neighborhood rough sets for dynamic data mining. Inform. Sci.
  • Zhang, J., et al., 2016. Efficient parallel boolean matrix based algorithms for computing composite rough set approximations. Inform. Sci.
  • Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In: ACM...
  • Bai, L., et al., 2016. An optimization model for clustering categorical data streams with drifting concepts. IEEE Trans. Knowl. Data Eng.
  • Belkin, M., et al., 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst.
  • Bifet, A., et al., 2010. MOA: Massive online analysis. J. Mach. Learn. Res.
  • Braverman, V., Meyerson, A., Ostrovsky, R., Roytman, A., Shindler, M., Tagiku, B., 2011. Streaming k-means on...
  • Cao, F., Ester, M., Qian, W., Zhou, A., 2006. Density-based clustering over an evolving data stream with noise. In:...
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103451.