Robust clustering by identifying the veins of clusters based on kernel density estimation
Introduction
Clustering is a major tool of data mining for uncovering the potential patterns of data and extracting the information implied in them. In cluster analysis, a group of objects is divided into several subgroups mainly based on their similarities, such that the objects within a subgroup are similar to each other and the subgroups are well separated [1]. To date, clustering has been widely used in different fields and disciplines such as image processing [2], [3], community detection [4], [5], [6], microbiology [7], [8], genetics [9], etc. Meanwhile, many kinds of clustering algorithms have been developed, which can be roughly categorized into distance-based and density-based methods [10].
The varied distance-based clustering strategies can be further divided into partition, hierarchical and hybrid methods [11]. K-means, for instance, is a representative partition-based algorithm proposed by MacQueen in 1967. It assigns points to the nearest centers and iteratively refines the centers until the sum of squared errors converges to a minimum. However, like other centroid-based methods, K-means cannot detect arbitrarily shaped clusters because of its spherical classification strategy [11]. Spectral clustering algorithms, such as NJW [12], are another type of partition-based method. NJW computes the K eigenvectors corresponding to the K largest eigenvalues of the similarity matrix and completes the clustering in the resulting K-dimensional space. It has lower time complexity than traditional clustering algorithms, but its performance is not satisfactory when K is large. Affinity Propagation (AP) is a message-passing partition method different from those above. It defines two quantities, the responsibility r(i, k) and the availability a(i, k), and updates them iteratively for each sample during the clustering process. In the end, the sample k that maximizes their sum is regarded as the exemplar (cluster center) of point i. AP is more robust and accurate than k-centers methods in most cases, but its time complexity is much higher. Agglomerative Nesting (AGNES) [13] is a typical hierarchical clustering strategy. It successively merges the most similar clusters until the specified number of clusters is obtained. AGNES easily reveals the hierarchical relations between classes, but at the expense of efficiency. Similar to AGNES, BIRCH [14], proposed by Zhang et al., is another classical hierarchical method. It scans the data set only once to construct a CF tree describing the hierarchical structure between clusters, and regards the samples within the CF nodes as the final clusters.
As a result, the clustering efficiency of BIRCH is clearly better than that of AGNES. However, optimizing the parameters of the CF tree is never a simple task, a problem that besets another hierarchical method, Chameleon, as well. Chameleon [15] performs well on data sets with clusters of arbitrary shapes and sizes, but the process is quite time-consuming. The traditional prototype-based methods are sensitive to the selection of initial prototypes; that is, improper initialization may lead to locally optimal solutions. Many practitioners therefore execute the algorithm repeatedly to generate better results, which is impractical for large data sets given the computational cost. Evolutionary optimization methods have proved to be effective alternatives for tackling this problem. Their parallel updating strategy allows them to search for the optimal solution more efficiently. What's more, potentially better solutions are generated based on the assessment of previous performance, leading to more compelling results than the traditional repetition strategy. Many evolutionary clustering methods have been proposed in the literature, involving different techniques such as encoding schemes, crossover operators, fitness functions and initialization processes [16]. Under an encoding scheme, each partition solution is represented by a string or a matrix, allowing the evolutionary process to be performed using various GA methods [17], [18], [19]. Crossover operators [20], [21] as well as fitness functions [22], [23], [24], [25] for evolutionary clustering have also been well studied. When it comes to the selection of the initial population, random initialization is often recommended for its simplicity and effectiveness [26], [27], [28], [29], [30].
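As a concrete illustration of the partition strategy described above, a minimal K-means can be written in a few lines of numpy (a generic sketch, not the exact implementation compared later in the paper; the data below are hypothetical):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: assign each point to its nearest center,
    then recompute the centers, until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # SSE has converged
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs: K-means recovers them easily
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(8, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
```

On elongated or ring-shaped clusters the same code fails, since the nearest-center rule draws spherical boundaries around the centroids, which is exactly the limitation noted above.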
The density-based methods are another significant family of clustering strategies. They provide a precise statistical notion of a cluster, defined in terms of features of the probability distribution underlying the data. The idea has been developed in two directions, namely model-based (parametric) clustering and modal (non-parametric) clustering. DBSCAN [31] and OPTICS [32] are two representative methods of the first kind, which assume that the distribution underlying the data follows some parametric form. Given a neighborhood radius ε and the minimum number of points MinPts required in a neighborhood, DBSCAN groups the points that are density-reachable and discards the points not reachable from any cluster as noise. However, determining proper thresholds is quite difficult, as they vary greatly from case to case. What's more, DBSCAN cannot identify clusters with varying densities using a fixed ε. OPTICS was proposed by Ankerst et al. to tackle these problems: it introduces a variable radius ε when selecting a data point to be included in a cluster and stops the process when ε grows rapidly. Non-parametric clustering, also known as modal clustering, is another important but less widespread kind of density-based method. The idea was first put forward by Carmichael et al. [33] and extended by Wishart [34] in the late 1960s. This approach defines clusters as dense regions of data points separated by sparse regions. Mode-hunting methods, such as mean-shift clustering [35] and its variants [36], [37], [38], [39], [40], are typical modal clustering methods that aim to connect each data point with the density modes, or patterns, underlying the data. In mean-shift, each unmarked point continuously moves toward the region of higher density gradient and, during the movement, absorbs the points within a circle of radius h (the bandwidth) into the same cluster.
Care must be taken in selecting the bandwidth, as an improper setting degrades the performance of mean-shift. Density-level-set methods [41], usually based on cluster tree estimators, are another branch of non-parametric clustering. They estimate a cluster tree and associate clusters with dense regions of the sample space rather than with modes or particular sample points. Many works have been carried out to estimate such a tree [42], [43], [44], while some researchers focus on combinations with graph theory [45], [46], [47], [48].
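The mean-shift procedure just described can be sketched with a flat kernel, where the update simply moves each point's mode estimate to the mean of the original points within the bandwidth h (an illustrative sketch; practical implementations typically use a Gaussian kernel and faster neighbor queries, and the data below are hypothetical):

```python
import numpy as np

def mean_shift(X, h, iters=100, tol=1e-6):
    """Flat-kernel mean shift: each point's mode estimate repeatedly
    moves to the mean of the original points within radius h of it."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            nbrs = X[np.linalg.norm(X - m, axis=1) <= h]
            shifted[i] = nbrs.mean(axis=0)
        moved = np.max(np.linalg.norm(shifted - modes, axis=1))
        modes = shifted
        if moved < tol:
            break
    # points whose mode estimates coincide (within h/2) form one cluster
    labels = -np.ones(len(X), dtype=int)
    k = 0
    for i in range(len(X)):
        if labels[i] == -1:
            labels[np.linalg.norm(modes - modes[i], axis=1) < h / 2] = k
            k += 1
    return labels, modes

# two tight groups of points converge to two distinct modes
X = [[0, 0], [0.3, 0], [0, 0.3], [5, 5], [5.3, 5], [5, 5.3]]
labels, modes = mean_shift(X, h=1.0)
```

The sensitivity to h is easy to reproduce here: with h large enough to bridge the two groups, all six points converge to a single mode and the clusters merge.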
DPC is an efficient density-based clustering algorithm proposed by Rodriguez and Laio [49]. It adopts a concise but effective categorizing strategy that assigns each data point to the same cluster as its nearest neighbor of higher density. However, it suffers from a so-called "chain reaction" due to this simplistic partition strategy: once a point with higher density is partitioned wrongly, its lower-density neighbors are likely to be misclassified as well. A fuzzy weighted K-nearest-neighbors DPC algorithm (FKNN-DPC) was introduced by Xie et al. [50] to reduce this accumulation of errors. FKNN-DPC assigns point i to a cluster based on the classification results of its k neighbors, thereby improving the accuracy of clustering and reducing the ripple effect. However, it does not provide an effective way to determine the value of k, an analogous drawback to that of DPC, where the cut-off distance dc must be selected by the user. Several variants of DPC were proposed to handle the dc selection problem, introducing the concepts of KNN [52], [53] and heat diffusion [54] to estimate the local densities of points. Nevertheless, estimating the local density as the number of points within a fixed radius dc can greatly lower the robustness of the algorithm, especially on small-scale data sets [51].
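For instance, a KNN-style local density in the spirit of the DPC-KNN family [52], [53] replaces the count of points within dc by a smooth function of the distances to the k nearest neighbors (this particular formula is only one common variant, not necessarily the exact one used in the cited papers):

```python
import numpy as np

def knn_density(X, k=5):
    """KNN-style local density: exp of the negative mean squared
    distance to the k nearest neighbors, so points in dense regions
    score close to 1 and isolated points score close to 0."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)                 # column 0 is the zero self-distance
    knn_d = d[:, 1:k + 1]          # distances to the k nearest neighbors
    return np.exp(-(knn_d ** 2).mean(axis=1))
```

Unlike the cut-off kernel, this density involves no fixed radius and varies smoothly with the data, which makes it less brittle on small data sets, though the choice of k remains with the user.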
In order to take advantage of DPC whilst avoiding the aforementioned drawbacks, we propose a robust clustering algorithm named IVDPC. It can easily identify the main structure (veins) of clusters and classify the remaining points precisely. For each data point i, the local density is estimated through a non-parametric density estimation method, eliminating the reliance on the user-defined parameter dc in DPC. Then the similarity matrix between points is calculated, and the most similar pairs are connected successively, extending from the high-density regions to the edges of the clusters. After the construction of the veins, the remaining points are assigned to the nearest vein. The main process is shown in Fig. 1. This assigning strategy allows IVDPC to adapt well to the geometry of non-spherical shapes and to reduce the chain reaction of DPC.
The rest of the paper is organized as follows: Section 2 briefly discusses related work, Section 3 presents the details of the new algorithm, Section 4 demonstrates the performance of IVDPC and Section 5 concludes our work.
Density peaks based algorithm
DPC is able to detect non-spherical clusters and performs the assignment in a single step. It assumes that cluster centers have the highest local densities and lie relatively far from any point of higher density. To accomplish the clustering task, the method introduces two significant quantities, namely the local density ρi of point i and the distance δi from point i to its nearest point with higher density. Specifically, the local density ρi of each point i is computed via the cut-off kernel of Eq. (1), ρi = Σj χ(dij − dc), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, dij is the distance between points i and j, and dc is the cut-off distance.
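Assuming Euclidean distances, the two quantities can be computed directly (a sketch of the standard DPC definitions, not the paper's code):

```python
import numpy as np

def dpc_quantities(X, dc):
    """DPC's two quantities: rho (cut-off-kernel local density) and
    delta (distance to the nearest point of higher density).  Also
    returns that nearest higher-density neighbor, which DPC follows
    when assigning non-center points to clusters."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (d < dc).sum(axis=1) - 1           # exclude the point itself
    order = np.argsort(-rho)                 # densest point first
    delta = np.empty(len(X))
    nn = np.full(len(X), -1)                 # -1 marks the global density peak
    delta[order[0]] = d[order[0]].max()      # convention for the densest point
    for pos in range(1, len(order)):
        i = order[pos]
        higher = order[:pos]                 # points at least as dense as i
        j = higher[d[i, higher].argmin()]
        delta[i] = d[i, j]
        nn[i] = j
    return rho, delta, nn
```

Points with both large ρ and large δ are chosen as centers; every other point inherits the label of its nearest higher-density neighbor nn[i], which is exactly what propagates an error down the chain when nn[i] is misassigned.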
Algorithm proposed
To take advantage of DPC whilst avoiding the aforementioned drawbacks, a robust clustering algorithm named IVDPC is proposed in this section. In general, IVDPC first computes the local densities of points by Eq. (4) instead of Eq. (1) or Eq. (2), eliminating the reliance on the user-defined parameter dc in DPC. It then calculates the similarity matrix of the points and connects the most similar pairs successively, from the high-density regions to the edges of the clusters, to build the veins of the clusters. The remaining points are then assigned to the nearest vein.
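The density-estimation step can be illustrated with a plain Gaussian kernel density estimate (a generic sketch with a fixed, user-chosen bandwidth h; the paper's Eq. (4) and its bandwidth-selection rule may differ):

```python
import numpy as np

def gaussian_kde_density(X, h):
    """Gaussian KDE evaluated at each sample point: a smooth,
    radius-free alternative to DPC's cut-off-kernel density."""
    X = np.asarray(X, dtype=float)
    n, dim = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))           # n x n kernel matrix
    return K.sum(axis=1) / (n * (np.sqrt(2 * np.pi) * h) ** dim)
```

Because every sample contributes to every density value with a smoothly decaying weight, small perturbations of the data change the estimate only slightly, in contrast to the hard count inside a fixed radius dc.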
Experiment and analysis
In this section, we evaluate the performance of IVDPC on synthetic and real-world data sets against several state-of-the-art methods, including DPC, DBSCAN, K-means, mean-shift, AGNES and NJW. Two mainstream criteria, namely ARI and NMI, are adopted to analyze the performance of these methods. The test data sets are collected from the literature or generated by toolbox, and contain clusters with various distributions, shapes and densities. The details of these data sets are described as follows.
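Of the two criteria, ARI is straightforward to compute from the contingency table of the two partitions; a minimal numpy version follows (a sketch for illustration; the experiments presumably rely on standard implementations):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: chance-corrected agreement between two
    partitions.  1.0 means identical partitions (up to relabelling),
    values near 0.0 mean agreement no better than chance."""
    t = np.asarray(labels_true)
    p = np.asarray(labels_pred)
    n = len(t)
    # contingency table of co-occurring labels
    _, ti = np.unique(t, return_inverse=True)
    _, pi = np.unique(p, return_inverse=True)
    C = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(C, (ti, pi), 1)
    comb2 = lambda x: x * (x - 1) // 2       # number of pairs in a group
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                # both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Because ARI is invariant to label permutation, it compares the partitions themselves rather than the arbitrary cluster numbering each algorithm produces.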
Conclusions
In this paper, a robust clustering algorithm named IVDPC is proposed. By connecting the most similar pairs successively, IVDPC constructs veins that identify the main structure of the data set. Instead of representing a cluster by one central point, the veins allow IVDPC to adapt well to the geometry of non-spherical shapes and to reduce the chain reaction of DPC. What's more, kernel density estimation is adopted to estimate the density distribution of the data, and therefore the reliance on the user-defined cut-off distance dc in DPC is eliminated.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant No. 61304118), the Program for New Century Excellent Talents in University (NCET-13-0456) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20130201120011).
References (86)
Towards parameter-independent data clustering and image segmentation, Pattern Recognit. (2016).
An overlapping community detection algorithm based on density peaks, Neurocomputing (2017).
Community detection in complex networks using density-based clustering algorithm and manifold learning, Physica A (2016).
An evolutionary technique based on k-means algorithm for optimal clustering, Inf. Sci. (2002).
A genetic c-means clustering algorithm applied to color image quantization, Pattern Recognit. (1997).
Automatic clustering via outward statistical testing on density metrics, IEEE Trans. Knowl. Data Eng. (2016).
Study on density peaks clustering based on k-nearest neighbors and principal component analysis, Knowl.-Based Syst. (2016).
Clustering by fast search and find of density peaks via heat diffusion, Neurocomputing (2016).
Recent developments in nonparametric density estimation, J. Am. Stat. Assoc. (1991).
Wind speed model based on kernel density estimation and its application in reliability assessment of generating systems, J. Mod. Power Syst. Clean Energy (2017).
Plug-in bandwidth matrices for bivariate kernel density estimation, J. Nonparametric Stat.
Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation, Probab. Theory Relat. Fields.
Clustering aggregation, ACM Trans. Knowl. Discovery Data.
A maximum variance cluster algorithm, IEEE Trans. Pattern Anal. Mach. Intell.
Data clustering: a review, ACM Comput. Surv.
Depth estimation by parameter transfer with a lightweight model for single still images, IEEE Trans. Circuits Syst. Video Technol.
Detecting communities on topic of transportation with sparse crowd annotations, IEEE Trans. Intell. Transp. Syst.
Identification of the clustering structure in microbiome data by density clustering on the Manhattan distance, Sci. China Inf. Sci.
DensityCut: an efficient and versatile topological approach for automatic clustering of biological data, Bioinformatics.
Computer folding of RNA tetraloops: identification of key force field deficiencies, J. Chem. Theory Comput.
A distance and density-based clustering algorithm using automatic peak detection, IEEE International Conference on Smart Cloud.
Survey of clustering algorithms, IEEE Trans. Neural Netw.
On spectral clustering: analysis and an algorithm, International Conference on Neural Information Processing Systems.
How many clusters? Which clustering method? Answers via model-based cluster analysis, Comput. J.
BIRCH: an efficient data clustering method for very large databases, ACM SIGMOD International Conference on Management of Data.
Chameleon: hierarchical clustering using dynamic modeling, Computer.
A survey of evolutionary algorithms for clustering, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev.
Selection of cluster prototypes from data by a genetic algorithm, European Congress on Intelligent Techniques and Soft Computing.
Genetic k-means algorithm, IEEE Trans. Syst. Man Cybern. Part B.
Incremental genetic k-means algorithm and its application in gene expression data analysis, BMC Bioinf.
A hybrid algorithm for k-medoid clustering of large data sets, IEEE Congress on Evolutionary Computation.
Genetic algorithm based clustering technique, Pattern Recognit.
Clustering gene expression profiles with memetic algorithms, Parallel Problem Solving from Nature (PPSN VII).
Self-adaptive genetic algorithm for clustering.
A genetic algorithm for cluster analysis, Intell. Data Anal.
Improving the efficiency of a clustering genetic algorithm, Advances in Artificial Intelligence (IBERAMIA).
An evolutionary clustering algorithm for gene expression microarray data analysis, IEEE Trans. Evol. Comput.
Clustering using genetic algorithm combining validation criteria, European Symposium on Artificial Neural Networks.
Evolution-based tabu search approach to automatic clustering, IEEE Trans. Syst. Man Cybern. Part C.
A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD'96).
OPTICS: ordering points to identify the clustering structure, SIGMOD Conference.
Finding natural clusters, Syst. Zool.