
Knowledge-Based Systems

Volume 159, 1 November 2018, Pages 309-320

Robust clustering by identifying the veins of clusters based on kernel density estimation

https://doi.org/10.1016/j.knosys.2018.06.021

Highlights

  • A robust clustering algorithm (IVDPC) is proposed to solve the "chain reaction" and cut-off distance selection problems of DPC.

  • A new similarity coefficient, an extension of the γ defined in DPC, is introduced to represent the relevance between points.

  • The local density is estimated through a non-parametric density estimation method so as to eliminate the reliance on the user-defined parameter dc.

  • Clusters are characterized by veins rather than by one representative point, which allows IVDPC to identify the main structure of clusters in a more visual and precise way.

  • The robustness of the algorithm with respect to the choice of input parameters is verified via statistical methods.

Abstract

Clustering by fast search and find of density peaks (DPC) is an efficient clustering algorithm proposed by Rodriguez and Laio [49]. It adopts a concise but effective categorizing strategy that assigns data points to the same cluster as their nearest neighbors with higher densities. However, it suffers from the so-called "chain reaction" due to this simplistic strategy. Moreover, the accuracy of DPC depends heavily on the selection of the cut-off distance dc as the data scale varies. In order to take advantage of DPC whilst avoiding the aforementioned drawbacks, this paper proposes a robust clustering algorithm named IVDPC, which provides a feasible approach to classifying data with different shapes and distributions. The local density is first estimated through a non-parametric density estimation method. Then, by calculating the similarity matrix of the points and connecting the most similar pairs continuously from high-density regions to the edges of clusters, IVDPC identifies the main structure (veins) of the clusters and assigns the remaining samples precisely to the nearest vein. Having veins rather than a single representative point represent a cluster allows IVDPC to adjust well to the geometry of non-spherical shapes and to reduce the chain reaction of DPC. The proposed method is benchmarked on artificial and real-world data sets against several baseline methods. The experimental results demonstrate that IVDPC can recognize the structural distribution of clusters and achieves better clustering accuracy than several state-of-the-art algorithms.

Introduction

Clustering is a major tool of data mining for uncovering the potential patterns of data and extracting the information implied in them. In cluster analysis, a group of objects is divided into several subgroups mainly based on their similarities, such that the objects within a subgroup are similar to each other and the subgroups are well separated [1]. Up until now, clustering has been widely used in different fields and disciplines such as image processing [2], [3], community detection [4], [5], [6], microbiology [7], [8], genetics [9], etc. Meanwhile, many kinds of clustering algorithms have been developed, which can be roughly categorized into distance-based methods and density-based methods [10].

The varied distance-based clustering strategies can be further divided into partition-based, hierarchical and hybrid methods [11]. K-means, for instance, is a representative partition-based algorithm proposed by MacQueen in 1967. It assigns points to the nearest centers and iteratively refines the centers until the sum of squared errors converges to a minimum. However, because of this spherical classification strategy, K-means cannot detect arbitrarily shaped clusters, a limitation that plagues centroid-based methods in general [11]. Spectral clustering algorithms such as NJW [12] form another type of partition-based method. NJW computes the K eigenvectors corresponding to the K largest eigenvalues of the similarity matrix and completes the clustering in the resulting K-dimensional space. It has lower time complexity than traditional clustering algorithms, but its performance is not satisfactory when K is large. Affinity Propagation (AP) is a message-passing partitioning method different from those above. It defines two quantities, the responsibility r(i, k) and the availability a(i, k), and updates them for every sample iteratively during the clustering process. In the end, the kth sample maximizing r(i, k) + a(i, k) is regarded as the cluster center of point i. AP is more robust and accurate than k-centers methods in most cases, but its time complexity is much higher. Agglomerative Nesting (AGNES) [13] is a typical hierarchical clustering strategy. It successively merges the clusters with the highest similarity until the appointed number of clusters has been obtained. AGNES can easily reveal the hierarchical relations between classes, but at the expense of efficiency. Similar to AGNES, BIRCH [14], proposed by Tian Zhang, is another classical hierarchical method. It constructs a CF tree by scanning the data set only once to describe the hierarchical structure between different clusters, and regards the samples within the CF nodes as the final clusters. As a result, the clustering efficiency of BIRCH is markedly better than that of AGNES. However, optimizing the parameters needed by the CF tree is never a simple task, a problem that besets another hierarchical method, Chameleon, as well. Chameleon [15] performs well on data sets with arbitrary shapes and sizes, but the process is quite time-consuming. The traditional prototype-based methods are sensitive to the selection of initial prototypes; that is, improper initialization may lead to locally optimal solutions. As a result, many practitioners execute the algorithm repeatedly to generate better results, which is impractical for large data sets considering the computational cost. Evolutionary optimization methods have proved to be effective alternatives for tackling this problem. Their parallel updating strategy allows them to search for the optimal solution more efficiently. What's more, potentially better solutions are generated based on an assessment of previous performance, leading to more compelling results than the traditional repetition strategy. Many evolutionary clustering methods have been proposed in the literature, involving different techniques such as encoding schemes, crossover operators, fitness functions and initialization processes [16]. Under an encoding scheme, each partition solution is represented by a string or a matrix, allowing the evolutionary process to be performed using various GA methods [17], [18], [19]. Crossover operators [20], [21] as well as fitness functions [22], [23], [24], [25] for evolutionary clustering have also been well studied in the literature. When it comes to the selection of the initial population, a random initialization strategy is often recommended for its simplicity and effectiveness [26], [27], [28], [29], [30].
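As a concrete illustration of the partition-based strategy sketched above, the following is a minimal Lloyd-style K-means in Python. It is our own sketch, not code from any of the cited works; the seeding and stopping rule are deliberately simplified.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's K-means: assign points to the nearest centers,
    then refine the centers until the assignments stabilize."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its members
        # (keep the old center if a cluster happens to be empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # sum of squared errors has converged
            break
        centers = new
    return labels, centers
```

The Euclidean assignment step is exactly the spherical classification strategy criticized above: decision boundaries are hyperplanes between centers, so non-convex clusters cannot be recovered.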

The density-based methods are another significant kind of clustering strategy. They provide a precise statistical notion in which clusters are defined by features of the probability distribution underlying the data. The idea has been developed in two directions, namely model-based (parametric) clustering and modal (non-parametric) clustering. DBSCAN [31] and OPTICS [32] are two representative model-based methods, which assume the distribution underlying the data follows some parametric form. By setting the neighborhood radius ε and the minimum number of points MinPts included in the neighborhood, DBSCAN groups the points that are density-reachable and discards as noise the points that are not reachable from any cluster. However, determining the proper thresholds is quite difficult, as they vary greatly between cases. What's more, DBSCAN cannot identify clusters with varying densities using a fixed ε. OPTICS was proposed by Ankerst et al. to tackle these problems. It introduces a variable radius ε when selecting a data point to be included in a cluster and stops the process when ε grows rapidly. Non-parametric clustering, also known as modal clustering, is another important but less widespread kind of density-based method. The idea, first put forward by Carmichael et al. [33] and extended by Wishart [34], dates back to the late 1960s. This approach, to be more exact, defines clusters as dense regions of data points separated by sparse regions. Mode-hunting methods, such as mean-shift clustering [35] and its variants [36], [37], [38], [39], [40], are typical modal clustering methods that aim to connect each data point with the density modes or patterns underlying the data. In mean-shift, each unmarked point moves continuously towards the region of higher density gradient and absorbs into the same cluster the points within a circle of radius h (the bandwidth) during the movement. The selection of the bandwidth requires care, because an improper setting of this parameter degrades the performance of mean-shift. Density-level-set based methods [41], usually using cluster tree estimators, are another branch of non-parametric clustering strategies. They estimate a cluster tree and associate the clusters with dense regions of the sample space instead of with modes or particular sample points. Many works have been carried out to estimate such a tree [42], [43], [44], while some researchers focus on combinations with graph theory [45], [46], [47], [48].
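The mean-shift behavior described above can be sketched as follows. This is a simplified flat-kernel version for illustration only, not any of the cited implementations; the bandwidth h and the mode-merging threshold of h/2 are our own choices.

```python
import numpy as np

def mean_shift(X, h, n_iter=50, tol=1e-4):
    """Flat-kernel mean-shift: each point climbs towards higher density by
    repeatedly moving to the mean of its h-neighborhood."""
    modes = X.copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            neigh = X[np.linalg.norm(X - m, axis=1) <= h]  # points inside radius h
            shifted[i] = neigh.mean(axis=0)                # shift towards the mode
        if np.linalg.norm(shifted - modes) < tol:
            break
        modes = shifted
    # Points whose trajectories converge to nearby modes share a cluster.
    labels = -np.ones(len(X), dtype=int)
    cluster_modes = []
    for i, m in enumerate(modes):
        for c, cm in enumerate(cluster_modes):
            if np.linalg.norm(m - cm) <= h / 2:  # merge modes closer than h/2
                labels[i] = c
                break
        else:
            cluster_modes.append(m)
            labels[i] = len(cluster_modes) - 1
    return labels, np.array(cluster_modes)
```

The sensitivity noted above is visible here: a too-small h fragments one cluster into many modes, while a too-large h merges distinct clusters.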

DPC is an efficient density-based clustering algorithm proposed by Rodriguez and Laio [49]. The algorithm adopts a concise but effective categorizing strategy that assigns data points to the same cluster as their nearest neighbors with higher densities. However, it suffers from the so-called "chain reaction" due to this simplistic partition strategy. Specifically, once a point with higher density is partitioned wrongly, its neighbors with lower densities are more likely to be misclassified. A fuzzy weighted K-nearest-neighbors based DPC algorithm (FKNN-DPC) was introduced by Xie et al. [50] to reduce this accumulation of errors. FKNN-DPC assigns point i to a cluster based on the classification results of its k neighbors, thereby improving the accuracy of clustering and reducing the ripple effect. However, it does not provide an effective way to determine the value of k, an analogous drawback being present in DPC as well, where the cut-off distance dc must be selected by the user. Several variants of DPC were proposed to handle the dc selection problem. Nevertheless, estimating the local density as the number of points within a fixed radius dc can greatly lower the robustness of the algorithm, especially when clustering data sets of small scale [51]. The concepts of KNN [52], [53] and heat diffusion [54] have also been introduced to estimate the local densities of points.
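The single-step assignment that causes the chain reaction can be made concrete: each non-center point inherits the label of its nearest neighbor of higher density, so one early mistake propagates down the whole density-ordered chain. The sketch below is our own illustration, with densities assumed to be already computed and distinct.

```python
import numpy as np

def dpc_assign(D, rho, center_labels):
    """DPC-style single-step assignment.
    D: (n, n) pairwise distances; rho: (n,) local densities (assumed distinct);
    center_labels: {point index: cluster id} for the chosen centers
    (the global density maximum must be among them)."""
    labels = np.full(len(rho), -1)
    for i, c in center_labels.items():
        labels[i] = c
    # Visit points from dense to sparse, so each point's nearest
    # higher-density neighbor is already labeled when we reach it.
    for i in np.argsort(-rho):
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]       # candidates with higher density
        nearest = higher[np.argmin(D[i, higher])]
        # One wrong label here misleads every lower-density point that
        # hangs off this one: the "chain reaction".
        labels[i] = labels[nearest]
    return labels
```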

In order to take advantage of DPC whilst avoiding the aforementioned drawbacks, we propose a robust clustering algorithm named IVDPC. It can easily identify the main structure (veins) of clusters and classify the remaining points precisely. For each data point i, the local density is estimated through a non-parametric density estimation method, so as to eliminate the reliance on the user-defined parameter dc in DPC. Then the similarity matrix between the points is calculated, and the most similar pairs are connected continuously, extending from the high-density regions to the edges of the clusters. After the construction of the veins, the remaining points are assigned precisely to the nearest vein. The main process is shown in Fig. 1. This assignment strategy allows IVDPC to adjust well to the geometry of non-spherical shapes and to reduce the chain reaction of DPC.
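The overall flow can be sketched as follows. This is a purely hypothetical illustration: the paper's actual similarity coefficient and vein-growth stopping rule are defined in Section 3 and are not reproduced here, so plain distance stands in for the similarity, and a fixed vein size is our own crude stopping rule.

```python
import numpy as np
from scipy.spatial.distance import cdist

def vein_pipeline(X, seeds, h, vein_size):
    """Hypothetical sketch of a vein-based flow (not the paper's exact method).
    seeds: one high-density point per cluster; h: KDE bandwidth (assumption);
    vein_size: crude stopping rule for vein growth (assumption)."""
    D = cdist(X, X)
    # 1) Non-parametric (Gaussian KDE) local densities, standing in for Eq. (4).
    rho = np.exp(-0.5 * (D / h) ** 2).sum(axis=1)
    # 2) Grow each vein from its seed towards the cluster edge, always
    #    attaching the most similar unattached point (distance as a
    #    stand-in for the paper's similarity coefficient).
    veins = [[s] for s in seeds]
    used = set(seeds)
    for vein in veins:
        while len(vein) < vein_size:
            rest = [j for j in range(len(X)) if j not in used]
            j = min(rest, key=lambda j: D[vein, j].min())
            vein.append(j)
            used.add(j)
    # 3) Assign every point to the cluster of its nearest vein.
    dist_to_vein = np.stack([D[:, v].min(axis=1) for v in veins])
    return dist_to_vein.argmin(axis=0), rho
```

Because the final assignment measures distance to a whole vein rather than to a single center, elongated and curved clusters are followed along their length instead of being cut into spherical pieces.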

The rest of the paper is organized as follows: Section 2 briefly discusses the related work, Section 3 presents the details of the new algorithm, Section 4 demonstrates the performance of IVDPC, and Section 5 concludes our work.

Section snippets

Density peaks based algorithm

DPC is able to detect non-spherical clusters and performs the assignment in a single step. It assumes that cluster centers have the highest local densities and relatively large distances to the other centers. In order to accomplish the clustering task, the method introduces two significant quantities, namely the local density ρi of point i and the distance δi from point i to its nearest point with higher density. Specifically, the local density ρi of each point i is computed via the cut-off kernel by Eq. (1)
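The snippet truncates before the equation itself; for reference, the standard definitions from Rodriguez and Laio's original paper are given below (the cut-off kernel for Eq. (1), plus the Gaussian-kernel alternative that DPC variants commonly number as Eq. (2)).

```latex
% Eq. (1): local density via the cut-off kernel
\rho_i = \sum_{j \neq i} \chi\!\left(d_{ij} - d_c\right),
\qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & \text{otherwise} \end{cases}

% Gaussian-kernel alternative (commonly numbered Eq. (2) in DPC variants)
\rho_i = \sum_{j \neq i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right)

% Distance to the nearest point of higher density
\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij},
\qquad \delta_i = \max_{j} d_{ij} \ \text{ for the point of highest density}
```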

Algorithm proposed

To take advantage of DPC whilst avoiding the aforementioned drawbacks, a robust clustering algorithm named IVDPC is proposed in this section. In general, IVDPC first computes the local densities of points by Eq. (4), instead of Eq. (1) or Eq. (2), to eliminate the reliance on the user-defined parameter dc in DPC. Then it calculates the similarity matrix of the points and connects the most similar pairs continuously from the high-density regions to the edges of the clusters to build the veins of the clusters. The
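Eq. (4) itself is not reproduced in this snippet; as a stand-in, the sketch below shows a standard multivariate Gaussian KDE with a rule-of-thumb bandwidth, which is one common way to obtain dc-free local densities. The paper's exact estimator and bandwidth selector may differ.

```python
import numpy as np

def kde_density(X):
    """Local densities via a multivariate Gaussian KDE with Scott's
    rule-of-thumb bandwidth (an assumption; the paper's Eq. (4) and its
    bandwidth selection are not reproduced in the snippet)."""
    n, d = X.shape
    h = n ** (-1.0 / (d + 4)) * X.std(axis=0, ddof=1).mean()  # Scott's rule
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    K = np.exp(-0.5 * (D / h) ** 2)                # Gaussian kernel matrix
    return K.sum(axis=1) / (n * (np.sqrt(2 * np.pi) * h) ** d)
```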

Experiment and analysis

In this section, we evaluate the performance of IVDPC on synthetic and real-world data sets against several state-of-the-art methods, including DPC, DBSCAN, K-means, mean-shift, AGNES and NJW. Two mainstream criteria, namely ARI and NMI, are adopted to analyze the performance of these methods. The test data sets are collected from the literature or generated by a toolbox, and contain clusters with various distributions, shapes and densities. The details of these statistics are described as follows.
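For reproducibility, both criteria can be computed with scikit-learn as below; the toy labels are illustrative only.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # ground-truth labels (toy example)
y_pred = np.array([1, 1, 0, 0, 2, 2])   # labels from a clustering method
ari = adjusted_rand_score(y_true, y_pred)           # in [-1, 1]; 1 = perfect
nmi = normalized_mutual_info_score(y_true, y_pred)  # in [0, 1]; 1 = perfect
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```

Both scores are invariant to label permutation, so the example above scores 1.0 on each even though the cluster ids differ.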

Conclusions

In this paper, a robust clustering algorithm named IVDPC is proposed. By continuously connecting the most similar pairs, IVDPC constructs veins that identify the main structure of data sets. Representing a cluster by veins instead of a single central point allows IVDPC to adjust well to the geometry of non-spherical shapes and to reduce the chain reaction of DPC. What's more, kernel density estimation is adopted to estimate the density distribution of the data sets; therefore, the reliance on the

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Grant No. 61304118), the Program for New Century Excellent Talents in University (NCET-13-0456) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20130201120011).

References (86)

  • T. Duong et al.

    Plug-in bandwidth matrices for bivariate kernel density estimation

    J. Nonparametric Stat.

    (2003)
  • P. Hall et al.

    Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation

Probab. Theory Relat. Fields

    (1987)
  • A.J. Izenman

    Recent developments in nonparametric density estimation

    J. Am. Stat. Assoc.

    (1991)
  • A. Gionis et al.

Clustering aggregation

    ACM Trans. Knowl. Discovery Data

    (2005)
  • C.J. Veenman et al.

    A maximum variance cluster algorithm

IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • A.K. Jain et al.

    Data clustering: a review

ACM Comput. Surv.

    (1999)
  • H. Qin

    Depth estimation by parameter transfer with a lightweight model for single still images

    IEEE Trans. Circuits Syst. Video Technol.

    (2017)
  • J. Cao et al.

    Detecting communities on topic of transportation with sparse crowd annotations

    IEEE Trans. Intell. Transp. Syst.

    (2017)
  • X. Jiang et al.

Identification of the clustering structure in microbiome data by density clustering on the Manhattan distance

    Sci. China Inf. Sci.

    (2016)
  • J. Ding et al.

DensityCut: an efficient and versatile topological approach for automatic clustering of biological data

    Bioinformatics

    (2016)
• P. Kührová

    Computer folding of RNA tetraloops: identification of key force field deficiencies

    J. Chem. Theory Comput.

    (2016)
  • R. Zhou

    A distance and density-based clustering algorithm using automatic peak detection

    IEEE International Conference on Smart Cloud

    (2016)
  • R. Xu et al.

    Survey of clustering algorithms

    IEEE Trans. Neural Netw.

    (2005)
  • A.Y. Ng et al.

    On spectral clustering: analysis and an algorithm

    International Conference on Neural Information Processing Systems: Natural and Synthetic.

    (2001)
  • C. Fraley et al.

    How many clusters? which clustering method? answers via model-based cluster analysis

    Comput. J.

    (1998)
  • T. Zhang et al.

BIRCH: an efficient data clustering method for very large databases

    ACM SIGMOD International Conference on Management of Data.

    (1996)
  • G. Karypis et al.

    Chameleon: hierarchical clustering using dynamic modeling

    Computer

    (1999)
  • E.R. Hruschka et al.

    A survey of evolutionary algorithms for clustering

    IEEE Trans. Syst. Man Cybern. Part C Appl. Rev.

    (2009)
  • L.I. Kuncheva et al.

    Selection of cluster prototypes from data by a genetic algorithm

    European Congress on Intelligent Techniques and Soft Computing.

    (1997)
  • K. Krishna et al.

    Genetic k-means algorithm

IEEE Trans. Syst. Man Cybern. Part B

    (1999)
  • Y. Lu

    Incremental genetic k-means algorithm and its application in gene expression data analysis

BMC Bioinf.

    (2004)
  • W. Sheng et al.

    A hybrid algorithm for k-medoid clustering of large data sets

    IEEE Congress on Evolutionary Computation.

    (2004)
• U. Maulik et al.

    Genetic algorithm based clustering technique

    Pattern Recognit.

    (2004)
  • P. Merz et al.

    Clustering gene expression profiles with memetic algorithms

    Parallel Problem Solving from Nature - PPSN VII.

    (2002)
  • O. Nevalainen

    Self-Adaptive Genetic Algorithm for Clustering.

    (2003)
  • E.R. Hruschka et al.

    A genetic algorithm for cluster analysis

    Intell. Data Anal.

    (2003)
  • E.R. Hruschka

    Improving the efficiency of a clustering genetic algorithm.

    Advances in Artificial Intelligence - IBERAMIA.

    (2004)
  • P.C.H. Ma

    An evolutionary clustering algorithm for gene expression microarray data analysis

    IEEE Trans. Evol. Comput.

    (2006)
  • M. Coelho et al.

    Clustering using genetic algorithm combining validation criteria.

    European Symposium on Artificial Neural Networks.

    (2012)
  • S.M. Pan et al.

    Evolution-based tabu search approach to automatic clustering

IEEE Trans. Syst. Man Cybern. Part C

    (2007)
  • M. Ester

    A density-based algorithm for discovering clusters in large spatial databases with noise

Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD'96).

    (1996)
  • M. Ankerst

OPTICS: ordering points to identify the clustering structure

    SIGMOD Conference.

    (1999)
  • J.W. Carmichael et al.

    Finding natural clusters

    Syst. Zool.

    (1968)
  • Cited by (33)

    • An improved probability propagation algorithm for density peak clustering based on natural nearest neighborhood

      2022, Array
Citation excerpt:

      In most of the DPC variants, the idea of K-nearest neighbors is hybridized in the aggregation strategies. For instance, Zhou et al. [18] constructed the veins of clusters by connecting pairs with the highest similarity from the high-density regions to the cluster borders. The rest of the points are then assigned to the nearest veins.

    • A robust clustering algorithm based on the identification of core points and KNN kernel density estimation

      2022, Expert Systems with Applications
Citation excerpt:

      On the other hand, ICKDC can detect clusters with arbitrary shapes and densities and obtains the best clustering results in terms of the three evaluation metrics in most cases. Then, we compare the clustering results of ICKDC with three state-of-the-art DPC-variants, i.e., DPC-KNN (Du et al., 2016), FKNN-DPC (Xie et al., 2016) and IVDPC (Zhou et al., 2018) on real-world datasets. As we discussed in Section 2, most of the optimized variants of DPC try to improve the original method considering one or two defects.
