
1 Introduction

Data clustering has important applications in many fields [19], and a huge number of clustering algorithms have been proposed in recent decades. In addition to the well-known k-means algorithm, DBSCAN [12] and normalized cuts (NCuts) [35] are widely adopted baselines in data clustering, and many variants of them have been proposed [2, 8, 9]. Traditional clustering algorithms also include hierarchical clustering [42] and distribution-based algorithms [39]. In recent developments, SPRG [43], a spectral clustering method [29, 31], has been shown to perform well by constructing a special affinity matrix. The affinity propagation (AP) algorithm [5] takes the pairwise similarity matrix as input and passes affinity messages among data points iteratively to identify clusters gradually. Different from many partitioning-based algorithms, the dominant sets (DSets) algorithm [20,21,22, 30, 32, 36, 37] defines the dominant set as a cluster concept. It takes the pairwise similarity matrix as input and extracts clusters sequentially, without depending on any parameters. Other well-studied clustering approaches include subspace clustering [15, 25], multi-view clustering [26] and others [4, 11, 13, 27, 33, 41].

Density-based clustering has the attractive property of detecting clusters of arbitrary shapes. A typical example of this type is the DBSCAN algorithm [12], which defines a density threshold with the parameters Eps and MinPts to differentiate between cluster members and noise. The OPTICS algorithm [2] is a generalization of DBSCAN that generates a hierarchical clustering result. The DeLi-Clu algorithm [1] further combines OPTICS with single-linkage clustering, thereby removing the parameter Eps completely. Other works related to DBSCAN include [7, 17, 24]. The density peak (DP) algorithm [34] adopts a different idea from DBSCAN. It treats local density peaks as candidates for cluster centers, and makes use of the density relationship between neighboring data points to group data points into clusters. While simple, this algorithm has shown significant potential in many experiments. Recent works related to the DP algorithm include [3, 28].

In the DBSCAN algorithm, a density threshold is defined by the minimum number MinPts of points in a neighborhood of radius Eps. Core points with density above the threshold, together with the points in their neighborhoods, are grouped into clusters, and the remaining points are treated as noise. While this practice helps detect noise, low-density data points on the border of clusters may be mistakenly treated as noise. In the case that different clusters have significantly different densities, an entire low-density cluster may be treated as noise. With the DP algorithm, cluster centers are detected first, and the other data points are then grouped into clusters around these centers. The clustering results depend heavily on the detected cluster centers, yet a reliable method to detect cluster centers is still not available. Consequently, the DP algorithm is not guaranteed to produce good clustering results.

Noticing the nice properties and the problems of the DBSCAN and DP algorithms, we study these two algorithms and find that their properties are complementary to some extent. This observation motivates us to merge the two algorithms to overcome their problems. Our algorithm can be described informally as follows. The DP clustering process consists of two steps, namely cluster center identification and grouping of non-center data. We insert DBSCAN clustering between these two steps and obtain a three-step clustering algorithm. Specifically, we detect cluster centers following the DP algorithm in the first step. After that, we perform DBSCAN clustering with the cluster centers as seeds. Finally, the remaining unclustered data points are grouped following the DP algorithm. Both synthetic and real datasets are adopted in experiments, and the results show that our algorithm overcomes the problems of DBSCAN and DP and improves the clustering results noticeably. Our algorithm also compares favorably with other commonly used clustering algorithms, including the k-means, NCuts, DBSCAN, AP and DSets algorithms.

In Sect. 2 the DBSCAN and DP algorithms are introduced briefly and their properties are discussed. Section 3 provides the details on how DBSCAN and DP are merged to overcome their drawbacks and improve the clustering results. Our algorithm is validated with experiments and compared with other algorithms in Sect. 4. Finally, Sect. 5 summarizes the conclusions.

2 Algorithm Basis

As we plan to combine the DBSCAN and DP algorithms to overcome their drawbacks, we first introduce these two algorithms and briefly discuss their properties.

2.1 DBSCAN

As one popular density-based clustering algorithm, DBSCAN detects clusters sequentially based on the so-called density-reachable cluster model. With the parameter Eps denoting a neighborhood radius and MinPts denoting the minimum count of data points in this neighborhood, the algorithm defines a density threshold and groups data points into clusters based on it. Specifically, core points whose densities are above the threshold and the points in the neighborhoods of core points are grouped into clusters, and the remaining points are regarded as noise. In implementation, we begin with an arbitrary data point p and calculate its density, namely the number of data points in its Eps neighborhood. If the density is smaller than the threshold, we move to the next point. Otherwise, p is identified as a core point and its neighbors in the Eps neighborhood are included in the cluster. Then for each cluster member, we continue to check whether it is a core point, and if so, add its neighbors to the cluster. After all the cluster members have been traversed, we obtain the first cluster. By repeating the same procedure on the unclustered data, we obtain the other clusters.
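
To make this procedure concrete, the following minimal Python sketch grows clusters from core points as described above. All names are ours, and a practical implementation would use a spatial index instead of brute-force neighborhood queries.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch with brute-force neighborhood queries.
    Returns a label per point; -1 means the point is treated as noise."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.full(len(X), -1)
    cluster_id = 0
    for p in range(len(X)):
        if labels[p] != -1:                      # already assigned to a cluster
            continue
        neighbors = np.where(dist[p] <= eps)[0]
        if len(neighbors) < min_pts:             # density below the threshold: skip
            continue
        labels[p] = cluster_id                   # p is a core point: seed a new cluster
        queue = list(neighbors)
        while queue:                             # expand the cluster through core points
            q = queue.pop()
            if labels[q] != -1:
                continue
            labels[q] = cluster_id
            q_neighbors = np.where(dist[q] <= eps)[0]
            if len(q_neighbors) >= min_pts:      # q is also a core point: keep expanding
                queue.extend(q_neighbors)
        cluster_id += 1
    return labels
```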

In addition to the ability to extract clusters of arbitrary shapes, DBSCAN has some other notable properties. Different from many algorithms that obtain all clusters simultaneously from a partitioning process, DBSCAN detects clusters one by one. In detecting each cluster, core points of large density and the neighbors of core points are included in the cluster.

2.2 DP

The DP algorithm uses a different idea from DBSCAN to perform density-based clustering. It first defines two key quantities \(\rho \) and \(\delta \). With the cutoff kernel, the local density \(\rho _i\) of a data point i is calculated as the number of data points in the \(d_c\)-radius neighborhood of i, i.e.,

$$\begin{aligned} \rho _i=\sum _{j \in S, j \ne i} \chi (d_c-d_{ij}). \end{aligned}$$
(1)

Here S denotes the set of data points, \(d_{ij}\) is the distance between data points i and j, \(d_c\) is the cutoff distance specified by the user, and \(\chi (x)=1\) if \(x>0\) and \(\chi (x)=0\) otherwise. The quantity \(\delta \) is the distance between a data point and its nearest neighbor of higher density, i.e.,

$$\begin{aligned} \delta _i= \min _{j \in S,\rho _j>\rho _i} d_{ij}. \end{aligned}$$
(2)

For the point with the highest density, \(\delta \) is conventionally set to its maximum distance to the other points. The clustering with DP is built on two assumptions: cluster centers are local density peaks distant from each other, and a data point belongs to the same cluster as its nearest neighbor of higher density. From the first assumption, cluster centers have both large \(\rho \) and large \(\delta \). Since non-center data points usually do not have this property, it is used to detect the cluster centers. The clustering of non-center data points is then accomplished based on the second assumption. We sort the non-center data points in decreasing order of local density, and then assign each data point to the same cluster as its nearest neighbor of higher density. As the labels of cluster centers (density peaks) are already known, this process can be accomplished efficiently.

In this algorithm, local density peaks are first detected and used as cluster centers, and then the second assumption is used to group the non-center data points into clusters. In other words, large-density data points are grouped into clusters first, and then the smaller-density ones are included. In this way, the clusters are obtained in a cluster-expansion mode, with data points added to clusters in decreasing order of local density.
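
The quantities in Eqs. (1) and (2) and the large-density-first assignment can be computed directly from the distance matrix. Below is a minimal sketch of this plain DP procedure; the function and variable names are ours.

```python
import numpy as np

def dp_quantities(dist, d_c):
    """Cutoff-kernel density (Eq. 1), delta (Eq. 2) and each point's nearest
    neighbor of higher density, computed from the pairwise distance matrix."""
    n = len(dist)
    rho = (dist < d_c).sum(axis=1) - 1           # exclude the point itself
    delta = np.zeros(n)
    nn_higher = np.full(n, -1)
    order = np.argsort(-rho)                     # indices in decreasing density order
    delta[order[0]] = dist[order[0]].max()       # convention for the global density peak
    for rank in range(1, n):
        i, higher = order[rank], order[:rank]    # ties in rho broken by the sort order
        j = higher[np.argmin(dist[i, higher])]
        delta[i], nn_higher[i] = dist[i, j], j
    return rho, delta, nn_higher

def dp_cluster(dist, d_c, n_centers):
    """Plain DP clustering: the n_centers points with largest gamma = rho * delta
    become centers, the rest are assigned in decreasing-density order."""
    rho, delta, nn_higher = dp_quantities(dist, d_c)
    centers = np.argsort(-rho * delta)[:n_centers]
    labels = np.full(len(dist), -1)
    labels[centers] = np.arange(n_centers)
    for i in np.argsort(-rho):                   # large-density-first assignment
        if labels[i] == -1:                      # assumes the global peak is a center
            labels[i] = labels[nn_higher[i]]
    return labels
```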

3 Our Algorithm

By investigating the DBSCAN and DP algorithms, we identify some major drawbacks of these two algorithms. Furthermore, we find that these two algorithms have some complementary properties. We then propose to merge the merits of these two algorithms to overcome their drawbacks. The details of our algorithm are presented below.

The DBSCAN algorithm defines core points as data points with density above a density threshold. In clustering, only the core points and their neighbors in the Eps neighborhood are grouped into clusters, and the low-density data points distant from large-density ones are treated as noise. In other words, only the large-density data points and their Eps-neighborhood neighbors can be grouped into clusters. While this practice is helpful for finding noise, it may mistakenly classify some low-density data points as noise. Taking the Gaussian-distributed cluster illustrated in Fig. 1(a) as an example, we observe that the border data points have evidently smaller density than the central ones. In this case, the low-density border points may be misclassified as noise if they are not close enough to the large-density ones. To address this problem, we turn to the DP algorithm. After cluster center identification, the DP algorithm groups the other data points into clusters under the assumption that a data point belongs to the same cluster as its nearest neighbor of higher density. Based on this assumption, a data point can be included in a cluster only if its nearest neighbor of higher density is already in the cluster. Consequently, in extracting each cluster, the large-density data points are included first, followed by the low-density ones. In other words, with the DP algorithm clusters are generated in a cluster-expansion manner, and the expansion proceeds from large-density data points to low-density ones.

Fig. 1.

Three cases that are difficult for the DBSCAN algorithm. The top row shows the original datasets, and the bottom row shows the DBSCAN clustering results, where the black circles indicate points detected as noise. In the leftmost case, the low-density border points are not grouped into clusters. In the middle case, the entire low-density cluster is detected as noise due to a large density threshold. In the rightmost case, two clusters are detected as one due to a small density threshold.

Based on the above discussion, we present the following method to solve the problem of DBSCAN using DP. With DBSCAN we extract a cluster by including core points and their neighbors in the Eps neighborhood. Instead of extracting the next cluster immediately, we then perform a cluster-expansion step with the DP algorithm. We first sort the unclustered data points in increasing order of their average distance to the cluster. For each unclustered data point i, we add it to the cluster if its nearest neighbor of higher density has already been included in the cluster; otherwise, we leave i unclustered. By introducing cluster expansion with DP, we are able to include the low-density data points in clusters, no matter how far they are from the large-density data. Furthermore, as we add low-density data points to the cluster based on DP, the cluster expansion is unlikely to include data points of other clusters and merge neighboring clusters into one. We still use Fig. 1(a) as the example to explain this effect. At the boundary between the two clusters, each data point is assigned the same label as its nearest neighbor of higher density. This means that the cluster expansion will not cross the density valley and include data points of the other cluster. By separating clusters at density valleys, the cluster expansion naturally matches the idea of density-based clustering.
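
A minimal sketch of this expansion step is given below. The names are ours, and `nn_higher` stores each point's nearest neighbor of higher density, as in the DP sketch of Sect. 2.2.

```python
import numpy as np

def expand_cluster(dist, rho, nn_higher, labels, cluster_id):
    """DP-based expansion of one initial cluster (sketch).
    labels: current labels with -1 for unclustered points;
    nn_higher: each point's nearest neighbor of higher density."""
    members = np.where(labels == cluster_id)[0]
    unclustered = np.where(labels == -1)[0]
    if len(members) == 0 or len(unclustered) == 0:
        return labels
    # visit unclustered points in increasing average distance to the initial cluster
    avg_dist = dist[np.ix_(unclustered, members)].mean(axis=1)
    for i in unclustered[np.argsort(avg_dist)]:
        j = nn_higher[i]
        if j >= 0 and labels[j] == cluster_id:   # its denser neighbor is already a member,
            labels[i] = cluster_id               # so i joins the cluster as well
    return labels
```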

With the DP-based cluster expansion, we are able to group the low-density data points in the border area into clusters, on the condition that each cluster from DBSCAN is a subset of one real cluster. However, with the original DBSCAN algorithm the density threshold is fixed, and the generated clusters may fail to satisfy this condition, especially when there is a large density variance across clusters. With a large density threshold, the large-density clusters can be detected, whereas entire low-density clusters may be treated as noise (Fig. 1(b)). This problem cannot be solved by DP-based expansion, which terminates at the boundary between clusters. In contrast, with a small density threshold, the data points in the boundary area between clusters may be treated as core points. In this case, two or more neighboring clusters are likely to be merged into one (Fig. 1(c)). This problem cannot be solved by DP-based expansion either.

Since the DBSCAN algorithm cannot deal with datasets of large density variance using a fixed density threshold, we propose to determine the density threshold for each cluster adaptively. For this purpose, our approach is to find some seeds in a cluster and infer the density threshold from the seeds. In the DBSCAN algorithm, only core points of large density and their neighbors can be included in clusters; therefore the seeds must not have too small a density. In the DP algorithm, the first step is to detect the cluster centers, which are local density peaks. The detected cluster centers thus have the largest density in their clusters and are most suitable to be used as seeds. Consequently, we can use the density of a cluster center to derive the density threshold of its cluster. Considering that cluster centers are local density peaks and have the largest density in their clusters, we determine the parameters Eps and MinPts as follows. MinPts is set to 4, which is selected based on the experiments presented in Sect. 4. We then set Eps as the distance between a cluster center and its \(2*MinPts\)-th nearest neighbor. In this way, we obtain a density threshold that is smaller than the density of the cluster center, allowing neighboring data points of smaller density to be clustered. As a DP-based cluster-expansion step follows DBSCAN clustering to include low-density data points, the density threshold determined here, even if somewhat large, is not a big problem.
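
In code, the adaptive Eps for one cluster reduces to a single sorted-distance lookup. The following sketch assumes the pairwise distance matrix and the detected center are available; the names are ours.

```python
import numpy as np

MIN_PTS = 4   # fixed value, following the experiments in Sect. 4

def eps_for_center(dist, center):
    """Adaptive Eps for one cluster: the distance from the detected cluster
    center to its 2*MinPts-th nearest neighbor."""
    d = np.sort(dist[center])        # d[0] = 0 is the center itself
    return d[2 * MIN_PTS]            # the 2*MinPts-th nearest neighbor
```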

The whole process of our algorithm is as follows. The first step is to calculate the pairwise distance matrix D, from which the local density \(\rho \) and distance \(\delta \) of each data point are obtained. The data points with the largest \(\gamma =\rho \delta \) are then selected as cluster centers. Starting from each cluster center, we determine Eps and MinPts and perform DBSCAN clustering to obtain an initial cluster. The DP algorithm is then used to expand the initial cluster to include low-density border data points, yielding the final cluster. The other clusters are obtained in the same way. The detailed procedure of our algorithm is presented formally in Algorithm 1.

Algorithm 1. The proposed clustering algorithm.
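
As the formal listing is not reproduced here, the following sketch outlines the three steps in Python. It reuses `dp_quantities` and `expand_cluster` from the earlier sketches in this paper, and the cutoff-distance rule (1.5% of the pairwise distances) is an assumption on our part rather than part of Algorithm 1.

```python
import numpy as np

def grow_from_seed(dist, labels, seed, cid, eps, min_pts):
    """DBSCAN-style growth of one initial cluster from a given seed point."""
    labels[seed] = cid
    queue = [seed]
    while queue:
        q = queue.pop()
        q_neighbors = np.where(dist[q] <= eps)[0]
        if len(q_neighbors) >= min_pts:          # q is a core point
            for r in q_neighbors:
                if labels[r] == -1:
                    labels[r] = cid
                    queue.append(r)
    return labels

def cluster(X, n_clusters, min_pts=4):
    """Three-step procedure: DP centers, per-cluster DBSCAN, DP-based expansion."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pairs = np.sort(dist[np.triu_indices(n, k=1)])
    d_c = pairs[int(0.015 * len(pairs))]                 # assumed cutoff rule (1.5%)
    rho, delta, nn_higher = dp_quantities(dist, d_c)     # see the sketch in Sect. 2.2
    centers = np.argsort(-rho * delta)[:n_clusters]      # Step 1: DP cluster centers
    labels = np.full(n, -1)
    for cid, c in enumerate(centers):
        eps = np.sort(dist[c])[2 * min_pts]              # Step 2: adaptive Eps per cluster
        labels = grow_from_seed(dist, labels, c, cid, eps, min_pts)
        labels = expand_cluster(dist, rho, nn_higher, labels, cid)  # Step 3: DP expansion
    return labels
```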

By merging the merits of DBSCAN and DP, our algorithm has the following properties. First, by using cluster centers as the seeds of DBSCAN clustering, we are able to determine the DBSCAN parameters for each cluster adaptively. This improves the clustering results when there is a large density variance across clusters. Second, with DP-based cluster expansion, the low-density border data points are grouped into clusters, and the expansion does not cross the boundary between clusters. This helps to deal with clusters whose border points have significantly smaller density than the central ones. Third, by combining DBSCAN with DP, our algorithm solves the problems of the DBSCAN algorithm and at the same time avoids the drawbacks of the DP algorithm. With the DP algorithm, a cluster center is a local density peak and has the largest density in its cluster, and all the other cluster members have smaller density. In other words, after a cluster center is identified, only data points of smaller density have the opportunity to be included in the cluster. Consequently, cluster center identification influences the clustering results significantly. Due to the complexity of data distributions and the imperfection of the identification method, the identified cluster centers are not guaranteed to be the density peaks of their respective clusters, and inaccurate cluster centers inevitably degrade the clustering results. In our algorithm, DBSCAN is used to generate initial clusters containing the detected cluster centers, and DP is then used to expand the initial clusters into final clusters. Since DBSCAN clustering groups data points of sufficiently large density into clusters, the real cluster centers with the largest density will be included in the initial clusters. Consequently, our algorithm is less influenced by inaccuracy in cluster center identification.

It is worth noting that both DBSCAN and DP are able to detect noise. However, as discussed above, with DBSCAN noise is detected based on a density threshold, and an inappropriate threshold may misclassify low-density data points as noise. The DP algorithm instead defines a border region for each cluster, and cluster members with density smaller than an adaptive threshold derived from that region are considered noise. We therefore adopt the method of DP to detect noise.
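
For reference, below is a sketch of the DP-style border-region rule we adopt, following our reading of [34] (names are ours): for each cluster, the border region contains its members lying within \(d_c\) of points of other clusters, and members whose density falls below the highest density in this region are marked as noise.

```python
import numpy as np

def dp_noise(dist, rho, labels, d_c):
    """Border-region noise detection in the style of the DP algorithm (sketch).
    Returns a boolean mask that is True for points regarded as noise."""
    noise = np.zeros(len(rho), dtype=bool)
    for cid in np.unique(labels[labels >= 0]):
        members = np.where(labels == cid)[0]
        others = np.where((labels >= 0) & (labels != cid))[0]
        if len(others) == 0:
            continue
        # border region: cluster members within d_c of a point from another cluster
        border = members[(dist[np.ix_(members, others)] < d_c).any(axis=1)]
        if len(border) == 0:
            continue
        rho_b = rho[border].max()                # adaptive density threshold
        noise[members] = rho[members] < rho_b    # low-density members become noise
    return noise
```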

4 Experiments

4.1 Comparison with DBSCAN and DP

As our algorithm is proposed to combine the DBSCAN and DP algorithms for better performance, we first compare it with these two algorithms to see whether improvements are obtained. For the DBSCAN algorithm, the parameter Eps is calculated based on MinPts following [10], and MinPts is selected from 2, 3, \(\cdots \), 10. For the DP algorithm, we implement two versions with the cutoff and Gaussian kernels, denoted by DP-cutoff and DP-Gaussian, respectively. In both versions, the parameter \(d_c\) is chosen so that a given percentage of data points fall in the \(d_c\)-radius neighborhood, and the percentage is selected from 1.0%, 1.1%, \(\cdots \), 2.0%.

The experiments are conducted on eight synthetic and seven real datasets. The eight synthetic datasets are Aggregation [16], Compound [40], D31 [38], Flame [14], Jain [23], Pathbased [6], R15 [38] and Spiral [6]. All of them are composed of 2D points of specially designed shapes. From the UCI machine learning repository we take seven real datasets, namely Breast, Glass, Iris, Thyroid, Wdbc, Wine and Yeast. Among these, the Wdbc and Breast datasets consist of data used for breast cancer diagnosis, and the Thyroid dataset is for thyroid disease diagnosis. The Wine, Iris and Glass datasets consist of features extracted from different kinds of wine, iris flowers and glass, respectively. The Yeast dataset is designed to classify the cellular localization sites of proteins. All these datasets have ground-truth clustering results, and we use normalized mutual information (NMI) to evaluate the clustering results. The comparison of our algorithm with DBSCAN, DP-cutoff and DP-Gaussian is shown in Fig. 2.

Fig. 2.

Clustering results comparison of DBSCAN, DP-cutoff, DP-Gaussian and our algorithm.

In comparison with DBSCAN, DP-cutoff and DP-Gaussian, Fig. 2(a) shows that our algorithm generates the best results on 12 of the 15 datasets. Among the remaining three, the best results on the Aggregation and Flame datasets are very close to the ground truth, and our results are close to the best ones. Only on the Spiral dataset is our algorithm clearly outperformed by DBSCAN and DP-Gaussian. Notably, the Jain dataset is composed of two clusters of significantly different densities, and our algorithm generates a nearly perfect result on it, showing an evident advantage over the DBSCAN and DP algorithms in dealing with density variance across clusters. We believe these observations show that our algorithm does merge the merits of the DBSCAN and DP algorithms and improves on both.

In the experiments, our algorithm does not improve on DBSCAN and DP on three datasets, namely Aggregation, Flame and Spiral. We discuss the reason as follows. As shown in Fig. 3(a), (b) and (c), in all three datasets the data points in each cluster are distributed rather evenly, and there is no large density difference between clusters. Such datasets are easy for the DBSCAN and DP algorithms, and some of them generate perfect results: DP-cutoff and DP-Gaussian on Aggregation, DP-Gaussian on Flame, and DBSCAN and DP-Gaussian on Spiral. In this case there is little room for improvement, and our combination of the two algorithms is outperformed by the individually tuned algorithms.

Fig. 3.

Some 2D datasets used in experiments.

4.2 Comparison with Other Algorithms

In the comparison with other algorithms, we adopt the k-means, NCuts, DSets, AP and SPRG [43] algorithms as well as the one proposed in [18]. For the k-means, NCuts and SPRG algorithms, we set the number of clusters to the ground-truth value and report the average results of five runs. For the DSets algorithm, we build the similarity matrix with \(s(x,y)=\exp (-d(x,y)/\sigma )\), where \(\sigma =k\overline{d}\), \(\overline{d}\) is the mean of the pairwise distances, and k is selected from 1, 2, 5, 10, 20, \(\cdots \), 100. The input parameter of the AP algorithm is the preference value p, and the authors of [5] published code to calculate its range [\(p_{min},p_{max}\)]. We set this parameter to \(p_{min}+m\lambda \), where \(\lambda =(p_{max}-p_{min})/10\) and m is selected from 1, 2, \(\cdots \), 9, 9.1, 9.2, \(\cdots \), 9.9. The algorithm proposed in [18] involves no parameters. The comparison of results is reported in Table 1, with NMI as the evaluation criterion.
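
For reproducibility, the DSets similarity matrix and the AP preference sweep described above can be generated as in the sketch below (names are ours; \(p_{min}\) and \(p_{max}\) are assumed to be obtained from the code published with [5]).

```python
import numpy as np

def dsets_similarity(dist, k):
    """Similarity matrix s(x, y) = exp(-d(x, y) / sigma) with sigma = k * mean
    pairwise distance, as used for the DSets algorithm in our comparison."""
    d_bar = dist[np.triu_indices(len(dist), k=1)].mean()
    return np.exp(-dist / (k * d_bar))

def ap_preferences(p_min, p_max):
    """The preference values p_min + m * lambda swept for the AP algorithm."""
    lam = (p_max - p_min) / 10.0
    ms = list(range(1, 10)) + [9.0 + 0.1 * t for t in range(1, 10)]
    return [p_min + m * lam for m in ms]
```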

Table 1. Comparison of clustering results (NMI).

In the following we discuss the results shown in Table 1. Our algorithm performs the best on seven datasets (Compound, Flame, Glass, Iris, Jain, R15, Spiral), and near-best on six datasets (Aggregation, Breast, D31, Pathbased, Wdbc, Yeast). Only on two datasets (Thyroid and Wine) is our algorithm clearly outperformed by the SPRG algorithm. Our algorithm also obtains the best average result (18% higher than the second best) and the second smallest standard deviation. This comparison demonstrates the advantage of our algorithm in clustering accuracy. As to the observation that SPRG performs evidently better than our algorithm on the Thyroid and Wine datasets, we discuss the reason as follows. SPRG is a spectral clustering method, and its contribution lies in building a robust affinity graph for clustering. Noticing that NCuts, also a spectral clustering algorithm, performs much worse than SPRG on these two datasets, the good results of SPRG should be attributed to its specially designed affinity graph.

4.3 Discussions

Similar to many existing clustering algorithms, our algorithm involves some parameters, including the number of clusters, MinPts and Eps, which are inherited from the DBSCAN and DP algorithms. In this paper we assume that the number of clusters is pre-determined by the user, and we discuss the other two parameters as follows. We have not found a method to determine the optimal values of MinPts and Eps automatically, so these two parameters must be tuned in experiments. Intuitively, tuning them separately for each dataset would generate better results than those reported in our experiments. However, in order to reduce the cost of tuning, we have tried to find a fixed MinPts and a fixed rule for Eps for all datasets. These fixed settings may not be optimal for each individual dataset, but they generate the best overall results. We determine Eps as the distance between a cluster center and its \(2*MinPts\)-th nearest neighbor, so that a moderate number of data points are used in density estimation. This leaves us with only one parameter, MinPts. On the 15 datasets introduced above, we test values of MinPts from 2 to 10 and obtain the clustering results shown in Fig. 4(a) and (b). The results are shown in two subfigures to differentiate between datasets more clearly.
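
The MinPts sweep can be reproduced with a few lines. The sketch below assumes a `cluster(X, n_clusters, min_pts)` function implementing the procedure of Sect. 3 and uses scikit-learn's NMI implementation.

```python
from sklearn.metrics import normalized_mutual_info_score

def sweep_min_pts(X, y_true, n_clusters, values=range(2, 11)):
    """Evaluate the algorithm for MinPts = 2, ..., 10 on one dataset;
    `cluster` is assumed to be the three-step procedure sketched in Sect. 3."""
    scores = {}
    for m in values:
        labels = cluster(X, n_clusters, min_pts=m)
        scores[m] = normalized_mutual_info_score(y_true, labels)
    return scores
```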

Fig. 4.

Influence of the MinPts parameter on our algorithm.

From Fig. 4(a) and (b) we observe that in the range [3, 6], varying MinPts has little influence on the clustering results for most datasets, except for the Jain and Spiral datasets, and larger values of MinPts rarely improve the clustering results further. On the Wine and Wdbc datasets there is a slight performance improvement from \(MinPts=3\) to \(MinPts=4\), and on the Wine, Thyroid and Flame datasets we observe an evident decrease in clustering quality from \(MinPts=4\) to larger values. While \(MinPts=4\) generates the worst result on Spiral, it performs the best on Jain. Together, these observations indicate that \(MinPts=4\) is a suitable fixed value for this parameter. In fact, Fig. 4(c) shows that the average result on the 15 datasets peaks at \(MinPts=4\), and this value also coincides with the one recommended in the original DBSCAN paper [12]. Therefore in our algorithm we fix MinPts to 4 to avoid parameter tuning. As to why varying MinPts has a large influence on the results on the Spiral, Jain and Flame datasets, our opinion is that this results from the special structure of these datasets shown in Fig. 3(b), (c) and (d). The Spiral and Jain datasets have very special, non-spherical shapes, which make the density estimation sensitive to the number of data points used. As for the Flame dataset, its two clusters are composed of evenly distributed data and the border between them is not distinct. This may be one reason why this dataset experiences a drastic performance degradation at \(MinPts=7\).

In this paper we propose to combine the merits of DBSCAN and DP to achieve better clustering results, and the experimental results show that our algorithm does improve the clustering results. However, the combination of the two algorithms leads to a larger computational load, as extracting each cluster involves running both the DBSCAN and DP clustering steps. In future work, we plan to make full use of existing calculation results to reduce the computational load. As both DBSCAN and DP involve a large amount of distance calculation, density calculation and nearest-neighbor search, it is possible to perform these calculations only once and thereby avoid repetitive calculation.

Finally, we notice that most of the synthetic datasets in our experiments are composed of clusters of even density and are well suited to density-based algorithms; our algorithm is therefore favored on these datasets. However, on the 7 real datasets, which are not constructed for density-based algorithms, our algorithm performs the best on 2 datasets (Glass and Iris) and second best on 3 datasets (Breast, Thyroid and Wdbc). Even on the remaining 2 datasets (Wine and Yeast), the differences between our results and the best ones are not large. We believe these comparisons show that our algorithm works well not only for even-density clusters, but also for clusters of other types. In our opinion, the reason for the good performance of our algorithm is as follows. Our algorithm combines the merits of DBSCAN and DP. DBSCAN is a typical density-based clustering algorithm and is effective on datasets whose clusters are composed of evenly distributed data points. The DP algorithm groups data points into clusters in a large-density-first order and is suitable for datasets whose clusters have Gaussian-like distributions. Consequently, our algorithm is effective for both even-density and Gaussian-distributed clusters. Considering that in many datasets the cluster distributions belong to one of these two types or a mixture of the two, it is not surprising that our algorithm performs well on these datasets.

5 Conclusions

Motivated by the complementary properties of the DBSCAN and density peak algorithms, we present a robust clustering algorithm that combines the two. In the first step, cluster centers are detected with the density peak algorithm. We then use the cluster centers as the seeds of DBSCAN clustering and determine the DBSCAN parameters for each cluster adaptively. Finally, the clusters from DBSCAN are expanded with the DP algorithm, so that low-density border data points are included in the clusters. By making use of the merits of the density peak algorithm, our algorithm solves two major problems of DBSCAN: a single set of fixed parameters does not work well with clusters of varied densities, and low-density border data points may be treated as noise. At the same time, by inserting DBSCAN clustering between the two steps of the DP algorithm, our algorithm is less influenced by inaccuracy in cluster center identification. In experiments, our algorithm outperforms the DBSCAN and density peak algorithms on most of the adopted datasets, and it also compares favorably with several commonly used clustering algorithms. These results show that our algorithm is effective in improving the clustering results.

Our algorithm depends on the DP algorithm to detect cluster centers, which are then used to determine the parameters of DBSCAN. However, we have found that the cluster centers detected by DP are not accurate in some cases, which in turn affects the DBSCAN parameters and the clustering results. In future work, we plan to explore better methods for detecting large-density regions to serve as seeds for DBSCAN.