
1 Introduction

DBSCAN is a density-based clustering algorithm whose effectiveness on spatial datasets has been demonstrated in the existing literature [1]. However, DBSCAN and its extensions have two distinct drawbacks: (1) clustering performance depends on two user-specified parameters, the maximum radius of a neighborhood (Eps) and the minimum number of data points contained in this neighborhood (MinPts). Together these two parameters define a single density, and without sufficient prior knowledge they are difficult to determine; (2) because these two parameters describe a single density, DBSCAN does not perform well on datasets with varying densities.

For example, in Fig. 1(a), DBSCAN fails to find the four clusters, because the dataset contains four different densities and the clusters are not completely separated by sparse regions. In Fig. 1(b), DBSCAN either discovers only the three small clusters and treats the two large clusters as noise, or merges the three small clusters into one cluster in order to find the two large clusters. These problems arise from using global values for the parameters (Eps, MinPts).

Fig. 1. Clusters with varying densities

Semi-supervised clustering algorithms have received a significant amount of attention in the data mining and machine learning fields. Unlike traditional clustering algorithms, semi-supervised clustering (also known as constrained clustering) refers to techniques that incorporate prior information, such as pairwise constraints, into the clustering process. Pairwise constraints provide supervision in the form of must-link (ML) and cannot-link (CL) constraints: a must-link constraint specifies that a pair of instances must be assigned to the same cluster, while a cannot-link constraint specifies that the pair must be placed in different clusters.
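As an illustration, a set of pairwise constraints over point indices could be represented as follows. This is only a minimal sketch; the class and method names are our own and are not part of the paper:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical representation of must-link / cannot-link constraints over point indices.
final class PairwiseConstraints {
    // Each constraint is stored as an unordered pair encoded as the key "i:j" with i < j.
    private final Set<String> mustLink = new HashSet<>();
    private final Set<String> cannotLink = new HashSet<>();

    private static String key(int a, int b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    void addMustLink(int a, int b)   { mustLink.add(key(a, b)); }
    void addCannotLink(int a, int b) { cannotLink.add(key(a, b)); }

    boolean isMustLink(int a, int b)   { return mustLink.contains(key(a, b)); }
    boolean isCannotLink(int a, int b) { return cannotLink.contains(key(a, b)); }
}
```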

In this paper, we propose a semi-supervised clustering algorithm (called SemiDen) that discovers clusters of different densities and arbitrary shapes. The idea is to partition the dataset into different density levels and compute the density parameters for each density level set; pairwise constraints are then used to guide the expansion of clusters based on the computed density parameters. Evaluating the SemiDen algorithm on real datasets confirms that it gives better results than other semi-supervised and unsupervised density-based approaches. In summary, our contribution in this paper is the clustering of multi-density datasets with arbitrarily shaped clusters using pairwise constraints.

2 Clustering Multi-density Data

In this section, we propose a semi-supervised density-based clustering (SemiDen) algorithm that can find clusters of varying densities, shapes and sizes, even in the presence of noise and outliers. The proposed algorithm is divided into two main parts: (1) partitioning the dataset into different density levels; (2) using pairwise constraints for expanding the clustering process for each density level. We summarize our semi-supervised clustering (SemiDen) algorithm in Algorithm 1.

First, we describe how the dataset is partitioned into different density levels. The algorithm begins by finding the k-nearest neighbors of each point in the given dataset. Based on these neighbors, a local density function is used to estimate the density at each point, where the local density at point x is defined as the sum of the distances between x and its k-nearest neighbors, as shown in Eq. (1).

$$ DEN(x) = \sum_{i=1}^{k} D(x, y_i) $$
(1)

where \( D(x, y_i) \) is the Euclidean distance between point x and its i-th nearest neighbor \( y_i \), computed as in Eq. (2).

$$ D(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2} $$
(2)
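For concreteness, the following sketch computes Eqs. (1) and (2) by brute force, finding each point's k nearest neighbors by sorting all pairwise distances. The paper does not prescribe a particular k-NN implementation, so this O(n²) version and its class and method names are only illustrative:

```java
import java.util.Arrays;

final class LocalDensity {

    // Euclidean distance of Eq. (2).
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) {
            double d = x[j] - y[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // DEN(x) of Eq. (1): sum of distances from each point to its k nearest neighbors.
    static double[] densities(double[][] data, int k) {
        int n = data.length;
        double[] den = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) dists[idx++] = distance(data[i], data[j]);
            }
            Arrays.sort(dists);
            double sum = 0.0;
            for (int m = 0; m < k && m < dists.length; m++) sum += dists[m];
            den[i] = sum;
        }
        return den;
    }
}
```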

After computing the local density for each data point, we sort the points in ascending order of density and compute the density variation between every two adjacent points \( p_i \) and \( p_{i+1} \), denoted by \( DENVAR(p_i, p_{i+1}) \). This yields a density variation list (DVList), in which each element is the density variation between two adjacent points in the sorted order.

$$ DENVAR(p_i, p_{i+1}) = \frac{\left| DEN(p_{i+1}) - DEN(p_i) \right|}{DEN(p_i)} $$
(3)

For datasets with widely varying densities, there will be a few distinct jumps in the density variation, whereas for points within the same density level the variation remains small. Thus, we can obtain all density level sets by detecting these distinct jumps.
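A minimal sketch of this step, assuming the density array from the previous listing: the densities are sorted in ascending order and Eq. (3) is applied to each pair of adjacent values to obtain DVList (densities are assumed to be strictly positive so that the ratio is well defined):

```java
import java.util.Arrays;

final class DensityVariation {

    // DVList of Eq. (3): relative variation between adjacent densities
    // after sorting them in ascending order.
    static double[] dvList(double[] densities) {
        double[] sorted = densities.clone();
        Arrays.sort(sorted);
        double[] dv = new double[sorted.length - 1];
        for (int i = 0; i < dv.length; i++) {
            dv[i] = Math.abs(sorted[i + 1] - sorted[i]) / sorted[i];
        }
        return dv;
    }
}
```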

Definition 1 (Density Level Set). A density level set (DLS) consists of points whose densities are approximately the same; in other words, the density variations of the data points within the same DLS should be relatively small. Points \( p_i \) and \( p_j \) belong to the same DLS if they satisfy the following condition:

$$ p_i, p_j \in DLS_k \quad \text{if} \quad DENVAR(p_i, p_j) \le \tau $$

where \( \tau \) is a density variation threshold which divides a multi-density dataset into several density level sets.

We apply the partitioning method to DVList. Given a density variation threshold \( \tau \) (Definition 1), we remove from DVList all DENVAR values larger than \( \tau \); the points in the remaining separated segments are then treated as different density level sets. We compute \( \tau \) from the statistical characteristics of DVList as follows:

$$ \tau = E\left( {DVList} \right) + \sigma \left( {DVList} \right) $$
(4)

where E is the mathematical expectation (mean) and σ the standard deviation of DVList. In DVList, only a small number of elements have large DENVAR values, and it is these values that divide the dataset into different sets according to the threshold \( \tau \).
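The threshold of Eq. (4) and the segmentation of the density-sorted points into density level sets could be sketched as follows; a split is made wherever a DENVAR value exceeds \( \tau \), and the helper names are again our own, illustrative choices:

```java
import java.util.ArrayList;
import java.util.List;

final class DensityLevelSets {

    // tau of Eq. (4): mean plus standard deviation of DVList.
    static double threshold(double[] dvList) {
        double mean = 0.0;
        for (double v : dvList) mean += v;
        mean /= dvList.length;
        double var = 0.0;
        for (double v : dvList) var += (v - mean) * (v - mean);
        return mean + Math.sqrt(var / dvList.length);
    }

    // Split the indices 0..n-1 of the density-sorted points into segments:
    // a new density level set starts after every DENVAR value larger than tau.
    static List<List<Integer>> partition(double[] dvList, double tau) {
        List<List<Integer>> levels = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        current.add(0);
        for (int i = 0; i < dvList.length; i++) {
            if (dvList[i] > tau) {
                levels.add(current);
                current = new ArrayList<>();
            }
            current.add(i + 1);
        }
        levels.add(current);
        return levels;
    }
}
```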

After partitioning the dataset into different density level sets, we need to find representative values for the parameters Eps and MinPts. We initialize MinPts to k (the number of nearest neighbors) and identify the value of Eps for each density level. For a given density level set (DLS), simply choosing the maximum DEN value would overestimate the corresponding Eps, because some points may be border objects or noise and these points influence the Eps value. To deal with this problem, we compute \( Eps_i \) for \( DLS_i \) as follows:

$$ Eps_i = maxDEN(DLS_i) \cdot \sqrt{\frac{medianDEN(DLS_i)}{meanDEN(DLS_i)}} $$
(5)

where maxDEN, meanDEN, and medianDEN are the maximum, mean, and median density of \( DLS_i \), respectively.
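An illustrative sketch of Eq. (5), given the DEN values of the points in one density level set (the method name is an assumption, not taken from the paper):

```java
import java.util.Arrays;

final class EpsEstimator {

    // Eps_i of Eq. (5): maxDEN scaled by sqrt(medianDEN / meanDEN),
    // which damps the influence of border points and noise on the maximum.
    static double eps(double[] levelDensities) {
        double[] d = levelDensities.clone();
        Arrays.sort(d);
        double max = d[d.length - 1];
        double mean = 0.0;
        for (double v : d) mean += v;
        mean /= d.length;
        double median = (d.length % 2 == 1)
                ? d[d.length / 2]
                : (d[d.length / 2 - 1] + d[d.length / 2]) / 2.0;
        return max * Math.sqrt(median / mean);
    }
}
```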

Finally, having computed the Eps parameter for each density level and initialized MinPts to k, we use the pairwise constraints to expand the clustering process for each density level as follows (a sketch of these steps is given after the list):

  • In Step 11(a), we check whether a point belongs to a cluster or to the noise set. The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) must contain at least a minimum number of points (MinPts). We therefore compute the point's Eps-neighborhood; if it contains fewer than MinPts points, the point is added to the noise set.

  • In Step 11(b), we satisfy the must-link constraints. If the point participates in any must-link constraints, all points linked to it by must-link constraints are assigned to the current cluster, so that the must-link constraints are satisfied.

  • In Step 11(c), we satisfy the cannot-link constraints. Before adding a point p to the current cluster, we must ensure that doing so does not violate any cannot-link constraint. If there is a point q in the current cluster such that {p, q} ∈ CL, adding p would violate the constraint, so p is not assigned to the current cluster.
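The sketch below illustrates how the noise test of Step 11(a) and the constraint checks of Steps 11(b) and 11(c) might look in code. The constraint encoding, helper names, and cluster representation are our own assumptions and do not reproduce the paper's Algorithm 1:

```java
import java.util.List;
import java.util.Set;

final class ConstraintExpansion {

    // Constraints are assumed to be stored as sets of "i:j" keys with i < j
    // (one set for must-link, one for cannot-link); this encoding is our own choice.
    private static String key(int a, int b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    // Step 11(a): a point is noise if its Eps-neighborhood holds fewer than MinPts points.
    static boolean isNoise(List<Integer> epsNeighborhood, int minPts) {
        return epsNeighborhood.size() < minPts;
    }

    // Step 11(b): pull every point that is must-linked with p into the current cluster.
    static void enforceMustLink(int p, int numPoints, List<Integer> currentCluster,
                                Set<String> mustLink) {
        for (int q = 0; q < numPoints; q++) {
            if (q != p && mustLink.contains(key(p, q)) && !currentCluster.contains(q)) {
                currentCluster.add(q);
            }
        }
    }

    // Step 11(c): adding p violates a cannot-link constraint if some member q
    // of the current cluster forms a pair {p, q} in CL.
    static boolean violatesCannotLink(int p, List<Integer> currentCluster,
                                      Set<String> cannotLink) {
        for (int q : currentCluster) {
            if (cannotLink.contains(key(p, q))) return true;
        }
        return false;
    }
}
```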

3 Experimental Results

In this section, we present experimental results for the SemiDen algorithm on a variety of datasets, including synthetic datasets and several real-world datasets. We implemented our algorithm in Java; all experiments were run on a 2.4 GHz Intel Core 2 PC with 2 GB of main memory running Windows XP.

Besides the proposed algorithm, we also implemented the competing and baseline methods listed below for comparison.

  1. APSCAN: an unsupervised clustering algorithm that uses affinity propagation for clustering datasets with varying densities [2].

  2. HISSCLU: a hierarchical semi-supervised density-based clustering algorithm. HISSCLU uses the parameters ρ and ξ to establish borders between clusters when there are no clear cluster boundaries. To maximally preserve the original cluster structure, HISSCLU is recommended to be set up with ρ = 1.0 and ξ = 0.5 [3].

  3. C-DBSCAN: a density-based semi-supervised clustering algorithm, based on DBSCAN, for clustering datasets with arbitrary structures. C-DBSCAN depends on two specified parameters (Eps and MinPts); we set Eps = 0.5 and MinPts = 4 (the default values in DBSCAN) [4].

  4. SSDBSCAN: a semi-supervised density-based clustering algorithm that automatically finds density parameters for each natural cluster in a dataset [5].

The experiments were performed on datasets from the UCI repository (yeast, segment, digits-389, and magic). These datasets cover a good range of characteristics: the numbers of samples range from 1,484 to 19,020, the dimensionalities from 8 to 19, and the numbers of clusters from 2 to 10.

Figure 2 shows the NMI results over different numbers of pairwise constraints on the real datasets. It can be observed from Fig. 2 that our algorithm, SemiDen, generally performs better than the four other methods as the number of constraints increases (e.g., on yeast, segment, and magic).

We also notice that the constraint-based clustering algorithms generally outperform the traditional clustering algorithms. It can be seen from Fig. 2 that the performance of APSCAN is constant across all constraint counts on every dataset, since it is an unsupervised clustering algorithm. This tends to confirm the utility of constraint-based clustering over unsupervised approaches when expert knowledge is available.

To evaluate the efficiency of the clustering algorithms, we compare the average CPU time consumed by each semi-supervised clustering algorithm for different numbers of pairwise constraints, as shown in Fig. 3.

Fig. 2. Comparison of normalized mutual information over different numbers of pairwise constraints

Fig. 3. Comparison of execution time for the semi-supervised clustering algorithms