
1 Introduction

DBSCAN is a density-based clustering algorithm whose effectiveness on spatial datasets has been demonstrated in the existing literature [1]. However, DBSCAN and its extensions have two distinct drawbacks: (1) clustering performance depends on two user-specified parameters, the maximum radius of a neighborhood (Eps) and the minimum number of data points contained in this neighborhood (MinPts). Together these two parameters define a single density, and without sufficient prior knowledge they are difficult to determine; (2) because these two parameters describe a single density, DBSCAN does not perform well on datasets with varying densities.

For example, in Fig. 1(a), DBSCAN fails to find the four clusters, because the dataset contains four different densities and the clusters are not completely separated by sparse regions. In Fig. 1(b), DBSCAN either discovers only the three small clusters and treats the two large clusters as noise, or merges the three small clusters into one cluster in order to find the two large clusters. These problems arise from using global values for the parameters (Eps, MinPts).

Fig. 1. Clusters with varying densities

Semi-supervised clustering algorithms have received a significant amount of attention in the data mining and machine learning fields. Unlike traditional clustering algorithms, semi-supervised clustering (also known as constrained clustering) refers to techniques that incorporate prior information, such as pairwise constraints, into the clustering process. Pairwise constraints provide supervision in the form of must-link (ML) and cannot-link (CL) constraints: a must-link constraint specifies that a pair of instances must be assigned to the same cluster, while a cannot-link constraint specifies that the pair must be placed in different clusters.
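As an illustration, a set of pairwise constraints over point indices could be represented as follows. This is only a minimal sketch; the class and method names are our own and are not part of the paper:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical representation of must-link / cannot-link constraints over point indices.
final class PairwiseConstraints {
    // Each constraint is stored as an unordered pair encoded as the key "i:j" with i < j.
    private final Set<String> mustLink = new HashSet<>();
    private final Set<String> cannotLink = new HashSet<>();

    private static String key(int a, int b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    void addMustLink(int a, int b)   { mustLink.add(key(a, b)); }
    void addCannotLink(int a, int b) { cannotLink.add(key(a, b)); }

    boolean isMustLink(int a, int b)   { return mustLink.contains(key(a, b)); }
    boolean isCannotLink(int a, int b) { return cannotLink.contains(key(a, b)); }
}
```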

In this paper, we propose a semi-supervised clustering algorithm (called SemiDen) that discovers clusters of different densities and arbitrary shapes. The idea is to partition the dataset into different density levels and compute the density parameters for each density level set; pairwise constraints are then used to guide the expansion of clusters based on the computed density parameters. Evaluating the SemiDen algorithm on real datasets confirms that it gives better results than other semi-supervised and unsupervised density-based approaches. In summary, our contribution in this paper is the clustering of multi-density datasets with arbitrarily shaped clusters using pairwise constraints.

2 Clustering Multi-density Data

In this section, we propose a semi-supervised density-based clustering (SemiDen) algorithm that can find clusters of varying densities, shapes and sizes, even in the presence of noise and outliers. The proposed algorithm is divided into two main parts: (1) partitioning the dataset into different density levels; (2) using pairwise constraints for expanding the clustering process for each density level. We summarize our semi-supervised clustering (SemiDen) algorithm in Algorithm 1.

First, we describe how the dataset is partitioned into different density levels. The algorithm begins by finding the k-nearest neighbors of each point in the given dataset. Based on these neighbors, a local density function is used to estimate the density at each point, where the local density at point x is defined as the sum of the distances between x and its k-nearest neighbors, as shown in Eq. (1).

$$ DEN(x) = \sum_{i=1}^{k} D(x, y_i) $$
(1)

where \( D(x, y_i) \) is the Euclidean distance between point x and its i-th nearest neighbor \( y_i \), computed as in Eq. (2).

$$ D(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2} $$
(2)
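For concreteness, the following sketch computes Eqs. (1) and (2) by brute force, finding each point's k nearest neighbors by sorting all pairwise distances. The paper does not prescribe a particular k-NN implementation, so this O(n²) version and its class and method names are only illustrative:

```java
import java.util.Arrays;

final class LocalDensity {

    // Euclidean distance of Eq. (2).
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) {
            double d = x[j] - y[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // DEN(x) of Eq. (1): sum of distances from each point to its k nearest neighbors.
    static double[] densities(double[][] data, int k) {
        int n = data.length;
        double[] den = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) dists[idx++] = distance(data[i], data[j]);
            }
            Arrays.sort(dists);
            double sum = 0.0;
            for (int m = 0; m < k && m < dists.length; m++) sum += dists[m];
            den[i] = sum;
        }
        return den;
    }
}
```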

After computing the local density for each data point, we sort the points in ascending order of density and compute the density variation between every two adjacent points \( p_i \) and \( p_{i+1} \), denoted by \( DENVAR(p_i, p_{i+1}) \). This yields a density variation list (DVList), in which each element is the density variation between two adjacent points in the sorted order.

$$ DENVAR(p_i, p_{i+1}) = \frac{\left| DEN(p_{i+1}) - DEN(p_i) \right|}{DEN(p_i)} $$
(3)

For datasets with widely varying densities, there will be a few distinct jumps in the density variation, whereas for points within the same density level the variation remains small. Thus, we can obtain all density level sets by detecting these distinct jumps.
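A minimal sketch of this step, assuming the density array from the previous listing: the densities are sorted in ascending order and Eq. (3) is applied to each pair of adjacent values to obtain DVList (densities are assumed to be strictly positive so that the ratio is well defined):

```java
import java.util.Arrays;

final class DensityVariation {

    // DVList of Eq. (3): relative variation between adjacent densities
    // after sorting them in ascending order.
    static double[] dvList(double[] densities) {
        double[] sorted = densities.clone();
        Arrays.sort(sorted);
        double[] dv = new double[sorted.length - 1];
        for (int i = 0; i < dv.length; i++) {
            dv[i] = Math.abs(sorted[i + 1] - sorted[i]) / sorted[i];
        }
        return dv;
    }
}
```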

Definition 1 (Density Level Set). A density level set (DLS) consists of points whose densities are approximately the same; in other words, the density variations of the data points within the same DLS should be relatively small. Points \( p_i \) and \( p_j \) belong to the same DLS if they satisfy the following condition:

$$ p_i, p_j \in DLS_k \quad \text{if} \quad DENVAR(p_i, p_j) \le \tau $$

where \( \tau \) is a density variation threshold which divides a multi-density dataset into several density level sets.

We apply the partitioning method to DVList. Given a density variation threshold \( \tau \) (Definition 1), we remove from DVList all DENVAR values larger than \( \tau \); the points in the remaining separated segments are then treated as different density level sets. We compute \( \tau \) from the statistical characteristics of DVList as follows:

$$ \tau = E\left( {DVList} \right) + \sigma \left( {DVList} \right) $$
(4)

where E is the mathematical expectation (mean) and σ the standard deviation of DVList. In DVList, only a small number of elements have large DENVAR values, and it is these values that divide the dataset into different sets according to the threshold \( \tau \).
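The threshold of Eq. (4) and the segmentation of the density-sorted points into density level sets could be sketched as follows; a split is made wherever a DENVAR value exceeds \( \tau \), and the helper names are again our own, illustrative choices:

```java
import java.util.ArrayList;
import java.util.List;

final class DensityLevelSets {

    // tau of Eq. (4): mean plus standard deviation of DVList.
    static double threshold(double[] dvList) {
        double mean = 0.0;
        for (double v : dvList) mean += v;
        mean /= dvList.length;
        double var = 0.0;
        for (double v : dvList) var += (v - mean) * (v - mean);
        return mean + Math.sqrt(var / dvList.length);
    }

    // Split the indices 0..n-1 of the density-sorted points into segments:
    // a new density level set starts after every DENVAR value larger than tau.
    static List<List<Integer>> partition(double[] dvList, double tau) {
        List<List<Integer>> levels = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        current.add(0);
        for (int i = 0; i < dvList.length; i++) {
            if (dvList[i] > tau) {
                levels.add(current);
                current = new ArrayList<>();
            }
            current.add(i + 1);
        }
        levels.add(current);
        return levels;
    }
}
```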

After partitioning the dataset into different density level sets, we need to find representative values for the parameters Eps and MinPts. We initialize MinPts to k (the number of nearest neighbors) and identify the value of Eps for each density level. For a given density level set (DLS), simply choosing the maximum DEN value would overestimate the corresponding Eps, because some points may be border objects or noise and these points influence the Eps value. To deal with this problem, we compute \( Eps_i \) for \( DLS_i \) as follows:

$$ Eps_i = maxDEN(DLS_i) \cdot \sqrt{\frac{medianDEN(DLS_i)}{meanDEN(DLS_i)}} $$
(5)

where maxDEN, meanDEN, and medianDEN are the maximum, mean, and median density of \( DLS_i \), respectively.
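An illustrative sketch of Eq. (5), given the DEN values of the points in one density level set (the method name is an assumption, not taken from the paper):

```java
import java.util.Arrays;

final class EpsEstimator {

    // Eps_i of Eq. (5): maxDEN scaled by sqrt(medianDEN / meanDEN),
    // which damps the influence of border points and noise on the maximum.
    static double eps(double[] levelDensities) {
        double[] d = levelDensities.clone();
        Arrays.sort(d);
        double max = d[d.length - 1];
        double mean = 0.0;
        for (double v : d) mean += v;
        mean /= d.length;
        double median = (d.length % 2 == 1)
                ? d[d.length / 2]
                : (d[d.length / 2 - 1] + d[d.length / 2]) / 2.0;
        return max * Math.sqrt(median / mean);
    }
}
```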

Finally, having computed the Eps parameter for each density level and initialized MinPts to k, we use the pairwise constraints to expand the clustering process for each density level as follows (a sketch of these steps is given after the list):

  • In Step 11(a), we check whether a point belongs to a cluster or to the noise set. The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) must contain at least a minimum number of points (MinPts). We therefore compute the point's Eps-neighborhood; if it contains fewer than MinPts points, the point is added to the noise set.

  • In Step 11(b), we satisfy the must-link constraints. If the point participates in any must-link constraints, all points linked to it by must-link constraints are assigned to the current cluster, so that the must-link constraints are satisfied.

  • In Step 11(c), we satisfy the cannot-link constraints. Before adding a point p to the current cluster, we must ensure that doing so does not violate any cannot-link constraint. If there is a point q in the current cluster such that {p, q} ∈ CL, adding p would violate the constraint, so p is not assigned to the current cluster.
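The sketch below illustrates how the noise test of Step 11(a) and the constraint checks of Steps 11(b) and 11(c) might look in code. The constraint encoding, helper names, and cluster representation are our own assumptions and do not reproduce the paper's Algorithm 1:

```java
import java.util.List;
import java.util.Set;

final class ConstraintExpansion {

    // Constraints are assumed to be stored as sets of "i:j" keys with i < j
    // (one set for must-link, one for cannot-link); this encoding is our own choice.
    private static String key(int a, int b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    // Step 11(a): a point is noise if its Eps-neighborhood holds fewer than MinPts points.
    static boolean isNoise(List<Integer> epsNeighborhood, int minPts) {
        return epsNeighborhood.size() < minPts;
    }

    // Step 11(b): pull every point that is must-linked with p into the current cluster.
    static void enforceMustLink(int p, int numPoints, List<Integer> currentCluster,
                                Set<String> mustLink) {
        for (int q = 0; q < numPoints; q++) {
            if (q != p && mustLink.contains(key(p, q)) && !currentCluster.contains(q)) {
                currentCluster.add(q);
            }
        }
    }

    // Step 11(c): adding p violates a cannot-link constraint if some member q
    // of the current cluster forms a pair {p, q} in CL.
    static boolean violatesCannotLink(int p, List<Integer> currentCluster,
                                      Set<String> cannotLink) {
        for (int q : currentCluster) {
            if (cannotLink.contains(key(p, q))) return true;
        }
        return false;
    }
}
```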

3 Experimental Results

In this section, we present experimental results for the SemiDen algorithm on a variety of datasets, including synthetic datasets and several real-world datasets. We implemented our algorithm in Java; all experiments were run on a 2.4 GHz Intel Core 2 PC with 2 GB of main memory running Windows XP.

Besides the proposed algorithm, we also implemented the competing and baseline methods listed below for comparison.

  1. APSCAN: an unsupervised clustering algorithm that uses affinity propagation for clustering datasets with varying densities [2].

  2. HISSCLU: a hierarchical semi-supervised density-based clustering algorithm. HISSCLU uses the parameters ρ and ξ to establish borders between clusters when there are no clear cluster boundaries. To maximally preserve the original cluster structure, HISSCLU is recommended to be set up with ρ = 1.0 and ξ = 0.5 [3].

  3. C-DBSCAN: a density-based semi-supervised clustering algorithm, based on DBSCAN, for clustering datasets with arbitrary structures. C-DBSCAN depends on two specified parameters (Eps and MinPts); we set Eps = 0.5 and MinPts = 4 (the default values in DBSCAN) [4].

  4. SSDBSCAN: a semi-supervised density-based clustering algorithm that automatically finds density parameters for each natural cluster in a dataset [5].

The experiments were performed on datasets from the UCI repository (yeast, segment, digits-389, and magic). These datasets cover a good range of characteristics: the numbers of samples range from 1,484 to 19,020, the dimensionalities from 8 to 19, and the numbers of clusters from 2 to 10.

Figure 2 shows the NMI results over different numbers of pairwise constraints on the real datasets. It can be observed from Fig. 2 that our algorithm, SemiDen, generally performs better than the four other methods as the number of constraints increases (e.g., on yeast, segment, and magic).

We also notice that the constraint-based clustering algorithms generally outperform the traditional clustering algorithms. It can be seen from Fig. 2 that the performance of APSCAN is constant across all constraint counts on every dataset, since it is an unsupervised clustering algorithm. This tends to confirm the utility of constraint-based clustering over unsupervised approaches when expert knowledge is available.

To evaluate the efficiency of the clustering algorithms, we compare the average CPU time consumed by each semi-supervised clustering algorithm for different numbers of pairwise constraints, as shown in Fig. 3.

Fig. 2. Comparison of normalized mutual information over different numbers of pairwise constraints

Fig. 3. Comparison of execution time for the semi-supervised clustering algorithms