Parameter independent clustering based on dominant sets and cluster merging
Introduction
Data clustering refers to the task of grouping objects into clusters so that data in the same cluster are similar and those in different clusters are dissimilar. Popular clustering algorithms include k-means, DBSCAN [7], Fuzzy C-means [34], BIRCH, EM and CLIQUE [1], and some recent works on clustering include [5], [24], [25], [28], [38]. As an important unsupervised learning approach, clustering is widely used in pattern recognition, data mining and intelligent information processing, and is potentially useful in many other fields [35], [36].
In recent decades, graph based clustering has attracted much attention due to the great potential it has demonstrated in practice. By representing the data relationship with an edge-weighted graph and the corresponding pairwise similarity matrix, graph based clustering algorithms aim to partition the graph to obtain clusters. By exploiting the rich data distribution information captured in the pairwise similarity matrix, graph based algorithms have been shown to generate superior results in many applications. As one of the most well-known graph based clustering algorithms, the normalized cuts (NCuts) algorithm [30] has been widely used as a benchmark in data clustering and image segmentation. Spectral clustering utilizes the eigen-structure of the pairwise similarity matrix to perform dimension reduction, and then accomplishes the clustering with a simple algorithm, e.g., k-means, in the new data space of reduced dimension. The affinity propagation (AP) algorithm [3] iteratively passes affinity messages among the input data and gradually identifies cluster centers and members. The AP algorithm has been successfully applied to human face clustering and gene detection, among other tasks. Another graph based algorithm worth mentioning is the dominant sets (DSets) algorithm [27], which defines the dominant set as a graph-theoretic concept of a cluster and extracts the clusters (dominant sets) sequentially. The DSets algorithm has been shown to be effective in various tasks, including image segmentation [12], [14], object detection [33], human activity analysis [10] and object classification [13], [15].
From the review above we see that numerous clustering algorithms have been proposed, and some of them show impressive performance in clustering tasks. However, all of these algorithms require one or more input parameters, explicitly or implicitly, and their clustering results usually depend heavily on these parameters. The k-means algorithm must be given the number of clusters, which is not easy to determine in many cases. As spectral clustering algorithms usually adopt k-means as a step, they also require the number of clusters to be determined beforehand. While DBSCAN and AP are able to determine the number of clusters by themselves, DBSCAN requires as input a neighborhood radius and the minimum number of data in the neighborhood, and AP requires the preference values of the data to be specified; in both cases variations in the input parameters significantly influence the clustering results. Although the DSets algorithm uses only the pairwise similarity matrix as input and involves no explicit parameters, parameters may be introduced when the data for clustering are represented as feature vectors. Specifically, a commonly used similarity measure of two data items x and y is s(x, y) = exp(−d(x, y)/σ), where d(x, y) is the Euclidean distance between x and y and σ is a regularization parameter. With the same set of data, different σ's lead to different similarity matrices, which are found to result in different DSets clustering results. With these parameter dependent algorithms, a careful parameter tuning process is needed to obtain satisfactory clustering results, which makes them less attractive in practical applications.
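To illustrate this sensitivity, the following sketch (our own illustration, not code from the paper; the helper name is ours) builds the similarity matrix s(x, y) = exp(−d(x, y)/σ) for the same three points under two different σ's:

```python
import numpy as np

def similarity_matrix(X, sigma):
    """Pairwise similarities s(x, y) = exp(-d(x, y) / sigma),
    where d is the Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.exp(-d / sigma)

# The same three 2-D points, two different sigmas.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
S1 = similarity_matrix(X, sigma=1.0)
S2 = similarity_matrix(X, sigma=10.0)
# The two matrices differ, so a similarity-driven algorithm such as
# DSets may produce different clusterings for different sigmas.
print(np.allclose(S1, S2))  # False
```

The larger σ compresses all similarities toward 1, which is exactly why a similarity-based algorithm can behave very differently as σ varies.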
While the majority of existing clustering algorithms are parameter dependent, efforts to achieve parameter independence can be traced back to [19]. In this paper we present a parameter independent clustering algorithm built on the DSets algorithm and cluster merging. When the DSets algorithm is applied to cluster data represented as feature vectors, the parameter σ influences the pairwise similarity matrix directly, and thus the clustering results indirectly. In [11] the authors propose to transform the similarity matrix with histogram equalization, so that the new similarity matrix is no longer influenced by σ. With this transformation, the DSets algorithm generates almost identical clustering results with different σ's. However, the transformation is also found to produce overly small clusters. This problem is solved in [11] by expanding the clusters, but the expansion method involves user-specified parameters. In this paper we solve the small-cluster problem with a different, parameter independent approach. Specifically, we merge the overly small clusters to increase cluster size, with a merging method based on the relationship between intra-cluster and inter-cluster similarities. By exploiting the nice properties of the DSets algorithm, this parameter independent method is shown to solve the small-cluster problem effectively and to noticeably improve the clustering quality.
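A minimal rank-based sketch of such a histogram equalization transform (a simplified variant of our own, not necessarily the exact procedure of [11]; ties among similarity values are broken arbitrarily) could look like:

```python
import numpy as np

def equalize_similarities(S):
    """Histogram-equalize the off-diagonal similarity values: each value
    is replaced by its normalized rank, so the result depends only on
    the magnitude ordering of the original similarities."""
    S = S.astype(float).copy()
    iu = np.triu_indices_from(S, k=1)     # upper-triangle entries
    vals = S[iu]
    ranks = vals.argsort().argsort() + 1  # ranks 1..m by increasing value
    eq = ranks / len(vals)                # spread uniformly over (0, 1]
    S[iu] = eq
    S[(iu[1], iu[0])] = eq                # keep the matrix symmetric
    np.fill_diagonal(S, 0.0)              # DSets graphs have no self-loops
    return S
```

Because only ranks survive the transform, any σ that preserves the ordering of the similarities yields the same equalized matrix.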
A preliminary version of part of this work appeared in [17]. Compared with [17], the contributions of this paper are as follows. First, we explain in theory why the histogram equalization transformation of similarity matrices eliminates the influence of σ on DSets clustering results. Second, we show that in our algorithm, the similarity matrices obtained by histogram equalization perform no worse than those obtained from fixed σ's. Third, we present an internal criterion to evaluate clustering quality, which is shown to perform better than the existing Davies–Bouldin index and Dunn index. In addition, we validate both the major steps and the whole procedure of our algorithm with extensive experiments, which makes our conclusions more convincing.
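For reference, the Dunn index mentioned above is a standard internal criterion: the smallest between-cluster distance divided by the largest within-cluster diameter. A minimal sketch (the function name is ours; singleton clusters with zero diameter and single-cluster partitions are not handled):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: smallest pointwise distance between two different
    clusters, divided by the largest within-cluster diameter.
    Higher values indicate better-separated, more compact clusters."""
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest diameter over all clusters.
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=2).max()
               for c in clusters)
    # Smallest distance between points of any two different clusters.
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=2).min()
              for i, a in enumerate(clusters)
              for b in clusters[i + 1:])
    return sep / diam
```

The Davies–Bouldin index is defined analogously from cluster scatters and centroid distances, with lower values indicating better clusterings.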
The rest of this paper is organized as follows. The concept and properties of the DSets algorithm are presented briefly in Section 2. Then we analyze the problems of the DSets algorithm and present our cluster merging method in Section 3. Section 4 reports the experimental results of the proposed algorithm and finally Section 6 concludes this paper.
Section snippets
Dominant sets
Since our algorithm is based in part on the DSets algorithm, in this section we briefly introduce the concept of a dominant set and the properties of the algorithm. More details can be found in [26], [27].
As the dominant set is a graph-theoretic concept of a cluster, we represent the n data to be clustered with an undirected edge-weighted graph G = (V, E, w) without self-loops, where V is the vertex set containing all the data, E denotes the edge set consisting of the edges between all pairs of vertices, and w assigns each edge the similarity between its two endpoints.
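In the literature [26], [27], a dominant set is typically extracted with discrete replicator dynamics started from the barycenter of the simplex; the support of the converged weight vector is the dominant set. The following is our own simplified sketch of this idea, not the paper's implementation (the convergence thresholds are illustrative choices):

```python
import numpy as np

def extract_dominant_set(A, tol=1e-8, max_iter=2000):
    """Extract one dominant set from similarity matrix A (zero diagonal)
    with discrete replicator dynamics:
        x_i <- x_i * (A x)_i / (x^T A x)
    started from the barycenter of the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)
        if np.linalg.norm(x_new - x, 1) < tol:
            x = x_new
            break
        x = x_new
    return np.flatnonzero(x > 1e-5)  # indices with non-negligible weight

# Three mutually similar vertices plus one weakly attached vertex:
# the dynamics concentrate all weight on the tight group {0, 1, 2}.
A = np.array([[0.0, 0.9, 0.9, 0.1],
              [0.9, 0.0, 0.9, 0.1],
              [0.9, 0.9, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
print(extract_dominant_set(A))  # [0 1 2]
```

The full DSets algorithm then removes the extracted vertices and repeats on the remaining graph, which is how the clusters are obtained sequentially.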
Problems of the DSets algorithm
As a graph based clustering approach, the DSets algorithm uses the pairwise similarity matrix as input. If the data for clustering are already represented in the form of a pairwise similarity matrix, then the DSets clustering process is totally parameter independent. When the data are instead represented as feature vectors, we need to evaluate the pairwise data similarities in order to build the similarity matrix. With the commonly used similarity measure s(x, y) = exp(−d(x, y)/σ), the similarity matrix, and hence the clustering result, depends on the parameter σ.
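One parameter-free way to relate intra-cluster and inter-cluster similarities, in the spirit of the merging rule described in the introduction, is sketched below. This is a simplified stand-in of our own devising, not the paper's exact criterion; the function name and the specific merge test are illustrative:

```python
import numpy as np

def maybe_merge(S, c1, c2):
    """Merge test for two clusters (index lists) under similarity
    matrix S: merge when the mean inter-cluster similarity is at
    least the smaller of the two mean intra-cluster similarities.
    No user-specified threshold is involved."""
    def mean_intra(c):
        if len(c) < 2:          # a singleton has no intra-cluster pairs
            return 0.0
        block = S[np.ix_(c, c)]
        n = len(c)
        return (block.sum() - np.trace(block)) / (n * (n - 1))
    inter = S[np.ix_(c1, c2)].mean()
    return inter >= min(mean_intra(c1), mean_intra(c2))
```

Note that singleton clusters always pass the test here, which matches the intent of absorbing overly small clusters into larger ones.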
Parameter independence
We have shown that DSets-histeq is independent of the parameter σ. Since we add a cluster merging step to improve the clustering results of DSets-histeq, we first test whether our algorithm is still parameter independent. Taking the Jain dataset as an example, we show the clustering results of our algorithm with different σ's in Fig. 8, where it is evident that the four clustering results are very similar. We further report the clustering results evaluated by the F-measure and the Jaccard index on the
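The two external criteria mentioned above can be computed by pair counting: a pair of points is a true positive when it shares a cluster in both the ground truth and the predicted partition. A minimal sketch (the function name is ours):

```python
from itertools import combinations

def pairwise_scores(labels_true, labels_pred):
    """Pair-counting F-measure and Jaccard index between a predicted
    clustering and the ground truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_t = labels_true[i] == labels_true[j]
        same_p = labels_pred[i] == labels_pred[j]
        tp += same_t and same_p        # together in both partitions
        fp += (not same_t) and same_p  # together only in the prediction
        fn += same_t and (not same_p)  # together only in the ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return f, jaccard
```

Both scores equal 1 for a perfect clustering and decrease as pairs are split or wrongly merged.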
Discussion
In this paper we present a parameter independent clustering algorithm based on the DSets algorithm and cluster merging. As parameter independence is a rather strong claim, we discuss our algorithm in a little more detail.
In our algorithm the Euclidean distance is used to evaluate the data relationship and build the similarity matrix. After histogram equalization, the similarity matrix, and thus the clustering result, is influenced only by the magnitude ordering of the original similarity values.
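The key observation behind this claim can be checked numerically: since exp(−d/σ) is strictly decreasing in d for every σ > 0, all σ's induce the same magnitude ordering of the similarities, so any rank-based transform yields the same matrix. A small demonstration (our own illustration):

```python
import numpy as np

# exp(-d / sigma) is strictly decreasing in d for every sigma > 0,
# so the ordering of the similarities is the reverse of the ordering
# of the distances, independently of sigma.
rng = np.random.default_rng(0)
d = rng.random(10)                     # some pairwise distances
order1 = np.argsort(np.exp(-d / 0.5))  # ordering with sigma = 0.5
order2 = np.argsort(np.exp(-d / 20.0)) # ordering with sigma = 20
print(np.array_equal(order1, order2))  # True: identical orderings
```

Hence the equalized similarity matrix, and with it the clustering result, is the same for every choice of σ.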
Conclusion
In this paper we present a parameter independent clustering algorithm based on the dominant sets algorithm and cluster merging. Observing that the dominant sets algorithm is sensitive to the similarity parameter σ, we theoretically show that by transforming the similarity matrices with histogram equalization, the influence of σ on clustering results can be eliminated. In order to deal with the small clusters resulting from histogram equalization, we present a cluster merging method to improve the clustering quality.
Acknowledgment
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61473045.
References (38)
- et al., Robust path-based spectral clustering, Pattern Recognit. (2008)
- et al., Looking for natural patterns in data: part 1. Density-based approach, Chemometrics Intell. Lab. Syst. (2001)
- et al., A novel sequence representation for unsupervised analysis of human activities, Artif. Intell. (2009)
- et al., Towards parameter-independent data clustering and image segmentation, Pattern Recognit. (2016)
- et al., A simple feature combination method based on dominant sets, Pattern Recognit. (2013)
- et al., Graph-based quadratic optimization: a fast evolutionary approach, Comput. Vision Image Understanding (2011)
- et al., Contour-based object detection as dominant set computation, Pattern Recognit. (2012)
- et al., A benchmark for interactive image segmentation algorithms, IEEE Workshop on Person-Oriented Vision (2011)
- et al., Automatic subspace clustering of high dimensional data, International Conference on Knowledge Discovery and Data Mining (2005)
- et al., On the importance of sorting in "neural gas" training of vector quantizers, International Conference on Neural Networks (1997)
- Clustering by passing messages between data points, Science
- Power watersheds: a new image segmentation framework extending graph cuts, random walker and optimal spanning forest, IEEE International Conference on Computer Vision
- A density-based algorithm for discovering clusters in large spatial databases with noise, International Conference on Knowledge Discovery and Data Mining
- FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinf.
- Clustering aggregation, ACM Trans. Knowl. Discov. Data
- DSet++: a robust clustering algorithm, International Conference on Image Processing
- DSets-DBSCAN: a parameter-free clustering algorithm, IEEE Trans. Image Process.
- Feature combination and the kNN framework in object classification, IEEE Trans. Neural Netw. Learn. Syst.
- Experimental study on dominant sets clustering, IET Comput. Vision