Information Sciences

Volume 405, September 2017, Pages 1-17

Parameter independent clustering based on dominant sets and cluster merging

https://doi.org/10.1016/j.ins.2017.04.006

Abstract

Clustering is an important unsupervised learning approach with wide application in data mining, pattern recognition and intelligent information processing. However, existing clustering algorithms usually involve one or more user-specified parameters as input, and their clustering results depend heavily on these parameters. To solve this problem, we present a parameter independent clustering algorithm based on the dominant sets algorithm and cluster merging. In the first step, a histogram equalization transformation is applied to solve the parameter dependence problem of the dominant sets algorithm. We provide the theoretical foundation of this method and discuss the implementation details. The clustering result is then refined with a cluster merging method, which is based on a new clustering quality evaluation criterion. We use extensive experiments on several datasets to validate each step and the whole procedure of our algorithm. It is shown that our parameter independent algorithm performs comparably to some existing clustering algorithms which benefit from user-specified parameters.

Introduction

Data clustering refers to the task of grouping objects into clusters so that the data in the same cluster are similar and those in different clusters are dissimilar. Popular clustering algorithms include k-means, DBSCAN [7], Fuzzy C-means [34], BIRCH, EM and CLIQUE [1], and some recent works on clustering include [5], [24], [25], [28], [38]. As an important unsupervised learning approach, clustering is widely used in pattern recognition, data mining and intelligent information processing, and is potentially useful in other fields [35], [36].

In recent decades, graph based clustering has attracted much attention due to its great potential demonstrated in practice. By representing the data relationship with an edge-weighted graph and the corresponding pairwise similarity matrix, graph based clustering algorithms aim to partition the graph to obtain clusters. By making use of the rich data distribution information captured in the pairwise similarity matrix, graph based algorithms have been shown to generate superior results in many applications. As one of the most well-known graph based clustering algorithms, the normalized cuts (NCuts) algorithm [30] has been widely used as a benchmark in data clustering and image segmentation. Spectral clustering utilizes the eigen-structure of the pairwise similarity matrix to perform dimension reduction, and then accomplishes the clustering with a simple algorithm, e.g., k-means, in the new data space of reduced dimension. The affinity propagation (AP) algorithm [3] passes affinity messages among the input data iteratively and gradually identifies the cluster centers and members. The AP algorithm has been applied successfully to human face clustering and gene detection, among other problems. Another graph based algorithm worth mentioning is the dominant sets (DSets) algorithm [27]. The DSets algorithm defines the dominant set as a graph-theoretic concept of a cluster and extracts the clusters (dominant sets) in a sequential manner. The DSets algorithm has been shown to be effective in various tasks including image segmentation [12], [14], object detection [33], human activity analysis [10] and object classification [13], [15].

From the review above we see that numerous clustering algorithms have been proposed, and some of them show impressive performance in clustering tasks. However, all of these algorithms require one or more input parameters, explicitly or implicitly, and their clustering results usually depend heavily on these parameters. The k-means algorithm must be given the number of clusters, which is not easy to determine in many cases. As spectral clustering algorithms usually adopt k-means as a step, these algorithms also require the number of clusters to be determined beforehand. While DBSCAN and AP are able to determine the number of clusters by themselves, DBSCAN requires as input a neighborhood radius and the minimum number of data in the neighborhood, and AP requires the preference values of the data to be specified. In both cases the variation of the input parameters has a significant influence on the clustering results. Although the DSets algorithm uses only the pairwise similarity matrix as input and no parameters are involved explicitly, parameters may be introduced in the case that the data for clustering are represented as feature vectors. Specifically, a commonly used similarity measure for two data items x and y is s(x,y)=exp(−d(x,y)/σ), where d(x, y) is the Euclidean distance and σ is a regularization parameter. With the same set of data, different σ's lead to different similarity matrices, which are found to result in different DSets clustering results. With these parameter dependent algorithms, we need a careful parameter tuning process in order to obtain satisfactory clustering results. This makes these algorithms less attractive in practical applications.
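To make the σ dependence concrete, here is a minimal sketch (ours, not code from the paper) that builds the similarity matrix for a handful of random points under several σ's; the resulting matrices differ markedly, which is what makes the downstream DSets clustering parameter dependent.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # five 2-D points, purely illustrative
D = squareform(pdist(X))           # Euclidean distance matrix d(x, y)

for sigma in (0.5, 1.0, 10.0):
    S = np.exp(-D / sigma)         # similarity matrix for this sigma
    np.fill_diagonal(S, 0.0)       # no self-loops in the DSets graph
    print(f"sigma = {sigma}:\n{S.round(3)}")
```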

While the majority of existing clustering algorithms are parameter dependent, efforts to achieve parameter independence can be traced back to [19]. In this paper we present a parameter independent clustering algorithm on the basis of the DSets algorithm and cluster merging. In applying the DSets algorithm to cluster data represented as feature vectors, the parameter σ influences the pairwise similarity matrix directly, and thus the clustering results indirectly. In [11] the authors propose to transform the similarity matrix with histogram equalization, so that the new similarity matrix is no longer influenced by σ. With this transformation, the DSets algorithm generates almost identical clustering results with different σ's. However, this transformation is also found to result in overly small clusters. This problem is solved in [11] by expanding the clusters, where the expansion method involves user-specified parameters. In this paper we solve the small-cluster problem with a different approach which is independent of parameters. Specifically, we merge the overly small clusters to increase cluster size, and the cluster merging method is based on the relationship between intra-cluster and inter-cluster similarities, as sketched below. By making use of the nice properties of the DSets algorithm, this parameter independent method is shown to solve the small-cluster problem effectively and to improve the clustering quality noticeably.
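As a rough illustration only: the paper's merging criterion is developed in Section 3, so the min-based merging condition below, and both helper functions, are our assumptions rather than the paper's rule. The sketch merges two clusters when their mean inter-cluster similarity reaches the smaller of their mean intra-cluster similarities.

```python
import numpy as np

def intra_sim(S, c):
    """Mean pairwise similarity inside cluster c (a list of data indices)."""
    if len(c) < 2:
        return 0.0   # toy convention: singletons always qualify for merging
    block = S[np.ix_(c, c)]
    return (block.sum() - np.trace(block)) / (len(c) * (len(c) - 1))

def inter_sim(S, c1, c2):
    """Mean similarity between the members of clusters c1 and c2."""
    return S[np.ix_(c1, c2)].mean()

def merge_small_clusters(S, clusters):
    """Greedily merge a pair of clusters whenever their inter-cluster
    similarity reaches the smaller intra-cluster similarity; repeat
    until no such pair remains."""
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                if inter_sim(S, ci, cj) >= min(intra_sim(S, ci), intra_sim(S, cj)):
                    clusters[i] = ci + cj
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```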

A preliminary version of part of this work appears in [17]. Compared with [17], the contributions of this paper are as follows. First, we discuss in theory why the histogram equalization transformation of similarity matrices is able to eliminate the influence of σ on DSets clustering results. Second, we show that in our algorithm, the similarity matrices obtained by histogram equalization transformation perform no worse than those obtained from fixed σ's. Third, we present an internal criterion to evaluate clustering quality, which is shown to perform better than the existing Davies–Bouldin index and Dunn index. In addition, we validate both the major steps and the whole procedure of our algorithm with extensive experiments, which makes our conclusions more convincing.
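For context on that comparison, the following sketch evaluates a clustering with the two existing indices mentioned above: the Davies–Bouldin index via scikit-learn's davies_bouldin_score, and the Dunn index implemented from its usual definition (minimum inter-cluster distance over maximum cluster diameter). The paper's new internal criterion is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance over maximum cluster
    diameter; larger values indicate compact, well-separated clusters."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    diameter = max(cdist(c, c).max() for c in clusters)
    separation = min(cdist(a, b).min()
                     for i, a in enumerate(clusters)
                     for b in clusters[i + 1:])
    return separation / diameter

# Usage, with features X (ndarray) and cluster labels y_pred (ndarray):
#   db = davies_bouldin_score(X, y_pred)   # lower is better
#   dn = dunn_index(X, y_pred)             # higher is better
```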

The rest of this paper is organized as follows. The concept and properties of the DSets algorithm are presented briefly in Section 2. We then analyze the problems of the DSets algorithm and present our cluster merging method in Section 3. Section 4 reports the experimental results of the proposed algorithm, Section 5 provides further discussion, and finally Section 6 concludes this paper.

Dominant sets

Since our algorithm is based in part on the DSets algorithm, in this section we briefly introduce the concept of a dominant set and the properties of the algorithm. More details can be found in [26], [27].

As the dominant set is a graph-theoretic concept of a cluster, we represent the n data to be clustered with an undirected edge-weighted graph G=(V,E,w) without self-loops, where V is the vertex set containing all the data, E denotes the edge set consisting of the edges between all pairs of vertices, and w assigns to each edge a weight equal to the similarity between the two corresponding data items.
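Dominant sets are extracted by running replicator dynamics on the similarity matrix, as described in [26], [27]. A minimal sketch of one extraction follows; the convergence tolerance and support threshold are chosen arbitrarily for illustration.

```python
import numpy as np

def extract_dominant_set(A, tol=1e-6, max_iter=1000, support_eps=1e-5):
    """Run discrete-time replicator dynamics on the similarity matrix A
    (symmetric, non-negative, zero diagonal) from the simplex barycenter,
    and return the support of the converged vector as the dominant set."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        x_next = x * Ax / (x @ Ax)   # multiplicative update, stays on the simplex
        if np.abs(x_next - x).sum() < tol:
            x = x_next
            break
        x = x_next
    return np.flatnonzero(x > support_eps)

# Sequential peeling: extract a dominant set, drop its rows and columns
# from A, and repeat on the remainder to obtain the next cluster.
```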

Problems of the DSets algorithm

As a graph based clustering approach, the DSets algorithm uses the pairwise similarity matrix as input. If the data for clustering are given directly in the form of a pairwise similarity matrix, then the DSets clustering process is totally parameter independent. In the case that the data are represented as feature vectors, we need to evaluate the pairwise data similarities in order to build the similarity matrix. With the commonly used data similarity measure s(x,y)=exp(−d(x,y)/σ), the resulting similarity matrix, and hence the clustering result, varies with the parameter σ.
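A minimal sketch of one plausible rank-based equalization is given below; the exact transformation used in [11] and in this paper may differ in its details, but any transform that depends only on the ordering of the similarity values has the σ-independence property discussed later.

```python
import numpy as np
from scipy.stats import rankdata

def equalize(S):
    """Replace each off-diagonal similarity with a value determined only
    by its rank among all off-diagonal similarities, mapped into (0, 1]."""
    iu = np.triu_indices_from(S, k=1)
    ranks = rankdata(S[iu])        # average ranks, ties handled gracefully
    T = np.zeros_like(S)
    T[iu] = ranks / ranks.max()
    return T + T.T                 # symmetric, zero diagonal as before
```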

Parameter independence

We have shown that DSets-histeq is independent of the parameter σ. Since we add a cluster merging step to improve the clustering results of DSets-histeq, we first test whether our algorithm is still parameter independent. Taking the Jain dataset as an example, we show the clustering results of our algorithm with different σ's in Fig. 8, where it is evident that the four clustering results are very similar. We further report the clustering results evaluated by F-measure and Jaccard index on the other datasets.
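Since only the names of the external indices appear in this excerpt, the following sketch uses the common pairwise-counting definitions of the F-measure and the Jaccard index; whether the paper uses these exact variants is an assumption.

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Count data pairs that are together in the ground truth and/or
    in the predicted clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        tp += same_true and same_pred
        fp += (not same_true) and same_pred
        fn += same_true and (not same_pred)
    return tp, fp, fn

def f_measure(labels_true, labels_pred):
    tp, fp, fn = pair_counts(labels_true, labels_pred)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def jaccard(labels_true, labels_pred):
    tp, fp, fn = pair_counts(labels_true, labels_pred)
    return tp / (tp + fp + fn)
```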

Discussion

In this paper we present a parameter independent clustering algorithm based on the DSets algorithm and cluster merging. As parameter independence is a rather strong claim, we discuss our algorithm in a little more detail.

In our algorithm Euclidean distance is used to evaluate the data relationship and build the similarity matrix. After histogram equalization, the similarity matrix, and hence the clustering result, are influenced only by the magnitude ordering of the original similarity values.
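A tiny self-contained check of this claim, assuming a rank-based equalization like the one sketched earlier: two very different σ's produce different raw similarity matrices but identical equalized ones, because exp(−d/σ) is strictly decreasing in d for any σ > 0 and the equalization depends only on the ranks.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

def equalize(S):
    iu = np.triu_indices_from(S, k=1)
    ranks = rankdata(S[iu])
    T = np.zeros_like(S)
    T[iu] = ranks / ranks.max()
    return T + T.T

rng = np.random.default_rng(1)
D = squareform(pdist(rng.normal(size=(6, 2))))
S1, S2 = np.exp(-D / 0.5), np.exp(-D / 5.0)     # two very different sigmas
assert not np.allclose(S1, S2)                  # raw similarities differ...
assert np.allclose(equalize(S1), equalize(S2))  # ...equalized ones coincide
```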

Conclusion

In this paper we present a parameter independent clustering algorithm based on the dominant sets algorithm and cluster merging. Observing that the dominant sets algorithm is sensitive to the similarity parameter σ, we theoretically show that by transforming the similarity matrices with histogram equalization, the influence of σ on clustering results can be eliminated. In order to deal with the small clusters resulting from histogram equalization, we present a cluster merging method to improve the clustering results.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61473045.

References (38)

  • B.J. Frey et al., Clustering by passing messages between data points, Science (2007)
  • C. Couprie et al., Power watersheds: a new image segmentation framework extending graph cuts, random walker and optimal spanning forest, IEEE International Conference on Computer Vision (2009)
  • M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, International Conference on Knowledge Discovery and Data Mining (1996)
  • L. Fu et al., FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinf. (2007)
  • A. Gionis et al., Clustering aggregation, ACM Trans. Knowl. Discov. Data (2007)
  • J. Hou et al., DSet++: a robust clustering algorithm, International Conference on Image Processing (2013)
  • J. Hou et al., DSets-DBSCAN: a parameter-free clustering algorithm, IEEE Trans. Image Process. (2016)
  • J. Hou et al., Feature combination and the kNN framework in object classification, IEEE Trans. Neural Netw. Learn. Syst. (2016)
  • J. Hou et al., Experimental study on dominant sets clustering, IET Comput. Vision (2015)