Information Fusion
Volume 7, Issue 3, September 2006, Pages 264-275

Moderate diversity for better cluster ensembles

https://doi.org/10.1016/j.inffus.2005.01.008

Abstract

The adjusted Rand index is used to measure diversity in cluster ensembles, and a diversity measure is subsequently proposed. Although the measure was found to be related to the quality of the ensemble, the relationship appeared to be non-monotonic: in some cases, ensembles exhibiting a moderate level of diversity gave a more accurate clustering. Based on this, a procedure for building a cluster ensemble of a chosen type is proposed (assuming that the ensemble relies on one or more random parameters): generate a small random population of cluster ensembles, calculate the diversity of each ensemble, and select the ensemble corresponding to the median diversity. We demonstrate the advantages of both our measure and procedure on 5 data sets and carry out statistical comparisons involving two diversity measures for cluster ensembles from the recent literature. An experiment with 9 data sets was also carried out to examine how the diversity-based selection procedure fares on ensembles of various sizes. In these experiments the classification accuracy was used as the performance criterion. The results suggest that selection by median diversity is no worse, and in some cases better, than building and holding on to a single ensemble.

Introduction

Cluster ensembles emerged recently as a coherent stream out of the multiple classifier systems area [12], [27], [28], [8], [9], [10], [6], [23], [1], [14], [11]. They are deemed to be better than single clustering algorithms for discovering complex or noisy structures in the data. The strongest argument in favour of cluster ensembles is as follows. It is known that the current off-the-shelf clustering methods may suggest very different structures in the same data, a result of the different clustering criteria being optimized. There is no layman's guide to choosing a clustering method for a given data set, so an inexperienced user runs the risk of picking an inappropriate method. Since there is no ground truth against which the result can be matched, there is no way to critique the user's choice. Cluster ensembles provide a more universal solution in that various structures and shapes of clusters present in the data may be discovered by the same ensemble method, and the solution is less dependent upon the chosen ensemble type [27].

Let Z be a data set and let P = {P1, P2, …, PL} be a set of partitions of Z. Each partition is obtained by applying a clustering algorithm to Z or to a subset of it. We assume that the partitions are generated by varying a random parameter of the clustering algorithm, for example starting the algorithm from L random initializations. The clustering algorithm (or run) which produces Pi will be called here an “ensemble member” or “clusterer”. The clusterers may be versions of the same clustering algorithm or different clustering algorithms. For simplicity, the same notation, Pi, will be used both for the clusterer and for the corresponding partition. The goal is to find a single (resultant) partition, P*, based on the information contained in the set P.

The “accuracy” of a clustering algorithm (or a cluster ensemble) is measured by the match between the partition produced and some known ground-truth partition. A reliable ground-truth partition is seldom available, so most experimental studies employ generated data with a pre-specified cluster structure. From the many matching indices suggested in the literature [4], [5], [16], [26], we chose the adjusted Rand index [16] because of the following properties: (1) it has a fixed value of 0 if the two compared partitions are formed independently of one another; (2) in our preliminary experiments, this index showed a greater sensitivity than other indices in picking out good partitions.

Diversity within an ensemble is of vital importance for its success: an ensemble of identical clusterers or classifiers will not outperform the individual ensemble members. However, finding a sensible quantitative measure of diversity in classifier ensembles has been notoriously hard [19], [20], [21]. Here we consider diversity in cluster ensembles. A diversity measure is proposed and its relationship with the accuracy of the ensemble is demonstrated. Based on the results, a procedure is suggested for selecting a cluster ensemble from a small population of ensembles. Both the proposed diversity measure and the match index for the ensemble accuracy are based on the adjusted Rand index.
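To make the selection procedure concrete, the following is a minimal Python sketch. It assumes scikit-learn's k-means and adjusted Rand implementations, homogeneous ensembles built from random initializations, and an averaged pairwise disagreement (the mean of 1 − ar(Pi, Pj) over all pairs) as the diversity measure; the function names, population size, and defaults are our illustration, not the authors' code.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def make_ensemble(Z, L=25, c=20, rng=None):
    """Homogeneous ensemble: L k-means runs from random initializations."""
    rng = np.random.default_rng(rng)
    return [KMeans(n_clusters=c, n_init=1,
                   random_state=int(rng.integers(1 << 31))).fit_predict(Z)
            for _ in range(L)]

def pairwise_diversity(partitions):
    """Averaged pairwise disagreement: mean of 1 - ar(Pi, Pj) over all pairs."""
    return np.mean([1.0 - adjusted_rand_score(a, b)
                    for a, b in combinations(partitions, 2)])

def select_by_median_diversity(Z, n_ensembles=10, **kw):
    """Generate a small random population of ensembles and keep the one
    whose diversity is the median of the population."""
    ensembles = [make_ensemble(Z, **kw) for _ in range(n_ensembles)]
    diversities = [pairwise_diversity(e) for e in ensembles]
    median_idx = int(np.argsort(diversities)[len(diversities) // 2])
    return ensembles[median_idx]
```

The selected ensemble would then be aggregated into a single partition by any consensus method; the sketch deliberately leaves that step out.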

The rest of the paper is organized as follows. Section 2 introduces cluster ensembles. Section 3 contains the proposed diversity measure together with some results on its relationship with the ensemble accuracy. At the end of this section we list the steps of our proposed methodology for selecting a cluster ensemble from a small population. Section 4 offers the results from a statistical comparison of the proposed diversity measure with two other measures due to Fern and Brodley [6] and Greene et al. [13]. Section 5 contains an experiment with 9 data sets looking into the relationship between the performance of the proposed selection method and the ensemble size. Section 6 concludes the study.


Cluster ensembles

There are various ways to build a cluster ensemble:

  • Use different subsets of features (overlapping or disjoint), called feature-distributed clustering in [13], [27], [28].

  • Use different clustering algorithms within the ensemble [15]. Such ensembles are called heterogeneous or hybrid. Ensembles with the same clustering method obtained by varying a random parameter will be called homogeneous.

  • Vary a random parameter of the clustering algorithm. For example, run the k-means clustering method from different random initializations, as in the sketch below.
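As an illustration of the last option, here is a hedged sketch (scikit-learn again, our choice of library) of a homogeneous k-means ensemble in which the random parameter is the number of clusters, drawn independently for each member; the range 2 to 22 mirrors the experimental set-up described later.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_c_ensemble(Z, L=25, c_range=(2, 22), seed=None):
    """Each member runs k-means with its own randomly drawn number of clusters."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(L):
        c = int(rng.integers(c_range[0], c_range[1] + 1))  # c drawn from 2..22
        km = KMeans(n_clusters=c, n_init=1,
                    random_state=int(rng.integers(1 << 31)))
        partitions.append(km.fit_predict(Z))
    return partitions
```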

Diversity measures for cluster ensembles

The adjusted Rand index, needed for both the diversity and the accuracy of the ensemble, is calculated as follows [16]. Let A and B be two partitions of a data set Z with N objects. Let A have $c_A$ clusters and B have $c_B$ clusters. Denote by

  • $N_{ij}$ the number of objects in cluster i in partition A and in cluster j in partition B;

  • $N_{i\cdot}$ the number of objects in cluster i in partition A;

  • $N_{\cdot j}$ the number of objects in cluster j in partition B.

The adjusted Rand index is
$$t_1=\sum_{i=1}^{c_A}\binom{N_{i\cdot}}{2},\qquad t_2=\sum_{j=1}^{c_B}\binom{N_{\cdot j}}{2},\qquad t_3=\frac{2\,t_1\,t_2}{N(N-1)},$$
$$\mathrm{ar}(A,B)=\frac{\sum_{i=1}^{c_A}\sum_{j=1}^{c_B}\binom{N_{ij}}{2}-t_3}{\frac{1}{2}(t_1+t_2)-t_3}.$$
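A direct transcription of these formulas into Python (our own sketch: the contingency table $N_{ij}$ is built with numpy, and the binomial coefficients come from scipy) is given below; for two label vectors it should agree with sklearn.metrics.adjusted_rand_score up to floating-point error.

```python
import numpy as np
from scipy.special import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index ar(A, B) computed from the contingency table N_ij."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    n_ij = np.zeros((len(a_vals), len(b_vals)), dtype=np.int64)
    np.add.at(n_ij, (a_idx, b_idx), 1)      # contingency table N_ij
    n = n_ij.sum()
    t1 = comb(n_ij.sum(axis=1), 2).sum()    # pairs within clusters of A
    t2 = comb(n_ij.sum(axis=0), 2).sum()    # pairs within clusters of B
    t3 = 2.0 * t1 * t2 / (n * (n - 1))      # chance-correction term
    return (comb(n_ij, 2).sum() - t3) / (0.5 * (t1 + t2) - t3)
```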

Experiments

Seven types of homogeneous ensembles were constructed as summarized in Table 2. The two most common types of clusterers were used: k-means and the mean link method (also called average link or average linkage). All ensembles consisted of L = 25 clusterers. The parameters that we varied were:

  • the number of overproduced clusters, c: the value was either fixed at c = 20 or chosen randomly for each ensemble member in the range from 2 to 22;

  • the initialization of k-means (for the k-means-based ensembles);

  • the

Relationship between diversity-selection procedure and the ensemble size

Our final set of experiments seeks to find out how the proposed selection methods behave for various ensemble sizes. The following set-up was used:

  • The ensemble method with the best performance among the studied ensembles was employed.

  • To make the results easier to interpret, the classification accuracy is used as the performance criterion. The classification accuracy is calculated as the proportion of correctly labeled objects, where each cluster is labeled with the class most represented among its objects.
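A hedged sketch of this accuracy calculation (our own code, numpy only; the majority-class relabelling is the standard construction implied above):

```python
import numpy as np

def cluster_accuracy(cluster_labels, true_labels):
    """Label each cluster with its majority true class;
    return the proportion of correctly labeled objects."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    predicted = np.empty_like(true_labels)
    for c in np.unique(cluster_labels):
        members = cluster_labels == c
        classes, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = classes[np.argmax(counts)]  # majority class in cluster c
    return float(np.mean(predicted == true_labels))
```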

Conclusions

Since diversity in classifier and cluster ensembles is a loosely defined concept, there are many ways to specify and measure it. Four indices are proposed here for estimating diversity in cluster ensembles. They are based on an observation in our previous studies [18] that an averaged disagreement measure alone is insufficient. The results in this study support selecting the ensemble with medium diversity from a randomly generated set of ensembles. Two averaged measures of disagreement for

Acknowledgements

This work was supported by research grant # 15035 under the European Joint Project scheme, Royal Society, UK.

References (32)

  • V. Di Gesu, Integrated fuzzy clustering, Fuzzy Sets and Systems (1994)
  • L.I. Kuncheva, Diversity in multiple classifier systems (Editorial), Information Fusion (2005)
  • E. Pekalska et al., Dissimilarity representations allow for building good classifiers, Pattern Recognition Letters (2002)
  • H. Ayad, M. Kamel, Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors, in: T....
  • R.O. Duda et al., Pattern Classification (2001)
  • S. Dudoit et al., Bagging to improve the accuracy of a clustering procedure, Bioinformatics (2003)
  • A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data, in: Proc....
  • E.B. Fowlkes et al., A method for comparing two hierarchical clusterings, Journal of the American Statistical Association (1983)
  • X.Z. Fern, C.E. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, in: Proc....
  • B. Fischer et al., Bagging for path-based clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
  • A. Fred, Finding consistent clusters in data partitions
  • A. Fred, A.K. Jain, Data clustering using evidence accumulation, in: Proc. 16th International Conference on Pattern...
  • A.L.N. Fred, A.K. Jain, Robust data clustering, in: Proc. IEEE Computer Society Conference on Computer Vision and...
  • J. Ghosh, Multiclassifier systems: back to the future
  • D. Greene, A. Tsymbal, N. Bolshakova, P. Cunningham, Ensemble clustering in medical diagnostics, in: R. Long et al....
  • K. Hornik, Cluster ensembles....