Information Fusion
Volume 7, Issue 3, September 2006, Pages 264-275

Moderate diversity for better cluster ensembles

https://doi.org/10.1016/j.inffus.2005.01.008

Abstract

The adjusted Rand index is used to measure diversity in cluster ensembles, and a diversity measure is subsequently proposed. Although the measure was found to be related to the quality of the ensemble, the relationship appeared to be non-monotonic: in some cases, ensembles exhibiting a moderate level of diversity gave a more accurate clustering. Based on this, a procedure for building a cluster ensemble of a chosen type is proposed (assuming that the ensemble relies on one or more random parameters): generate a small random population of cluster ensembles, calculate the diversity of each ensemble, and select the ensemble corresponding to the median diversity. We demonstrate the advantages of both our measure and procedure on 5 data sets and carry out statistical comparisons involving two diversity measures for cluster ensembles from the recent literature. An experiment with 9 data sets was also carried out to examine how the diversity-based selection procedure fares on ensembles of various sizes. In these experiments the classification accuracy was used as the performance criterion. The results suggest that selection by median diversity is no worse, and in some cases better, than building and holding on to a single ensemble.

Introduction

Cluster ensembles emerged recently as a coherent stream out of the multiple classifier systems area [12], [27], [28], [8], [9], [10], [6], [23], [1], [14], [11]. They are deemed to be better than single clustering algorithms for discovering complex or noisy structures in the data. The strongest argument in favour of cluster ensembles is as follows. It is known that the current off-the-shelf clustering methods may suggest very different structures in the same data, a result of the different clustering criteria being optimized. There is no layman's guide to choosing a clustering method for a given data set, so an inexperienced user runs the risk of picking an inappropriate method. Since there is no ground truth against which the result can be matched, there is no way to critique the user's choice. Cluster ensembles provide a more universal solution in that various structures and shapes of clusters present in the data may be discovered by the same ensemble method, and the solution is less dependent upon the chosen ensemble type [27].

Let Z be a data set and let P = {P1, P2, …, PL} be a set of partitions of Z. Each partition is obtained by applying a clustering algorithm to Z or to a subset of it. We assume that the partitions are generated by varying a random parameter of the clustering algorithm, for example starting the algorithm from L random initializations. The clustering algorithm (or run) which produces Pi will be called here an “ensemble member” or “clusterer”. The clusterers may be versions of the same clustering algorithm or different clustering algorithms. For simplicity, the same notation, Pi, will be used both for the clusterer and for the corresponding partition. The goal is to find a single (resultant) partition, P*, based on the information contained in the set P.

The “accuracy” of a clustering algorithm (or a cluster ensemble) is measured by the match between the partition produced and some known ground-truth partition. A reliable ground-truth partition is seldom available, so most experimental studies employ generated data with a pre-specified cluster structure. From the many matching indices suggested in the literature [4], [5], [16], [26], we chose the adjusted Rand index [16] because of the following properties: (1) it has a fixed value of 0 if the two compared partitions are formed independently of one another; (2) in our preliminary experiments, this index showed a greater sensitivity than other indices in picking out good partitions.

Diversity within an ensemble is of vital importance for its success: an ensemble of identical clusterers or classifiers will not outperform the individual ensemble members. However, finding a sensible quantitative measure of diversity in classifier ensembles has been notoriously hard [19], [20], [21]. Here we consider diversity in cluster ensembles. A diversity measure is proposed and its relationship with the accuracy of the ensemble is demonstrated. Based on the results, a procedure is suggested for selecting a cluster ensemble from a small population of ensembles. Both the proposed diversity measure and the match index for the ensemble accuracy are based on the adjusted Rand index.
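To make the selection procedure concrete, the following is a minimal Python sketch. It assumes scikit-learn's k-means and adjusted Rand implementations, homogeneous ensembles built from random initializations, and an averaged pairwise disagreement (the mean of 1 − ar(Pi, Pj) over all pairs) as the diversity measure; the function names, population size, and defaults are our illustration, not the authors' code.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def make_ensemble(Z, L=25, c=20, rng=None):
    """Homogeneous ensemble: L k-means runs from random initializations."""
    rng = np.random.default_rng(rng)
    return [KMeans(n_clusters=c, n_init=1,
                   random_state=int(rng.integers(1 << 31))).fit_predict(Z)
            for _ in range(L)]

def pairwise_diversity(partitions):
    """Averaged pairwise disagreement: mean of 1 - ar(Pi, Pj) over all pairs."""
    return np.mean([1.0 - adjusted_rand_score(a, b)
                    for a, b in combinations(partitions, 2)])

def select_by_median_diversity(Z, n_ensembles=10, **kw):
    """Generate a small random population of ensembles and keep the one
    whose diversity is the median of the population."""
    ensembles = [make_ensemble(Z, **kw) for _ in range(n_ensembles)]
    diversities = [pairwise_diversity(e) for e in ensembles]
    median_idx = int(np.argsort(diversities)[len(diversities) // 2])
    return ensembles[median_idx]
```

The selected ensemble would then be aggregated into a single partition by any consensus method; the sketch deliberately leaves that step out.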

The rest of the paper is organized as follows. Section 2 introduces cluster ensembles. Section 3 contains the proposed diversity measure together with some results on its relationship with the ensemble accuracy. At the end of this section we list the steps of our proposed methodology for selecting a cluster ensemble from a small population. Section 4 offers the results from a statistical comparison of the proposed diversity measure with two other measures due to Fern and Brodley [6] and Greene et al. [13]. Section 5 contains an experiment with 9 data sets looking into the relationship between the performance of the proposed selection method and the ensemble size. Section 6 concludes the study.


Cluster ensembles

There are various ways to build a cluster ensemble:

  • Use different subsets of features (overlapping or disjoint), called feature-distributed clustering in [13], [27], [28].

  • Use different clustering algorithms within the ensemble [15]. Such ensembles are called heterogeneous or hybrid. Ensembles with the same clustering method obtained by varying a random parameter will be called homogeneous.

  • Vary a random parameter of the clustering algorithm. For example, run the k-means clustering method from different random initializations, as in the sketch below.
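As an illustration of the last option, here is a hedged sketch (scikit-learn again, our choice of library) of a homogeneous k-means ensemble in which the random parameter is the number of clusters, drawn independently for each member; the range 2 to 22 mirrors the experimental set-up described later.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_c_ensemble(Z, L=25, c_range=(2, 22), seed=None):
    """Each member runs k-means with its own randomly drawn number of clusters."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(L):
        c = int(rng.integers(c_range[0], c_range[1] + 1))  # c drawn from 2..22
        km = KMeans(n_clusters=c, n_init=1,
                    random_state=int(rng.integers(1 << 31)))
        partitions.append(km.fit_predict(Z))
    return partitions
```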

Diversity measures for cluster ensembles

The adjusted Rand index, needed for both the diversity and the accuracy of the ensemble, is calculated as follows [16]. Let A and B be two partitions of a data set Z with N objects. Let A have $c_A$ clusters and B have $c_B$ clusters. Denote by

  • $N_{ij}$ the number of objects in cluster i in partition A and in cluster j in partition B;

  • $N_{i\cdot}$ the number of objects in cluster i in partition A;

  • $N_{\cdot j}$ the number of objects in cluster j in partition B.

The adjusted Rand index is
$$t_1=\sum_{i=1}^{c_A}\binom{N_{i\cdot}}{2},\qquad t_2=\sum_{j=1}^{c_B}\binom{N_{\cdot j}}{2},\qquad t_3=\frac{2\,t_1\,t_2}{N(N-1)},$$
$$\mathrm{ar}(A,B)=\frac{\sum_{i=1}^{c_A}\sum_{j=1}^{c_B}\binom{N_{ij}}{2}-t_3}{\frac{1}{2}(t_1+t_2)-t_3}.$$
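A direct transcription of these formulas into Python (our own sketch: the contingency table $N_{ij}$ is built with numpy, and the binomial coefficients come from scipy) is given below; for two label vectors it should agree with sklearn.metrics.adjusted_rand_score up to floating-point error.

```python
import numpy as np
from scipy.special import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index ar(A, B) computed from the contingency table N_ij."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    n_ij = np.zeros((len(a_vals), len(b_vals)), dtype=np.int64)
    np.add.at(n_ij, (a_idx, b_idx), 1)      # contingency table N_ij
    n = n_ij.sum()
    t1 = comb(n_ij.sum(axis=1), 2).sum()    # pairs within clusters of A
    t2 = comb(n_ij.sum(axis=0), 2).sum()    # pairs within clusters of B
    t3 = 2.0 * t1 * t2 / (n * (n - 1))      # chance-correction term
    return (comb(n_ij, 2).sum() - t3) / (0.5 * (t1 + t2) - t3)
```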

Experiments

Seven types of homogeneous ensembles were constructed as summarized in Table 2. The two most common types of clusterers were used: k-means and the mean link method (also called average link or average linkage). All ensembles consisted of L = 25 clusterers. The parameters that we varied were:

  • the number of overproduced clusters, c: the value was either fixed at c = 20 or chosen randomly for each ensemble member in the range from 2 to 22;

  • the initialization of k-means (for the k-means-based ensembles);

  • the

Relationship between diversity-selection procedure and the ensemble size

Our final set of experiments seeks to find out how the proposed selection methods behave for various ensemble sizes. The following set-up was used:

  • The ensemble method with the best performance among the studied ensembles was employed.

  • To make the results easier to interpret, the classification accuracy is used as the performance criterion. The classification accuracy is calculated as the proportion of correctly labeled objects, where each cluster is labeled with the class most represented among its objects.
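A hedged sketch of this accuracy calculation (our own code, numpy only; the majority-class relabelling is the standard construction implied above):

```python
import numpy as np

def cluster_accuracy(cluster_labels, true_labels):
    """Label each cluster with its majority true class;
    return the proportion of correctly labeled objects."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    predicted = np.empty_like(true_labels)
    for c in np.unique(cluster_labels):
        members = cluster_labels == c
        classes, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = classes[np.argmax(counts)]  # majority class in cluster c
    return float(np.mean(predicted == true_labels))
```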

Conclusions

Since diversity in classifier and cluster ensembles is a loosely defined concept, there are many ways to specify and measure it. Four indices are proposed here for estimating diversity in cluster ensembles. They are based on an observation in our previous studies [18] that an averaged disagreement measure alone is insufficient. The results in this study support selecting the ensemble with medium diversity from a randomly generated set of ensembles. Two averaged measures of disagreement for

Acknowledgements

This work was supported by research grant # 15035 under the European Joint Project scheme, Royal Society, UK.

References (32)

  • V. Di Gesu, Integrated fuzzy clustering, Fuzzy Sets and Systems (1994)
  • L.I. Kuncheva, Diversity in multiple classifier systems (Editorial), Information Fusion (2005)
  • E. Pekalska et al., Dissimilarity representations allow for building good classifiers, Pattern Recognition Letters (2002)
  • H. Ayad, M. Kamel, Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors, in: T....
  • R.O. Duda et al., Pattern Classification (2001)
  • S. Dudoit et al., Bagging to improve the accuracy of a clustering procedure, Bioinformatics (2003)
  • A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data, in: Proc....
  • E.B. Fowlkes et al., A method for comparing two hierarchical clusterings, Journal of the American Statistical Association (1983)
  • X.Z. Fern, C.E. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, in: Proc....
  • B. Fischer et al., Bagging for path-based clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
  • A. Fred, Finding consistent clusters in data partitions
  • A. Fred, A.K. Jain, Data clustering using evidence accumulation, in: Proc. 16th International Conference on Pattern...
  • A.L.N. Fred, A.K. Jain, Robust data clustering, in: Proc. IEEE Computer Society Conference on Computer Vision and...
  • J. Ghosh, Multiclassifier systems: back to the future
  • D. Greene, A. Tsymbal, N. Bolshakova, P. Cunningham, Ensemble clustering in medical diagnostics, in: R. Long et al....
  • K. Hornik, Cluster ensembles....