
Pattern Recognition Letters
Volume 27, Issue 13, 1 October 2006, Pages 1457-1464

Maximum likelihood combination of multiple clusterings

https://doi.org/10.1016/j.patrec.2006.02.013

Abstract

A promising direction for more robust clustering is to derive multiple candidate clusterings over a common set of objects and then combine them into a consolidated one, which is expected to be better than any single candidate. Given a set of candidate clusterings, we show that with a particular pairwise potential used in Markov random fields, the maximum likelihood estimate is the clustering closest to the set in terms of a metric distance between clusterings. To minimize such a distance, we present two combining methods based on a new similarity determined by the whole candidate set. We evaluate them on both artificial and real datasets, with candidate clusterings drawn either from the full space or from subspaces. Experiments show that they not only lead to a closer distance to the candidate set, but also achieve a smaller or comparable distance to the true clustering.

Introduction

Given a set of N data indexed with {1, 2, …, N}, with a prespecified number of clusters K < N, the aim of clustering is to assign each datum to one and exactly one cluster. The assignment can be characterized by a many-to-one mapping, k = C(i), which assigns datum i to the kth cluster. Among all these distinct clusterings, one seeks an optimal clustering C to achieve the required goal. Unfortunately, one cannot exhaust all possible clusterings to find the optimal one, because the number of different clusterings, $S(N,K)=\frac{1}{K!}\sum_{k=1}^{K}(-1)^{K-k}\binom{K}{k}k^{N}$, grows very fast (Jain and Dubes, 1988). For example, S(10, 4) = 34,105 and $S(19,4)\approx 10^{10}$. So practical clustering algorithms only examine a very small fraction of all possible clusterings, with the goal of identifying a small subset that is likely to contain the optimal, or at least a sub-optimal, clustering.
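For concreteness, the count above can be evaluated directly from the inclusion-exclusion formula. The short Python sketch below (the function name num_clusterings is ours, not from the paper) reproduces the two figures quoted in the text.

```python
from math import comb, factorial

def num_clusterings(N: int, K: int) -> int:
    """Number of ways to partition N labelled objects into K non-empty
    clusters (a Stirling number of the second kind), via the
    inclusion-exclusion formula quoted above."""
    total = sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1))
    return total // factorial(K)  # the sum is always divisible by K!

print(num_clusterings(10, 4))   # 34105, as quoted in the text
print(num_clusterings(19, 4))   # 11259666950, roughly 10^10
```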

Seeking more robust clusterings is the primary motivation of our work. As introduced above, clustering is a combinatorially difficult problem and has been extensively studied by the statistics and machine learning communities. High dimensionality, data sparsity and noise make the problem even harder. Although a number of clustering methods have been proposed, none of them is universal enough to perform equally well in all cases (Zhao and Karypis, 2002). Differences in assumptions and contexts across communities have made the transfer of useful generic concepts and methodologies slow to occur (Jain et al., 1999). Since almost all clustering algorithms can only find a sub-optimal solution in practice, a natural question is whether we can obtain a better one by consensus clustering, that is, by combining the outcomes of different clustering algorithms into a consolidated one. Similar problems are studied extensively in multiple classifier systems, where a classifier’s performance can be evaluated on a training set with known class labels. In the case of clustering, however, we have to evaluate the obtained clusterings in an unsupervised way, since we do not know the true clustering.

Distributed clustering is another motivation for our work. In practice, for reasons such as privacy or sheer data size, the whole dataset may be vertically partitioned, possibly with overlap, and each part stored at a different site. Every site contains all entities but only a fraction of the attributes. Clustering has to be performed in every subspace, and only the result can be sent back to a central site for combination. This is called attribute-distributed clustering, and the usefulness of having multiple views of data for better clustering is addressed in Kargupta et al. (2001). With one candidate clustering from every subspace, we need to combine them to form a consolidated clustering, which is expected to be better than any single candidate.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. We formulate the problem in Section 3 and discuss search methods in Section 4. Experimental results are reported in Section 5. Finally we summarize this paper in Section 6.

Section snippets

Related work

Clustering can be regarded as an unsupervised classification problem. For supervised classification, there is an extensive body of work on combining multiple classifiers (Sharkey, 1999, Dietterich, 2001, Ghosh, 2002). Boosting, in particular, has been extensively studied (Schapire, 1990, Freund and Schapire, 1996, Friedman et al., 2000). The key idea is that every classifier’s performance is used to weigh its contribution to the final classification, which is feasible because we know the true

MRF framework

We use a probabilistic model, Markov random fields (MRFs) (Geman and Geman, 1984), to formulate the problem of combining multiple clusterings, so we first briefly review MRFs. An MRF is composed of a field $L=\{l_i\}_{i=1}^{N}$ of random variables. In the usual clustering framework, the variables are the unobservable cluster labels on the observable data points, indicating cluster assignments. Every variable $l_i$ takes values from the set $\{1, \ldots, K\}$, which are the indices of the clusters assigned to the i
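The snippet above is truncated, so the particular pairwise potential is not reproduced here. As a rough illustration of the underlying representation, the Python sketch below assumes the standard pairwise (co-association) encoding of a clustering and a Mirkin-style pairwise-disagreement count as the metric distance between clusterings; both are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def pairwise_matrix(labels):
    """P_C(i, j) = 1 if clustering C puts objects i and j in the same
    cluster, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def pairwise_distance(labels_a, labels_b):
    """Pairwise-disagreement (Mirkin-style) distance: the number of
    object pairs on which the two clusterings disagree.  Used here as
    an assumed stand-in for the metric distance in the paper."""
    Pa, Pb = pairwise_matrix(labels_a), pairwise_matrix(labels_b)
    iu = np.triu_indices(len(labels_a), k=1)  # each unordered pair once
    return int(np.sum(Pa[iu] != Pb[iu]))

# two clusterings of the same five objects disagree on 4 of the 10 pairs
print(pairwise_distance([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]))  # 4
```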

Direct construction

$D(L, \Phi)$ is minimized if we minimize $\sum_{m=1}^{M}V_{L,C_m}(i,j)$ for every pair (i, j). This can be achieved if we set the combined pairwise relationship matrix $P_\Phi$ as $P_\Phi(i, j) = 1$ if $\sum_{m=1}^{M}P_{C_m}(i,j) > M/2$, and $P_\Phi(i, j) = 0$ if $\sum_{m=1}^{M}P_{C_m}(i,j) < M/2$. That is, set $P_\Phi(i, j) = 1$ if the majority of candidates assign them together, 0 otherwise. $P_\Phi(i, j)$ is undetermined if there is a tie. Without any conflicting constraint in $P_L$, we can directly construct a clustering from $P_\Phi$ by assigning objects to the same cluster if the
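A minimal sketch of this direct construction is given below, assuming the majority-vote rule above and reading clusters off as connected components of the resulting pairwise graph; ties and conflicting constraints are deliberately not handled in this sketch.

```python
import numpy as np

def direct_construction(candidate_labels):
    """Sketch of the direct construction: P_Phi(i, j) = 1 when a strict
    majority of the M candidates assign objects i and j to the same
    cluster; the combined clustering is then read off as the connected
    components of that graph (ties and conflicts are ignored here)."""
    candidates = [np.asarray(c) for c in candidate_labels]
    M, N = len(candidates), len(candidates[0])
    votes = sum((c[:, None] == c[None, :]).astype(int) for c in candidates)
    P_phi = votes > M / 2                      # majority vote per pair

    # union-find over the majority graph
    parent = list(range(N))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(N):
        for j in range(i + 1, N):
            if P_phi[i, j]:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(N)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

candidates = [[0, 0, 1, 1], [0, 0, 1, 2], [1, 1, 0, 0]]
print(direct_construction(candidates))   # [0, 0, 1, 1]
```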

Experimental evaluation

In the following experiments, the two graph partitioning-based methods, shared neighbors (SN) and joint-cluster (JC), achieve varying degrees of success in combining candidate clusterings from either the full space or subspaces. For comparison, we also implement the greedy search (GS) method at the resolution of joint-clusters. Let us examine their worst-case time complexity. Suppose we have M candidate clusterings, each partitioning N data into K clusters. Assuming linear complexity for graph

Conclusion

In this paper we investigated how to combine multiple clusterings over a common set of objects into a consolidated one. First we showed that with a particular pairwise potential used in MRFs, the maximum likelihood estimate is the clustering closest to the candidate set in the sense of a metric distance between clusterings. The problem is thus transformed into an optimization problem of minimizing this distance to the candidate set. We presented two combining methods at different

References (19)

  • Frossyniotis, D. et al., 2004. A clustering method based on boosting. Pattern Recognit. Lett.
  • Dietterich, T.G., 2001. Ensemble methods in machine learning. In: Proc. of the 2nd Int. Workshop on Multiple Classifier...
  • Fayyad, U.M., Reina, C., Bradley, P.S., 1998. Initialization of iterative refinement clustering algorithms. In: Proc....
  • Fisher, D., 1996. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res.
  • Fred, A.L.N., Jain, A.K., 2002. Evidence accumulation clustering based on the k-means algorithm. In: Proc. of the Joint...
  • Freund, Y., Schapire, R., 1996. Experiments with a new boosting algorithm. In: Proc. of the 13th Int. Conf. on Machine...
  • Friedman, J.H. et al., 2000. Additive logistic regression: A statistical view of boosting. Ann. Statist.
  • Frossyniotis, D., Pertselakis, M., Stafylopatis, A., 2002. A multi-clustering fusion algorithm. In: Proc. of the 2nd...
  • Geman, S. et al., 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell.
There are more references available in the full text version of this article.

Cited by (14)

  • Entropy based probabilistic collaborative clustering

    2017, Pattern Recognition
    Citation Excerpt:

    The main limitation of this approach is that it only enables Fuzzy C-Means algorithms to collaborate together, and furthermore some methods even require that all of them be looking for the same number of clusters. Similar approaches were used to develop several other collaborative-like methods CoEM [17], CoFKM, [20], and another collaborative EM-like algorithm [21] based on Markov Random Fields. The work of Pedrycz on the CoFC algorithm was also extended to be adapted to the Self-Organizing Maps (SOM) [11,22,23] and to the Generative Topographic Maps (GTM) [24].

  • A hierarchical clusterer ensemble method based on boosting theory

    2013, Knowledge-Based Systems
    Citation Excerpt:

    A brief explanation of these methods is given in Section 2.1. In the area of clustering methods, various techniques are proposed to create clusterer ensembles [17–21], and some of them are based on bagging and boosting [3,22,23]. These methods usually use partitional (non-hierarchical) clustering algorithms to create basic clusterings [7].

  • Collaborative clustering with background knowledge

    2010, Data and Knowledge Engineering
    Citation Excerpt:

    The proposed algorithm provides a consensus clustering method as well as correspondence matrices that give the relations between the clusterings of the ensemble. Hu et al. [16] proposed a method which uses Markov random fields and maximum likelihood estimation to define a metric distance between clusterings. They present two combining methods based on this new similarity to find a consensus.
