Loevinger's measures of rule quality for assessing cluster stability

https://doi.org/10.1016/j.csda.2004.10.012

Abstract

A method is developed for measuring clustering stability under the removal of a few objects from a set of objects to be partitioned. Measures of stability of an individual cluster are defined as Loevinger's measures of rule quality. The stability of an individual cluster can be interpreted as a weighted mean of the inherent stabilities in the isolation and cohesion, respectively, of the examined cluster. The design of the method also enables us to measure the stability of a partition, which can be viewed as a weighted mean of the stability measures of all clusters in the partition. As a consequence, an approach is derived for determining the optimal number of clusters of a partition. Furthermore, using a Monte Carlo test, a significance probability is computed in order to assess how likely any stability measure is under a null model that specifies the absence of cluster stability. To illustrate the potential of the method, stability measures obtained by using the batch K-Means algorithm on artificial data sets and on Iris Data are presented.

Introduction

Many clustering methods of different types have been proposed and are now available to analyze experimental data collected in various scientific disciplines. By comparison, much less attention has been paid to the validation of the results obtained from these methods. Reviews of clustering validation methods can be found in Dubes and Jain (1979), Jain and Dubes (1988), Milligan (1996) and Gordon (1999). The main difficulty in assessing clustering results is that most clustering methods will provide clusters whether these clusters really exist or not. In other words, a clustering result might constitute either an artifact due to the clustering method or, to some extent, a real structure. Different validation approaches exist in order to overcome this difficulty. One type of approach is a general strategy called internal criterion analysis, since it aims to measure the fit between clustering results and the data, using only the data themselves (see, for example, Jain and Dubes (1988) and Gordon (1998)). More precisely, this strategy involves defining a cluster validity index in order to measure the suitability of the clustering structure for the data set examined, and then comparing this measure with the values that would be obtained for data sets of the same size in the absence of a clustering structure (cf., for example, Bailey and Dubes (1982), Lerman (1970) and Gordon (1994)). This comparison is achieved by computing the significance probability (p-value) of the observed value of the cluster validity index when testing the hypothesis that the data set has no clustering structure. Since the distribution function of the cluster validity index under the null hypothesis of absence of structure is not known for most indices and most data sets, this approach requires Monte Carlo simulations in order to approximate the unknown distribution function.
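The Monte Carlo step described above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function names, the convention that larger index values indicate stronger structure, and the uniform null model on the unit hypercube are all assumptions made for illustration.

```python
import random

def monte_carlo_p_value(observed, null_values):
    """Estimate P(index >= observed) under the null from simulated index values.

    Assumes (for this sketch) that larger index values indicate
    stronger clustering structure.
    """
    extreme = sum(1 for value in null_values if value >= observed)
    # Add-one correction: the observed value is counted as one more draw,
    # so the estimate is never exactly zero.
    return (extreme + 1) / (len(null_values) + 1)

def simulate_null_values(index_fn, n_objects, n_dims, n_simulations, seed=0):
    """Evaluate index_fn on data sets drawn uniformly from the unit hypercube.

    The uniform null model is an assumption of this sketch only;
    index_fn is any user-supplied cluster validity index.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_simulations):
        data = [tuple(rng.random() for _ in range(n_dims))
                for _ in range(n_objects)]
        values.append(index_fn(data))
    return values
```

In use, one would compute the index on the real data set, call `simulate_null_values` with the same number of objects and dimensions, and pass both results to `monte_carlo_p_value`.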

Another approach relevant to cluster validation aims to estimate the stability of clustering results. Measuring stability in order to estimate cluster validity provides a general validation framework, since stability measures are defined independently of the choice of the clustering algorithm (see Ben-Hur et al. (2002)). Interest in the stability of clustering results goes back to the advent of clustering methods: for example, see Silvestri and Hill (1964), Rohlf (1965) and Baker (1974). Cluster stability is generally supposed to hold when small changes in the data set have no significant effect on membership of the clusters. A brief review of the literature on classification stability can be found in Cheng and Milligan (1996).

Recently, Roberts (1997), Levine and Domany (2001) and Ben-Hur et al. (2002), using different approaches, have aimed to show that the estimation of partitional stability provides a way to determine the optimal number of clusters in a partitional method. Tibshirani et al. (2001) propose a cross-validation-based approach to determine the number of clusters: for each cluster of some k-partition of the test data, they compute the proportion of pairs of objects in that cluster that would also be clustered together by the clustering obtained by applying the same partitioning algorithm (with k clusters) to some training data set. The minimum of these proportions over the k test clusters is taken as the “prediction strength” of the k-partition of the training set. Although the approach proposed by Tibshirani et al. (2001) and our approach proposed hereafter both use membership lists in order to estimate cluster stability, the two approaches differ both in their aims and in their ways of assessing cluster stability.
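The prediction strength computation summarized above can be sketched as follows. This is only an illustration of the pair-counting step as described in this paragraph: the function name, the two label lists taken as input, and the choice to skip singleton clusters are assumptions, and the step that assigns test objects to training clusters is left to the caller.

```python
from itertools import combinations

def prediction_strength(test_labels, train_based_labels):
    """Minimum, over test clusters, of the proportion of co-clustered pairs.

    test_labels[i]        : cluster of test object i in the k-partition
                            of the test data
    train_based_labels[i] : cluster of test object i according to the
                            clustering derived from the training data
    """
    clusters = {}
    for index, label in enumerate(test_labels):
        clusters.setdefault(label, []).append(index)

    proportions = []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # assumption: singleton clusters contribute no pairs
        together = sum(
            1 for i, j in pairs
            if train_based_labels[i] == train_based_labels[j]
        )
        proportions.append(together / len(pairs))
    # Assumption: return 1.0 when no cluster has at least two objects.
    return min(proportions) if proportions else 1.0
```

For example, with `test_labels = [0, 0, 0, 1, 1]` and `train_based_labels = [0, 0, 1, 1, 1]`, only one of the three pairs in test cluster 0 stays together, so that cluster determines the minimum.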

More precisely, given any partitioning algorithm, our approach involves defining measures of stability of the isolation and the cohesion of an arbitrary cluster of the obtained partition. Furthermore, an estimation of the stability of the obtained partition is proposed by means of a weighted mean of all the stability measures of the individual clusters. Each measure of stability of a cluster is defined as Loevinger's measure (cf. Loevinger (1947)) of the quality of a logical rule that expresses either the degree of isolation or cohesion of the examined cluster. Such a logical rule involves only conditions about membership lists provided by the examined cluster and by the partition of the perturbed data set. Furthermore, each cluster stability measure can be evaluated by estimating the significance probability (p-value) of the cluster stability measure when testing the hypothesis that the data set has no clustering stability. It follows that each stability measure of a cluster is based only on membership lists obtained by analyzing perturbed data sets with the same partitioning algorithm, so the method could be applied with any type of data perturbation, such as random noise added to the data set. Nevertheless, in order to simplify the presentation given in this paper, we consider only the case where data perturbations involve removing a few objects from the data set.
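As a rough illustration of how such a rule-based measure can be computed from membership lists alone, the sketch below uses the standard form of Loevinger's measure for a rule A → B, namely H = 1 − P(A and not B) / (P(A) P(not B)). The specific cohesion rule over pairs of objects ("if both objects belong to the examined cluster, then they are clustered together in the partition of the perturbed data set"), the function names and the input encoding are assumptions made for this sketch; the exact definitions are those given in the paper's own sections.

```python
from itertools import combinations

def loevinger(n_total, n_premise, n_not_conclusion, n_counterexamples):
    """Loevinger's measure H = 1 - P(A and not B) / (P(A) * P(not B)).

    All arguments are counts over the same population of n_total cases.
    Returns None when the measure is undefined (no case satisfies A,
    or no case violates B).
    """
    if n_premise == 0 or n_not_conclusion == 0:
        return None
    return 1.0 - (n_counterexamples * n_total) / (n_premise * n_not_conclusion)

def cohesion_measure(in_cluster, perturbed_labels):
    """Loevinger's measure of an assumed cohesion rule over pairs of objects.

    in_cluster[i]       : True if retained object i belongs to the examined
                          cluster in the partition of the full data set
    perturbed_labels[i] : cluster label of object i in the partition of the
                          perturbed (subsampled) data set
    """
    n_total = n_premise = n_not_conclusion = n_counter = 0
    for i, j in combinations(range(len(in_cluster)), 2):
        premise = bool(in_cluster[i] and in_cluster[j])
        conclusion = perturbed_labels[i] == perturbed_labels[j]
        n_total += 1
        n_premise += premise
        n_not_conclusion += not conclusion
        n_counter += premise and not conclusion
    return loevinger(n_total, n_premise, n_not_conclusion, n_counter)
```

A value of 1 means the rule has no counterexamples in the perturbed partition, while a value near 0 means the rule holds no more often than it would if premise and conclusion were independent.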

The next two sections, namely Sections 2 and 3, provide a detailed description of the stability method. Section 4 contains three illustrations of the method on different data sets, two of which are artificial. The final section provides a short discussion including some recommendations and perspectives.

Section snippets

Cluster stability measures

In this section, we denote by X an arbitrary data set of n objects to be clustered. We assume that a generic k-way partitioning algorithm, say Pk, is run on the data set X, and we denote as P the obtained partition of X into k clusters, in other words P=Pk(X). Our aim is to assess the partition P and any arbitrary cluster of P, on the basis of stability measures that estimate either cluster isolation or cluster cohesion. As a by-product of the definition of these stability measures for

Methodological issues

In this section, we examine different issues about how to infer conclusions from observed values of the stability measures that have been defined in Section 2. First, we notice that each stability measure introduced in Section 2 was defined as the average of N Loevinger's measures, each of them being computed on the basis of the partition of an arbitrary sample X of the data set X. In order to use generic notations that will simplify our presentation, we denote by $\bar{t}_{X,N}$ an arbitrary stability

Applications

We illustrate our approach both on simulated data sets and on the well-known Iris Data. Though our approach does not impose any constraint on the choice of partitional method, we have simplified our illustration by using only the batch K-Means method. We have set the sampling ratio f to 0.8, in accordance with our experience and with the recommendation given in Ben-Hur et al. (2002).
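The perturbation scheme used in these illustrations (drawing samples with ratio f = 0.8 and re-running batch K-Means on each) can be sketched as below. This is a simplified stand-in, not the authors' implementation: the plain Lloyd-style K-Means, the random initialization, the fixed iteration count, the number of samples and all function names are assumptions made for illustration.

```python
import random

def batch_kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd-style batch K-Means on a list of tuples; returns labels."""
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

def subsample_partitions(points, k, f=0.8, n_samples=10, seed=0):
    """Partition n_samples random subsets, each holding a fraction f of the data.

    Returns a list of (indices, labels) pairs, where indices are the
    positions of the retained objects in the original data set.
    """
    rng = random.Random(seed)
    size = round(f * len(points))
    results = []
    for _ in range(n_samples):
        indices = sorted(rng.sample(range(len(points)), size))
        labels = batch_kmeans([points[i] for i in indices], k, rng)
        results.append((indices, labels))
    return results
```

Each returned pair gives the membership list of one perturbed data set, which is the only input a membership-based stability measure needs.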

Conclusion

The method of cluster validation proposed in this paper assesses cluster stability, which includes both the stability of the partition being examined and the stability of each of its clusters. As with other validation methods based on stability (see, for example, Ben-Hur et al. (2002) and Levine and Domany (2001)), there is no restriction on the choice of the k-partitioning algorithm Pk that has generated the partition being assessed. Each stability measure is defined in order to estimate the

Acknowledgements

We want to thank an anonymous referee for having pointed out several issues that must be considered, and Richard Emilion for his helpful comments.

References (27)

  • Cheng, R., et al., 1996. Measuring the influence of individual data points in a cluster analysis. J. Classification.
  • Dubes, R.C., et al., 1987. A test for spatial homogeneity in cluster analysis. J. Classification.
  • Freitas, A.A., 1999. On rule interestingness measures. Knowl. Based Systems J., 12(5–6),...

Cited by (22)

    • Additive trees for the categorization of a large number of objects, with bootstrapping strategy for stability assessment. Application to the free sorting of wine odor terms

      2021, Food Quality and Preference
      Citation Excerpt :

      To evaluate the stability of a partition, it is necessary to measure the stability of its clusters. For this, several authors advocated the use of cohesion and isolation measures (Bertrand & Bel Mufti, 2006; El Moubarki, 2009; Lenca, Meyer, Vaillant, & Lallich, 2008). A cluster is considered cohesive if the objects which belong to it, remain together after a perturbation of the dataset.

    • Using the stability of objects to determine the number of clusters in datasets

      2017, Information Sciences
      Citation Excerpt :

      de Mulder introduced the notions of instability and structure-preserving data elements and proved that the removal of a structure-preserving unstable data element from the dataset is a way for improving the robustness of a clustering solution. Bertrand and Mufti [7] showed how the stability of a given cluster can be characterized using Loevinger's measures of rule quality. The authors defined the stability of a whole partition as a weighted mean of the stability scores of all clusters in the partition.

    • An approach to cluster separability in a partition

      2015, Information Sciences
      Citation Excerpt :

      [24] Similarly, in the literature (see, e.g., [4,5,8,11,13,15,23]), cluster stability in a partition is usually considered as a property of cluster elements, that small perturbations in the data do not significantly influence to which cluster the data belong. Thereby, stability of the partition is usually related to an optimal number of clusters therein.

    • Strong consistency of k-parameters clustering

      2013, Journal of Multivariate Analysis
      Citation Excerpt :

      Moreover, various validation methods are of benefit in checking the soundness of a proposed solution, for instance Bailey and Dubes’s [2] Cluster Validity Profiles or Bertrand and Bel Mufti’s [3] and Hennig’s [17] cluster stability methods.
