Pattern Recognition

Volume 39, Issue 5, May 2006, Pages 776-788
A partitional clustering algorithm validated by a clustering tendency index based on graph theory

https://doi.org/10.1016/j.patcog.2005.10.027

Abstract

Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. No initial assumptions about the data set are required by the method. The number of clusters and the partition that best fit the data set are selected according to the optimal value of the clustering tendency index.

Introduction

Clustering is a method of data analysis used in many fields, such as pattern recognition (unsupervised learning), biological and ecological sciences (numerical taxonomy), social sciences (typology), graph theory (graph partitioning), psychology, etc. [1]. The main concern of the clustering process is partitioning a given data set into subsets, groups or structures, identifying clusters which reflect the organization of the data set. The clusters must be compact and well separated, presenting a higher degree of similarity between data points belonging to the same cluster than between data points belonging to different clusters. Thus, clustering addresses the problem of summarizing the relationships within a set of objects by representing them as a smaller number of clusters of objects [2].

The heart of clustering analysis is the selection of the clustering method. A method must be selected that is suitable for the kind of structure expected to be present in the data. This decision is important because different clustering methods tend to find different types of cluster structures. In the literature, a wide variety of clustering algorithms have been proposed, which can be broadly classified into the following types: partitional [1], [3], hierarchical [1], [3] and density-based [4], [5] clustering algorithms. Other clustering procedures, such as fuzzy and conceptual clustering, are mentioned in Ref. [3].

The aim of partitional clustering algorithms is to decompose the data set directly into a set of disjoint clusters, obtaining a partition that optimizes a certain criterion. One of the most popular partitional algorithms is the k-means algorithm [6], which attempts to minimize the dissimilarity between each element and the center of its cluster. More recent partitional algorithms include CLARANS [7] and the k-prototype [8], an extension of the k-means algorithm for clustering categorical data. These algorithms depend on the ordering of the elements in the data set and require some initial assumptions, usually the number of clusters the user believes to exist in the data. Moreover, partitional algorithms are generally unable to handle isolated points and to discover clusters with non-convex shapes.
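As a concrete illustration of the criterion k-means minimizes, the following is a minimal sketch of the standard algorithm [6] (variable names and the convergence test are ours, not taken from any particular implementation). Note how the number of clusters k must be supplied up front, which is exactly the initial assumption discussed above:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Standard k-means sketch: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated Gaussian blobs this recovers the natural two-cluster partition, but the result still hinges on the user-supplied k and on the random initialization.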

Another important issue in clustering, related to clustering validity, is the problem of choosing the right number of clusters and, given this number, selecting the partition that best fits a data set. Addressing this problem may not be an easy task if no a priori information exists as to the expected number of clusters in the data. Even when we know the right number of clusters, the generated partitions may not reflect the desired clustering of the data, due to an inappropriate choice of algorithm parameters or a wrong choice of the clustering algorithm itself. Some authors have tried to overcome this problem; mention should be made of EjCluster [9] and AUTOCLASS [10], [11]; other approaches may be found in Refs. [12], [13], [14], [15], [16].

There are many methods for clustering, but these methods are not universal. Due to the wide applicability of cluster analysis, some algorithms are more suitable for some types of data than others. No method is good for all types of data, nor are all methods equally applicable to all problems. Clustering is mostly an unsupervised procedure, where there is no a priori knowledge about the structure of the data set. Almost all clustering algorithms are strongly dependent on the features of the data set and on the values of the input parameters. Thus, the clustering scheme provided by any algorithm is based on certain assumptions and is probably not the “best” one to fit the data set. This is a particularly serious issue since virtually any clustering algorithm will produce partitions for any data set, even random noise data which contain no cluster structure [17]. Further, classifications of the same data set obtained using different clustering criteria can differ markedly from one another [2]. So, clustering algorithms can provide misleading summaries of data, and attention has been devoted to investigating ways of guarding against reaching incorrect conclusions by validating the results of a cluster analysis [2]. Therefore, in most applications, the resulting clustering scheme requires some sort of evaluation as regards its validity. Evaluating and assessing the results of a clustering algorithm is the main subject of cluster validity.

Clustering validation may be accomplished at three levels. First, we must check whether the data set possesses a clustering structure. If this is the case, then one may proceed by applying a clustering algorithm; otherwise, cluster analysis is likely to lead to misleading results. The problem of determining the presence or the absence of a clustering structure is called clustering tendency [1]. The assessment of the clustering process continues with the selection of a “good” clustering algorithm. For example, Ness and Fisher [18], [19], [20] presented a list of properties, called admissible conditions, which one might expect clustering procedures to possess, and stated whether or not these properties were possessed by each of several standard clustering criteria. From background information about the data, the method indicates which clustering criteria could be relevant for the analysis of a particular data set [2].

Due to the lack of precise mathematical formulations for the different concepts in clustering analysis, a formal study of the methodologies in this field has not been accomplished. Graph theory can be a valuable tool to develop models of abstraction for clustering, providing the required mathematical formalism. The basic concepts of graph theory can also be used to develop clustering algorithms and validity indices. In Ref. [21], graphs provide structural models for cluster analysis. In Ref. [22], a clustering algorithm based on an optimal coloring assignment to the vertices of the graph defined on the data set has been proposed. The authors proved that the partition provided by the coloring algorithm that obtains the minimum number of colors is the one of minimum diameter. More recent work is mentioned in Ref. [1], where some clustering algorithms based on minimum spanning trees or on directed trees are proposed.

Our work tries to address some important issues of clustering processes: the determination of the number of clusters in the data set, the robustness with respect to isolated points, the detection of clusters of non-convex shapes and of data sets without a cluster structure, and the assessment of the quality of the clustering results. Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. The number of clusters and the partition that best fit the data set are selected according to the optimal value of the clustering tendency index.

The remainder of the paper is organized as follows. Section 2 starts with a brief description of some graph theory concepts required for a good understanding of the rest of the paper, followed by a description of the proposed method, a partitional clustering algorithm based on graph coloring. Section 3 introduces a clustering tendency index based on k-partite graphs. In Section 4 the performance of our approach is studied and compared with some known clustering algorithms. Section 5 concludes the paper.

A partitional clustering algorithm based on graph theory

Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. No initial assumptions about the data set are required by the method. The partitional algorithm is based on graph coloring and uses an extended greedy algorithm. The number of clusters and the partition that best fit the data set are selected according to the optimal value of the clustering tendency index. The key idea of this index is that there are k well-separated and compact clusters, …
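The general idea of coloring-based clustering, as in Ref. [22], can be sketched loosely as follows. Assuming a control parameter α, join two points with an edge whenever their distance exceeds α; a proper coloring then forces any two points farther apart than α into different color classes, so each class has diameter at most α and can be read as one cluster. This is an illustration of the underlying principle only, not the authors' exact extended greedy algorithm, whose details are in the full text:

```python
import numpy as np

def greedy_color_clusters(X, alpha):
    """Color the threshold graph G_alpha greedily; color classes = clusters.

    Edge (i, j) exists iff dist(i, j) > alpha, so every color class
    (independent set) has diameter <= alpha.
    """
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = d > alpha
    colors = [-1] * n
    # Color vertices in decreasing-degree order (a common greedy heuristic).
    order = sorted(range(n), key=lambda i: -adj[i].sum())
    for i in order:
        used = {colors[j] for j in range(n) if adj[i, j] and colors[j] != -1}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors  # number of distinct colors = number of clusters
```

For two blobs whose internal diameters are below α and whose mutual distance is above it, the threshold graph is complete bipartite and the greedy coloring uses exactly two colors, one per blob.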

Clustering tendency index, IC

To validate the partition provided by the algorithm, a clustering tendency index on a data set is defined next.

Associated with each value of the control parameter (α), a graph can be defined on the data set. As a result of the optimized greedy coloring algorithm applied to the graph, we obtain a k-partite graph, where vertices belonging to different sets may or may not be adjacent. The index we propose in this section identifies the partition that best fits the cluster structure of the …
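The paper's actual IC formula is not reproduced in this snippet, but one simple quantity in the same spirit is the density of between-class edges actually present in the k-partite graph: since an edge joins points farther apart than α, well-separated clusters should realize a large fraction of the possible between-class edges. This is a hypothetical illustration of the kind of graph statistic such an index can build on, not the authors' definition:

```python
import numpy as np

def between_class_edge_density(adj, labels):
    """Fraction of possible between-class pairs joined by an edge.

    adj: symmetric boolean adjacency matrix of the threshold graph.
    labels: color class (cluster) of each vertex.
    """
    labels = np.asarray(labels)
    n = len(labels)
    possible = 0
    present = 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                possible += 1
                present += bool(adj[i, j])
    return present / possible if possible else 0.0
```

A value near 1 means almost every cross-cluster pair is separated by more than α, suggesting compact, well-separated classes; lower values indicate overlap between the sets of the k-partite graph.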

Application of the proposed method and comparison with other methods

In this section, the optimized greedy coloring algorithm and the clustering tendency index are applied to several two-dimensional data sets and to the Iris data set; the results are compared with those obtained by some hierarchical and partitional clustering algorithms.

Concluding remarks

In this paper, we proposed a partitional clustering method and a clustering tendency index based on graph theory. No initial assumptions about the data set are required by the method. The number of clusters and the partition that best fit the data set are selected according to the optimal value of the clustering tendency index.

The proposed methodology has been applied on simulated data sets, so as to evaluate its performance. This study has shown that the method is efficient, in particular, in …

Acknowledgements

The authors thank the referee for the remarks and suggestions, which helped improve the paper.

References (25)

  • S. Theodoridis et al., Pattern Recognition (1999)
  • A.D. Gordon, Clustering validation
  • A.D. Gordon, Classification (1999)
  • E. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial...
  • E. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental clustering for mining in a data warehousing...
  • J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth...
  • R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, in: Proceedings of the 20th...
  • Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, in: Data Mining,...
  • J.A. Garcia et al., A dynamic approach for clustering data, Signal Process. (1994)
  • P. Cheeseman, J. Kelly, M. Self, J. Stutz, Autoclass: a Bayesian classification system, in: Proceedings of the Fifth...
  • P. Cheeseman et al., Bayesian classification autoclass: theory and results
  • G. Milligan et al., An examination of procedures for determining the number of clusters in a data set, Psychometrika (1985)

About the Author—HELENA BRÁS SILVA received the first degree in Applied Mathematics/Computer Science, the M.Sc. degree in Electronic and Computers Engineering, and the Ph.D. degree in Applied Mathematics, all from Porto University, Portugal. From 1992 to 1997 she was a researcher at the Institute of Systems Engineering and Computers in Porto (INESC). Since 1997 she has been an assistant at the Department of Mathematics, Polytechnic School of Engineering of Porto (ISEP), Portugal. Her research interests include Clustering and Graph Theory.

About the Author—PAULA BRITO is an Associate Professor at the School of Economics and a member of the Artificial Intelligence and Data Analysis Group of the University of Porto. She holds a doctorate degree in Applied Mathematics from the University of Paris-IX Dauphine. Her current research interests include data analysis methods, with particular incidence in clustering methods, and the analysis of multidimensional complex data, known as symbolic data.

About the Author—JOAQUIM PINTO DA COSTA received his first degree in Applied Mathematics from Porto University, Porto, Portugal, the M.Sc. degree in Applied Statistics from Oxford University, Oxford, UK, and the Ph.D. degree in Applied Mathematics from the University of Rennes II, Rennes, France, in 1986, 1988 and 1996, respectively. Since October 1996 he has been an Assistant Professor in the Applied Mathematics Department of Porto University, Porto, Portugal. His research interests include Statistical Learning Theory, Pattern Recognition, Discriminant Analysis and Clustering, Data Analysis, Neural Networks, SVMs and Machine Learning.