Elsevier

Pattern Recognition

Volume 33, Issue 9, September 2000, Pages 1455-1465

Genetic algorithm-based clustering technique

https://doi.org/10.1016/S0031-3203(99)00137-5

Abstract

A genetic algorithm-based clustering technique, called GA-clustering, is proposed in this article. The searching capability of genetic algorithms is exploited in order to search for appropriate cluster centres in the feature space such that a similarity metric of the resulting clusters is optimized. The chromosomes, which are represented as strings of real numbers, encode the centres of a fixed number of clusters. The superiority of the GA-clustering algorithm over the commonly used K-means algorithm is extensively demonstrated for four artificial and three real-life data sets.

Introduction

Genetic algorithms (GAs) [1], [2], [3], [4] are randomized search and optimization techniques guided by the principles of evolution and natural genetics, having a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.

In GAs, the parameters of the search space are encoded in the form of strings (called chromosomes). A collection of such strings is called a population. Initially, a random population is created, which represents different points in the search space. An objective/fitness function, representing the degree of goodness of the string, is associated with each string. Based on the principle of survival of the fittest, a few of the strings are selected and each is assigned a number of copies that go into the mating pool. Biologically inspired operators like crossover and mutation are applied on these strings to yield a new generation of strings. The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied. An excellent survey of GAs, along with the programming structure used, can be found in Ref. [4]. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job-shop scheduling, etc. [5], [6], [7], [8], [9], [10].
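The generational loop described above — selection, crossover and mutation repeated until a termination condition — can be sketched as follows. This is a minimal illustration only; the population size, operator rates and toy fitness function are our own assumptions, not values from this article:

```python
import random

random.seed(0)  # reproducible run

POP_SIZE, GENS, P_CROSS, P_MUT = 20, 50, 0.8, 0.05

def fitness(chrom):
    # Toy objective to maximize: the negative sphere function.
    return -sum(g * g for g in chrom)

def select(pop):
    # Binary tournament selection: the fitter of two random strings survives.
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def crossover(p1, p2):
    # Single-point crossover with probability P_CROSS.
    if random.random() < P_CROSS:
        cut = random.randrange(1, len(p1))
        return p1[:cut] + p2[cut:]
    return p1[:]

def mutate(chrom):
    # Perturb each gene with probability P_MUT (real-coded mutation).
    return [g + random.gauss(0, 0.1) if random.random() < P_MUT else g
            for g in chrom]

# Random initial population of 3-gene real-valued chromosomes.
pop = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(POP_SIZE)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop)))
           for _ in range(POP_SIZE)]
best = max(pop, key=fitness)
```

Tournament selection stands in here for the fitness-proportional selection implied by "survival of the fittest"; either scheme fits the loop structure described in the text.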

In the area of pattern recognition, there are many tasks involved in the process of analyzing/identifying a pattern which need appropriate parameter selection and efficient search in complex spaces in order to obtain optimum solutions. Therefore, the application of GAs for solving certain problems of pattern recognition (which need optimization of computation requirements, and robust, fast and close approximate solution) appears to be appropriate and natural. Research articles in this area have started to come out [11], [12]. Recently, an application of GAs has been reported in the area of (supervised) pattern classification in RN [13], [14] for designing a GA-classifier. It attempts to approximate the class boundaries of a given data set with a fixed number (say H) of hyperplanes in such a manner that the associated misclassification of data points is minimized during training.

When the only data available are unlabeled, the classification problems are sometimes referred to as unsupervised classification. Clustering [15], [16], [17], [18], [19] is an important unsupervised classification technique where a set of patterns, usually vectors in a multi-dimensional space, are grouped into clusters in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense. For this it is necessary to first define a measure of similarity which will establish a rule for assigning patterns to the domain of a particular cluster centre. One such measure of similarity may be the Euclidean distance D between two patterns x and z, defined by D = ||x − z||. The smaller the distance between x and z, the greater the similarity between the two, and vice versa.
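The assignment rule this distance induces can be sketched in a few lines; the function names and the example centres below are hypothetical, for illustration only:

```python
import math

def euclidean(x, z):
    # D = ||x - z||: the dissimilarity measure used throughout.
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def nearest_centre(x, centres):
    # Assign pattern x to the domain of its closest cluster centre.
    return min(range(len(centres)), key=lambda i: euclidean(x, centres[i]))

centres = [(0.0, 0.0), (10.0, 10.0)]       # two hypothetical centres
nearest_centre((1.0, 2.0), centres)        # → 0 (closer to the origin)
```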

Several clustering techniques are available in the literature [19], [20]. Some, like the widely used K-means algorithm [19], optimize the distance criterion either by minimizing the within-cluster spread (as implemented in this article), or by maximizing the inter-cluster separation. Other techniques, like the graph-theoretical approach, hierarchical approach, etc., are also available which perform clustering based on other criteria. These are discussed in brief in Section 2. Extensive studies dealing with comparative analysis of different clustering methods [21] suggest that there is no general strategy which works equally well in different problem domains. However, it has been found that it is usually beneficial to run schemes that are simpler, and execute them several times, rather than using schemes that are very complex but need to be run only once [21]. Since our aim is to propose a clustering technique based on GAs, a criterion is required whose optimization would provide the final clusters. An intuitively simple criterion is the within-cluster spread, which, as in the K-means algorithm, needs to be minimized for good clustering. However, unlike the K-means algorithm, which may get stuck at values which are not optimal [22], the proposed technique should be able to provide good results irrespective of the starting configuration. It is towards this goal that we have integrated the simplicity of the K-means algorithm with the capability of GAs in avoiding local optima for developing a GA-based clustering technique called the GA-clustering algorithm. It is known that the elitist model of GAs provides the optimal string as the number of iterations goes to infinity [23], when the probability of going from any population to the one containing the optimal string is greater than zero. Therefore, under limiting conditions, a GA-based clustering technique is also expected to provide an optimal clustering with respect to the clustering metric being considered.
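For reference, the K-means procedure discussed above can be sketched as a plain Lloyd-style iteration (our own minimal implementation, not the exact variant used in the article). Its dependence on the randomly chosen initial centres is precisely the local-optimum weakness noted in the text:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Plain K-means: alternate assignment and centre-update steps.
    # The result depends on the random initial centres, which is why
    # the algorithm can get stuck at values which are not optimal.
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre (squared distance).
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centres[j])))
            clusters[i].append(p)
        # Recompute each centre as the mean of its cluster.
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]
        if new == centres:
            break  # converged
        centres = new
    return centres
```

Running the scheme several times from different seeds and keeping the best result is the "run simple schemes several times" strategy mentioned above.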

Experimental results comparing the GA-clustering algorithm with the K-means algorithm are provided for several artificial and real-life data sets. Since our purpose is to demonstrate the effectiveness of the proposed technique for a wide variety of data sets, we have chosen artificial and real-life data sets with both overlapping and non-overlapping class boundaries, where the number of dimensions ranges from two to ten and the number of clusters ranges from two to nine. Note that we are encoding the centres of the clusters, which will be floating-point numbers, in the chromosomes. One way in which this could have been implemented is by performing real representation with a binary encoding [24]. However, in order to keep the mapping between the actual cluster centres and the encoded centres straightforward, we have implemented real-coded GAs here for convenience [3]. (In this context one may note the observations in Ref. [25], whose authors experimentally compared binary and floating-point representations in GAs and found that the floating-point representation was faster and more consistent, and provided a higher degree of precision.)
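A minimal sketch of this real-coded representation, assuming a chromosome is simply the K centres concatenated into a flat string of K·N reals (the function names are ours, and the article's exact initialisation scheme may differ):

```python
import random

def random_chromosome(points, k, seed=None):
    # One common initialisation: seed the centres with k distinct
    # data points, then flatten them into a single string of reals.
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    return [coord for centre in centres for coord in centre]

def decode(chrom, n_dims):
    # Recover the list of K centres from the flat string of K*N reals.
    return [tuple(chrom[i:i + n_dims])
            for i in range(0, len(chrom), n_dims)]

chrom = random_chromosome([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], k=2, seed=1)
centres = decode(chrom, n_dims=2)  # two 2-D centres
```

Because the genes are the centre coordinates themselves, crossover and mutation operate directly on real values, keeping the mapping between chromosome and clustering straightforward.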

Section snippets

Clustering

Clustering in N-dimensional Euclidean space RN is the process of partitioning a given set of n points into a number, say K, of groups (or clusters) based on some similarity/dissimilarity metric. Let the set of n points {x1, x2, …, xn} be represented by the set S and the K clusters be represented by C1, C2, …, CK. Then

Ci ≠ ∅ for i = 1, …, K,

Ci ∩ Cj = ∅ for i, j = 1, …, K and i ≠ j, and

C1 ∪ C2 ∪ … ∪ CK = S.

Some clustering techniques that are available in the literature are the K-means algorithm [19], the branch and bound procedure

Basic principle

The searching capability of GAs has been used in this article for the purpose of appropriately determining a fixed number K of cluster centres in RN, thereby suitably clustering the set of n unlabelled points. The clustering metric that has been adopted is the sum of the Euclidean distances of the points from their respective cluster centres. Mathematically, the clustering metric M for the K clusters C1, C2, …, CK with centres z1, z2, …, zK is given by

M(C1, C2, …, CK) = Σi=1K Σxj∈Ci ||xj − zi||.

The task of the GA is to search for
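The metric M above is cheap to compute directly. A small sketch, assuming the clusters are given as lists of points paired with their corresponding centres (the function name is ours):

```python
import math

def clustering_metric(clusters, centres):
    # M = sum over clusters C_i, sum over x_j in C_i, of ||x_j - z_i||:
    # the within-cluster spread that the GA minimises.
    return sum(math.dist(x, z)
               for cl, z in zip(clusters, centres)
               for x in cl)

# e.g. two clusters with hypothetical centres (0, 1) and (5, 5):
m = clustering_metric([[(0, 0), (0, 2)], [(5, 5)]], [(0, 1), (5, 5)])  # → 2.0
```

This M is the fitness-related quantity evaluated for every chromosome: decode the centres, assign each point to its nearest centre, then sum the distances as above.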

Implementation results

The experimental results comparing the GA-clustering algorithm with the K-means algorithm are provided for four artificial data sets (Data 1, Data 2, Data 3 and Data 4) and three real-life data sets (Vowel, Iris and Crude Oil). These are first described below:

Discussion and conclusions

A genetic algorithm-based clustering algorithm, called GA-clustering, has been developed in this article. A genetic algorithm has been used to search for the cluster centres which minimize the clustering metric M. In order to demonstrate the effectiveness of the GA-clustering algorithm in providing optimal clusters, several artificial and real-life data sets with the number of dimensions ranging from two to ten and the number of clusters ranging from two to nine have been considered. The

Summary

Clustering is an important unsupervised classification technique where a set of patterns, usually vectors in a multi-dimensional space, are grouped into clusters in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense. For this it is necessary to first define a measure of similarity which will establish a rule for assigning patterns to the domain of a particular cluster centre. One such measure of similarity

Acknowledgements

This work was carried out when Ms. Sanghamitra Bandyopadhyay held the Dr. K. S. Krishnan fellowship awarded by the Department of Atomic Energy, Govt. of India. The authors acknowledge the reviewer whose valuable comments helped immensely in improving the quality of the article.


References (32)

  • S. Forrest (Ed.), Proceedings of the Fifth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo,...
  • L.J. Eshelman (Ed.), Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo,...
  • E.S. Gelsema (Ed.), Special Issue on Genetic Algorithms, Pattern Recognition Letters, vol. 16(8), Elsevier Sciences...
  • S.K. Pal et al.

    Genetic algorithms for generation of class boundaries

    IEEE Trans. Systems, Man Cybernet.

    (1998)
  • M.R. Anderberg

    Cluster Analysis for Applications

    (1973)

About the Author—UJJWAL MAULIK did his Bachelors in Physics and Computer Science in 1986 and 1989, respectively. Subsequently, he did his Masters and Ph.D. in Computer Science in 1991 and 1997, respectively, from Jadavpur University, India. Dr. Maulik visited the Center for Adaptive Systems Applications, Los Alamos, New Mexico, USA in 1997. He is currently the Head of the Department of Computer Science, Kalyani Engineering College, India. His research interests include Parallel Processing and Interconnection Networks, Natural Language Processing, Evolutionary Computation and Pattern Recognition.

About the Author—SANGHAMITRA BANDYOPADHYAY did her Bachelors in Physics and Computer Science in 1988 and 1991, respectively. Subsequently, she did her Masters in Computer Science from Indian Institute of Technology, Kharagpur in 1993 and Ph.D. in Computer Science from Indian Statistical Institute, Calcutta in 1998. Dr. Bandyopadhyay is the recipient of the Dr. Shanker Dayal Sharma Gold Medal and Institute Silver Medal for being adjudged the best all-round postgraduate performer in 1993. She visited Los Alamos National Laboratory in 1997. She is currently on a postdoctoral assignment at the University of New South Wales, Sydney, Australia. Her research interests include Evolutionary Computation, Pattern Recognition, Parallel Processing and Interconnection Networks.

1 On leave from Indian Statistical Institute.
