An evolutionary technique based on K-Means algorithm for optimal clustering in R^N

https://doi.org/10.1016/S0020-0255(02)00208-6

Abstract

A genetic algorithm-based efficient clustering technique that utilizes the principles of K-Means algorithm is described in this paper. The algorithm called KGA-clustering, while exploiting the searching capability of K-Means, avoids its major limitation of getting stuck at locally optimal values. Its superiority over the K-Means algorithm and another genetic algorithm-based clustering method, is extensively demonstrated for several artificial and real life data sets. A real life application of the KGA-clustering in classifying the pixels of a satellite image of a part of the city of Mumbai is provided.

Introduction

Clustering [1], [2], [3], [4], [5], [6], [7] is an important unsupervised classification technique used to identify inherent structure in a set of objects. The purpose of cluster analysis is to partition objects into subsets that are meaningful in the context of a particular problem. More specifically, in clustering, a set of patterns, usually vectors in a multi-dimensional space, is grouped into clusters in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense.

In some clustering problems, the number of clusters, K, is known a priori. In such situations, clustering may be formulated as the distribution of n patterns in an N-dimensional metric space among K groups, such that patterns in a group are more similar to each other than to patterns in different groups. This involves the minimization of some extrinsic optimization criterion. The K-Means algorithm is a well-known and widely used clustering technique applicable in such situations. However, its major drawback is that it often gets stuck at local minima, and its result depends largely on the choice of the initial cluster centers. An attempt is made in this paper to integrate the effectiveness of the K-Means algorithm for partitioning data into a number of clusters with the capability of genetic algorithms to provide the perturbation required to bring it out of such local minima.
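
The K-Means iteration referred to above can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Plain K-Means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random;
    # the final result depends strongly on this choice.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the centroid of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs: K-Means should recover the grouping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, labels = k_means(X, 2)
```

The local-minimum problem arises because both steps only ever decrease the within-cluster scatter; once a poor configuration is reached, no further iteration can escape it.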

Genetic algorithms (GAs) [8], [9], [10] are randomized search and optimization techniques guided by the principles of evolution and natural genetics. They are efficient, adaptive and robust search processes, performing multi-dimensional search in order to provide near optimal solutions of an evaluation (fitness) function in an optimization problem. Since the problem of clustering may be viewed as searching for a number of clusters in the feature space such that a given clustering metric is optimized, application of GAs to this problem seems natural and appropriate. One such attempt can be found in [11].
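
The evolutionary loop described above, selection, crossover, and mutation driven by a fitness function, can be sketched as follows. This is a generic skeleton with operator choices of our own (truncation selection, one-point crossover, Gaussian mutation), not the specific GA of this paper:

```python
import random

def genetic_search(fitness, init_pop, generations=100, p_mut=0.1):
    """Minimal GA loop over chromosomes represented as lists of floats."""
    pop = [list(c) for c in init_pop]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # fitter chromosomes first
        parents = pop[: max(2, len(pop) // 2)]     # keep the fitter half
        children = [parents[0][:]]                 # elitism: best survives
        while len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0.0, 0.1) if random.random() < p_mut else g
                     for g in child]               # Gaussian mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Toy fitness landscape: maximize -((x-3)^2 + (y-3)^2), optimum at (3, 3).
random.seed(0)
start = [[random.uniform(0.0, 6.0), random.uniform(0.0, 6.0)] for _ in range(30)]
best = genetic_search(lambda c: -sum((g - 3.0) ** 2 for g in c), start)
```

Because selection is population-based and mutation injects random perturbations, such a search is far less prone to stalling at a local optimum than a single greedy descent.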

Murthy and Chowdhury [11] considered a partition encoded as a string of length n, where n is the number of data points; the ith element of the chromosome represents the cluster number to which the corresponding point belongs. A comparison of the performance of their algorithm, subsequently referred to as the GA-clustering algorithm, with that of the K-Means algorithm is provided in [11]. Note that the search space in GAs grows with the string length, making the process more time-consuming. Hence, when the number of points to be clustered is very large, as may happen in many real-life situations, the method proposed in [11] (where the size of a chromosome equals the number of data points) will have limited applicability.

In this paper, we describe a GA-based clustering algorithm where the chromosome encodes the centers of the clusters instead of a possible partition of the data points. (Note that the length of a chromosome in the algorithm is restricted by the number of clusters, rather than the number of data points as in [11].) The algorithm tries to evolve the appropriate cluster centers while optimizing a given clustering metric.
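
The difference between the two encodings can be made concrete. In this sketch (variable names and sample sizes are ours), a label-based chromosome as in [11] grows with the number of points n, while a center-based chromosome grows only with K and the dimension N:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K = 1000, 2, 3   # points, dimensions, clusters (illustrative values)

# GA-clustering [11]: the ith gene is the cluster number of the ith point,
# so the chromosome length equals the number of data points.
label_chromosome = rng.integers(0, K, size=n)                     # length 1000

# Center-based encoding: K centers, each an N-dimensional float vector,
# concatenated into one floating point chromosome of length K*N.
center_chromosome = rng.uniform(0.0, 1.0, size=(K, N)).ravel()    # length 6
```

For n = 1000 points in the plane with K = 3, the label encoding needs 1000 genes while the center encoding needs only 6, which is why the latter scales to large data sets.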

Traditionally, GAs assume no prior knowledge of the problem under consideration; the only step that requires such knowledge is the fitness computation. They are therefore applicable to a wide variety of problems. However, in many situations some additional information about the search space is available, and it can be effectively incorporated in a GA to improve its search capability [12]. In the domain of clustering, it is often assumed that the centroid of the points belonging to a cluster represents the center of that cluster. This knowledge is incorporated in the fitness evaluation process of the GA-based clustering method, yielding the KGA-clustering algorithm and making the search more efficient. The details are presented in Section 3.1.
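
The idea of folding the centroid knowledge into fitness evaluation can be sketched as follows. This is our reading of the scheme (one K-Means-like update performed inside each evaluation); the exact metric and update in the paper may differ, and here the within-cluster sum of squared distances stands in for the clustering metric M:

```python
import numpy as np

def kga_fitness(chromosome, X, k):
    """Evaluate a center-encoding chromosome (illustrative sketch).

    Assign points to their nearest encoded center, replace each center by
    the centroid of its points, and score the result with a metric M
    (here: within-cluster sum of squared distances, smaller is better).
    """
    centers = chromosome.reshape(k, -1)
    # Assignment: each point goes to its nearest encoded center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Domain knowledge: the centroid of a cluster's points is its center.
    centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    m = ((X - centers[labels]) ** 2).sum()
    return centers.ravel(), m

# One evaluation on two well-separated blobs, from a random chromosome.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (25, 2)), rng.normal(4.0, 0.2, (25, 2))])
chrom = rng.uniform(0.0, 4.0, size=4)
new_chrom, m = kga_fitness(chrom, X, 2)
```

Since both the assignment and the centroid update can only decrease M, repeated evaluations of the updated chromosome never worsen the score, which is what sharpens the GA's search.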

Experimental results comparing the performance of the proposed KGA-clustering method with those of the K-Means and the GA-clustering [11] algorithms are provided for several artificial and real-life data sets. Moreover, the utility of the KGA-clustering algorithm for pixel classification of a satellite image for differentiating different land-cover regions is demonstrated. Note that although GAs usually deal with binary strings, we have implemented floating point coding of the chromosomes since it is a more natural form of representation for this problem.


Clustering

In this section, we first provide a formal statement of the clustering problem. Since we have compared the performance of our KGA-clustering algorithm with that of the K-Means algorithm, a brief outline of the latter is also provided.

Clustering using GAs

The algorithm described in this section has been designed for use in areas where the K-Means algorithm has widespread applicability. The KGA-clustering algorithm is first described in detail, followed by a brief outline of the algorithm of Murthy and Chowdhury [11].

Implementation results

Two artificial data sets (Data 1 and Data 2) and three real-life data sets (Vowel, Iris and Crude Oil) are considered for the purpose of conducting the experiments. These data sets are described below.

Discussion and conclusions

A genetic algorithm-based clustering algorithm, called KGA-clustering, has been developed in this paper. A genetic algorithm has been used to search for the cluster centers such that a given clustering metric, M, is minimized. The knowledge that the centroid of the points belonging to a cluster represents the center of that cluster is incorporated in the chromosome for enhancing the searching capability of the clustering method. Floating point representation of chromosomes has been adopted, since it is a more natural form of representation for this problem.

References (20)

  • C.A. Murthy et al., In search of optimal clusters using genetic algorithms, Pattern Recog. Lett. (1996)
  • M.R. Anderberg, Cluster Analysis for Applications (1973)
  • A.K. Jain et al., Algorithms for Clustering Data (1988)
  • J.T. Tou et al., Pattern Recognition Principles (1974)
  • S. Guha et al., CURE: an efficient clustering algorithm for large databases
  • R. Agrawal et al., Automatic subspace clustering of high dimensional data for data mining applications
  • H. Frigui et al., A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Patt. Anal. Machine Intell. (1999)
  • A.K. Jain et al., Statistical pattern recognition: a review, IEEE Trans. Patt. Anal. Machine Intell. (2000)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (1989)
  • T.E. Davis et al., A simulated annealing-like convergence theory for the simple genetic algorithm
