An evolutionary technique based on K-Means algorithm for optimal clustering in R^N

https://doi.org/10.1016/S0020-0255(02)00208-6

Abstract

A genetic algorithm-based efficient clustering technique that utilizes the principles of K-Means algorithm is described in this paper. The algorithm called KGA-clustering, while exploiting the searching capability of K-Means, avoids its major limitation of getting stuck at locally optimal values. Its superiority over the K-Means algorithm and another genetic algorithm-based clustering method, is extensively demonstrated for several artificial and real life data sets. A real life application of the KGA-clustering in classifying the pixels of a satellite image of a part of the city of Mumbai is provided.

Introduction

Clustering [1], [2], [3], [4], [5], [6], [7] is an important unsupervised classification technique used to identify inherent structure in a set of objects. The purpose of cluster analysis is to partition objects into subsets that are meaningful in the context of a particular problem. More specifically, in clustering, a set of patterns, usually vectors in a multi-dimensional space, is grouped into clusters in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense.

In some clustering problems, the number of clusters, K, is known a priori. In such situations, clustering may be formulated as the distribution of n patterns in an N-dimensional metric space among K groups, such that patterns in a group are more similar to each other than to patterns in different groups. This involves the minimization of some extrinsic optimization criterion. The K-Means algorithm is a well-known and widely used clustering technique applicable in such situations. However, its major drawback is that it often gets stuck at local minima, and its result depends largely on the choice of the initial cluster centers. An attempt is made in this paper to integrate the effectiveness of the K-Means algorithm for partitioning data into a number of clusters with the capability of genetic algorithms to provide the perturbation required to bring it out of such local minima.
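
The K-Means iteration referred to above can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Plain K-Means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random;
    # the final result depends strongly on this choice.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the centroid of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs: K-Means should recover the grouping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, labels = k_means(X, 2)
```

The local-minimum problem arises because both steps only ever decrease the within-cluster scatter; once a poor configuration is reached, no further iteration can escape it.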

Genetic algorithms (GAs) [8], [9], [10] are randomized search and optimization techniques guided by the principles of evolution and natural genetics. They are efficient, adaptive and robust search processes, performing multi-dimensional search in order to provide near optimal solutions of an evaluation (fitness) function in an optimization problem. Since the problem of clustering may be viewed as searching for a number of clusters in the feature space such that a given clustering metric is optimized, application of GAs to this problem seems natural and appropriate. One such attempt can be found in [11].
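
The evolutionary loop described above, selection, crossover, and mutation driven by a fitness function, can be sketched as follows. This is a generic skeleton with operator choices of our own (truncation selection, one-point crossover, Gaussian mutation), not the specific GA of this paper:

```python
import random

def genetic_search(fitness, init_pop, generations=100, p_mut=0.1):
    """Minimal GA loop over chromosomes represented as lists of floats."""
    pop = [list(c) for c in init_pop]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # fitter chromosomes first
        parents = pop[: max(2, len(pop) // 2)]     # keep the fitter half
        children = [parents[0][:]]                 # elitism: best survives
        while len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0.0, 0.1) if random.random() < p_mut else g
                     for g in child]               # Gaussian mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Toy fitness landscape: maximize -((x-3)^2 + (y-3)^2), optimum at (3, 3).
random.seed(0)
start = [[random.uniform(0.0, 6.0), random.uniform(0.0, 6.0)] for _ in range(30)]
best = genetic_search(lambda c: -sum((g - 3.0) ** 2 for g in c), start)
```

Because selection is population-based and mutation injects random perturbations, such a search is far less prone to stalling at a local optimum than a single greedy descent.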

Murthy and Chowdhury [11] considered a partition encoded as a string of length n, where n is the number of data points; the ith element of the chromosome represents the cluster number to which the corresponding point belongs. A comparison of the performance of their algorithm, subsequently referred to as the GA-clustering algorithm, with that of the K-Means algorithm is provided in [11]. Note that the search space in GAs grows with the string length, making the process more time-consuming. Hence, when the number of points to be clustered is very large, as may happen in many real-life situations, the method proposed in [11] (where the size of a chromosome equals the number of data points) will have limited applicability.

In this paper, we describe a GA-based clustering algorithm where the chromosome encodes the centers of the clusters instead of a possible partition of the data points. (Note that the length of a chromosome in the algorithm is restricted by the number of clusters, rather than the number of data points as in [11].) The algorithm tries to evolve the appropriate cluster centers while optimizing a given clustering metric.
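
The difference between the two encodings can be made concrete. In this sketch (variable names and sample sizes are ours), a label-based chromosome as in [11] grows with the number of points n, while a center-based chromosome grows only with K and the dimension N:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K = 1000, 2, 3   # points, dimensions, clusters (illustrative values)

# GA-clustering [11]: the ith gene is the cluster number of the ith point,
# so the chromosome length equals the number of data points.
label_chromosome = rng.integers(0, K, size=n)                     # length 1000

# Center-based encoding: K centers, each an N-dimensional float vector,
# concatenated into one floating point chromosome of length K*N.
center_chromosome = rng.uniform(0.0, 1.0, size=(K, N)).ravel()    # length 6
```

For n = 1000 points in the plane with K = 3, the label encoding needs 1000 genes while the center encoding needs only 6, which is why the latter scales to large data sets.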

Traditionally, GAs assume no prior knowledge of the problem under consideration; the only step that requires such knowledge is the fitness computation. They are therefore applicable to a wide variety of problems. However, in many situations some additional information about the search space is available, and it can be effectively incorporated in a GA to improve its search capability [12]. In the domain of clustering, it is often assumed that the centroid of the points belonging to a cluster represents the center of that cluster. This knowledge is incorporated in the fitness evaluation process of the GA-based clustering method, yielding the KGA-clustering algorithm and making the search more efficient. The details are presented in Section 3.1.
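
The idea of folding the centroid knowledge into fitness evaluation can be sketched as follows. This is our reading of the scheme (one K-Means-like update performed inside each evaluation); the exact metric and update in the paper may differ, and here the within-cluster sum of squared distances stands in for the clustering metric M:

```python
import numpy as np

def kga_fitness(chromosome, X, k):
    """Evaluate a center-encoding chromosome (illustrative sketch).

    Assign points to their nearest encoded center, replace each center by
    the centroid of its points, and score the result with a metric M
    (here: within-cluster sum of squared distances, smaller is better).
    """
    centers = chromosome.reshape(k, -1)
    # Assignment: each point goes to its nearest encoded center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Domain knowledge: the centroid of a cluster's points is its center.
    centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    m = ((X - centers[labels]) ** 2).sum()
    return centers.ravel(), m

# One evaluation on two well-separated blobs, from a random chromosome.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (25, 2)), rng.normal(4.0, 0.2, (25, 2))])
chrom = rng.uniform(0.0, 4.0, size=4)
new_chrom, m = kga_fitness(chrom, X, 2)
```

Since both the assignment and the centroid update can only decrease M, repeated evaluations of the updated chromosome never worsen the score, which is what sharpens the GA's search.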

Experimental results comparing the performance of the proposed KGA-clustering method with those of the K-Means and the GA-clustering [11] algorithms are provided for several artificial and real-life data sets. Moreover, the utility of the KGA-clustering algorithm for pixel classification of a satellite image for differentiating different land-cover regions is demonstrated. Note that although GAs usually deal with binary strings, we have implemented floating point coding of the chromosomes since it is a more natural form of representation for this problem.


Clustering

In this section, we first provide a formal statement of the clustering problem. Since we have compared the performance of our KGA-clustering algorithm with that of the K-Means algorithm, a brief outline of the latter is also provided.

Clustering using GAs

The algorithm described in this section has been designed for use in areas where the K-Means algorithm has widespread applicability. The KGA-clustering algorithm is first described in detail, followed by a brief outline of the algorithm of Murthy and Chowdhury [11].

Implementation results

Two artificial data sets (Data 1 and Data 2) and three real-life data sets (Vowel, Iris and Crude Oil) are considered for the purpose of conducting the experiments. These data sets are described below.

Discussion and conclusions

A genetic algorithm-based clustering algorithm, called KGA-clustering, has been developed in this paper. A genetic algorithm has been used to search for the cluster centers such that a given clustering metric, M, is minimized. The knowledge that the centroid of the points belonging to a cluster represents the center of that cluster is incorporated in the chromosome for enhancing the searching capability of the clustering method. Floating point representation of chromosomes has been adopted, since it is a more natural form of representation for this problem.

References (20)

  • C.A. Murthy et al., In search of optimal clusters using genetic algorithms, Pattern Recog. Lett. (1996)
  • M.R. Anderberg, Cluster Analysis for Applications (1973)
  • A.K. Jain et al., Algorithms for Clustering Data (1988)
  • J.T. Tou et al., Pattern Recognition Principles (1974)
  • S. Guha et al., CURE: an efficient clustering algorithm for large databases
  • R. Agrawal et al., Automatic subspace clustering of high dimensional data for data mining applications
  • H. Frigui et al., A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Patt. Anal. Machine Intell. (1999)
  • A.K. Jain et al., Statistical pattern recognition: a review, IEEE Trans. Patt. Anal. Machine Intell. (2000)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (1989)
  • T.E. Davis et al., A simulated annealing-like convergence theory for the simple genetic algorithm
