Design of hybrids for the minimum sum-of-squares clustering problem

doi:10.1016/S0167-9473(02)00224-4

Computational Statistics & Data Analysis

Volume 43, Issue 2, 28 June 2003, Pages 235-248

https://doi.org/10.1016/S0167-9473(02)00224-4

Abstract

A series of metaheuristic algorithms is proposed and analyzed for the non-hierarchical clustering problem under the criterion of minimum sum-of-squares clustering. These algorithms incorporate genetic operators and local search and tabu search procedures. The aim is to obtain quality solutions with short computation times. A series of computational experiments has been performed. The proposed algorithms obtain better results than previously reported methods, especially with a small number of clusters.

Introduction

Consider a set X={x₁,x₂,…,x_N} of N points in R^q and let m be a predetermined positive integer. The minimum sum-of-squares clustering (MSSC) problem is to find a partition of X into m disjoint subsets (clusters) so that the sum of squared distances from each point to the centroid of its cluster is minimum. Specifically, let P_m denote the set of all the partitions of X into m sets, where each partition P∈P_m is defined as P=(C₁,C₂,…,C_m) and where C_i denotes each of the clusters that form P. Thus, the problem can be expressed as $\min_{P \in P_m} \sum_{i=1}^{m} \sum_{x_l \in C_i} \|x_l - \bar{x}_i\|^2,$ where the centroid $\bar{x}_i$ is defined as $\bar{x}_i = \frac{1}{n_i} \sum_{x_l \in C_i} x_l$ with $n_i = |C_i|$.

Equivalently, the problem can be written as $\min \sum_{l=1}^{N} \|x_l - \bar{x}_{c(l)}\|^2,$ where c(l) is the cluster to which point x_l belongs.
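As an illustration of the objective above, the following minimal sketch evaluates the sum of squared distances for a given assignment of points to clusters. The function name `mssc_objective` and the arguments `points` and `labels` are assumptions introduced here for illustration; `labels[l]` plays the role of c(l), with clusters numbered from 0 to m-1.

```python
def mssc_objective(points, labels, m):
    """Sum of squared distances from each point to the centroid of its cluster.

    points: list of q-dimensional tuples; labels[l] is the cluster index c(l)
    of point l, in 0..m-1. Both names are illustrative, not from the paper.
    """
    q = len(points[0])
    sums = [[0.0] * q for _ in range(m)]   # coordinate sums per cluster
    counts = [0] * m                       # n_i = |C_i|
    for x, c in zip(points, labels):
        counts[c] += 1
        for d in range(q):
            sums[c][d] += x[d]
    total = 0.0
    for x, c in zip(points, labels):
        # centroid coordinate d of cluster c is sums[c][d] / counts[c]
        total += sum((x[d] - sums[c][d] / counts[c]) ** 2 for d in range(q))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(mssc_objective(points, [0, 0, 1, 1], 2))  # 4.0
print(mssc_objective(points, [0, 1, 0, 1], 2))  # 100.0
```

On this toy data set, grouping the two left-hand points and the two right-hand points gives a much smaller objective value than pairing points across the gap.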

The design of clusters is a well-known exploratory data analysis issue called pattern recognition. The aim is to find whether a given set of cases X has some structure and, if so, to display it in the form of a partition. This problem belongs to the area of non-hierarchical cluster design, which has many applications in economics and in the social and natural sciences. It is known to be NP-hard (Brucker, 1978).

Various exact methods for MSSC can be found in the literature (see, for example, Koontz et al., 1975; Diehr, 1985), some of which, such as the method proposed by du Merle et al. (2000), have succeeded in resolving problems with up to 150 points. For larger-sized problems the use of heuristic algorithms is still necessary. The most popular are those based on local search methods, such as the well-known K-means (Jancey, 1966) and H-means (Howard, 1966) procedures. In a recent work, Hansen and Mladenovic (2001) propose a new local search procedure, J-means, along with variants H-means+ or HK-means. In recent years algorithms using metaheuristic strategies have been designed, such as simulated annealing (Klein and Dubes, 1989), tabu search (TS) (Al-Sultan, 1995), genetic algorithms (Babu and Murty, 1993) or most recently variable neighborhood search or VNS (du Merle et al., 2000; Hansen and Mladenovic, 2001).

A series of algorithms that is able to obtain good solutions in short times is proposed for this problem. Initially, a genetic algorithm is designed using local search methods, thus becoming a memetic algorithm. A simple procedure based on a TS method using binary trees is also suggested. This method demonstrates its capacity to improve solutions in very few iterations. The incorporation of this procedure into memetic algorithms yields hybrid algorithms. Finally, these memetic and hybrid algorithms are analyzed and compared with other techniques. In all cases, the proposed techniques give adequate solutions, compared with other techniques, in reasonable time, especially with small values of m.

This paper is structured as follows: in the next section some of the existing local search algorithms are described. In Section 3, the TS method is presented. Section 4 considers the genetic algorithm in detail. In Section 5, memetic and hybrid algorithms are described. Section 6 presents the results of different computational experiments and compares the effectiveness and efficiency of the proposed algorithms with other recent techniques. Finally, Section 7 summarizes the contribution.

Section snippets

Principal local search algorithms

Some important local search algorithms are described. These are the well-known H-means and K-means, their variants H-means+ and HK-means, as well as the more recent J-means and J-means+. In all cases, we begin by obtaining an arbitrary initial solution (or partition), (C₁,C₂,…,C_m).

Algorithm 1

H-means

Repeat:

(a) Calculate centroids $\bar{x}_i$, for i=1,…,m;
(b) reassign every point to its closest centroid (→ new clusters: each cluster is composed of the points assigned to the same centroid); until convergence (that is, no
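The two steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `h_means`, its parameters, and the handling of empty clusters (they are simply dropped) are assumptions, and because the stopping rule is cut off in the snippet above, the sketch assumes that convergence means no point changes cluster.

```python
import random


def h_means(points, m, labels=None, seed=0, max_iter=100):
    """H-means-style loop: (a) compute centroids, (b) reassign points."""
    rng = random.Random(seed)
    n, q = len(points), len(points[0])
    if labels is None:
        # arbitrary initial partition (assumption: uniform random labels)
        labels = [rng.randrange(m) for _ in range(n)]
    labels = list(labels)
    for _ in range(max_iter):
        # (a) centroid of every non-empty cluster
        sums = [[0.0] * q for _ in range(m)]
        counts = [0] * m
        for x, c in zip(points, labels):
            counts[c] += 1
            for d in range(q):
                sums[c][d] += x[d]
        cents = {i: [s / counts[i] for s in sums[i]]
                 for i in range(m) if counts[i]}
        # (b) reassign every point to its closest centroid
        new_labels = [
            min(cents, key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(x, cents[i])))
            for x in points
        ]
        if new_labels == labels:   # assumed stopping rule: nothing changed
            break
        labels = new_labels
    return labels


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(h_means(pts, 2, labels=[0, 0, 0, 1]))  # [0, 0, 1, 1]
print(h_means(pts, 2, labels=[0, 1, 0, 1]))  # [0, 1, 0, 1]
```

The second call shows why such procedures are usually combined with other strategies: the starting partition [0, 1, 0, 1] is already a fixed point of the loop even though its objective value is far from the best one on this data.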

Description of a basic algorithm

TS is a strategy proposed by Glover (1989, 1990). “Tabu Search is dramatically changing our possibilities of solving a host of combinatorial problems in different areas” (Glover and Laguna, 2002). This is a procedure that explores the solution space beyond the local optimum. Once a local optimum is reached, upward moves, that is, moves that worsen the solution, are allowed. Simultaneously, the most recent moves are marked as tabu during the following iterations to avoid cycling. Recent and
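The mechanism described above (accept the best available move even if it worsens the solution, and forbid recently reversed moves for a few iterations) can be sketched generically for clustering. This is not the binary-tree TS procedure of the paper, whose details are not shown in this snippet. The move type (reassigning one point to another cluster), the names `sse` and `tabu_search`, the `tenure` parameter, and the aspiration rule are all assumptions made for illustration.

```python
def sse(points, labels, m):
    """Sum of squared distances to cluster centroids (empty clusters ignored)."""
    q = len(points[0])
    sums = [[0.0] * q for _ in range(m)]
    counts = [0] * m
    for x, c in zip(points, labels):
        counts[c] += 1
        for d in range(q):
            sums[c][d] += x[d]
    return sum((x[d] - sums[c][d] / counts[c]) ** 2
               for x, c in zip(points, labels) for d in range(q))


def tabu_search(points, labels, m, iters=20, tenure=3):
    current = list(labels)
    best, best_val = list(current), sse(points, current, m)
    tabu_until = {}  # (point, cluster) -> last iteration at which the move is tabu
    for it in range(iters):
        candidate = None  # (value, point index, new cluster)
        for l in range(len(points)):
            old = current[l]
            for c in range(m):
                if c == old:
                    continue
                current[l] = c                    # try the move
                val = sse(points, current, m)
                current[l] = old                  # undo it
                is_tabu = tabu_until.get((l, c), -1) >= it
                # aspiration (assumed): a tabu move is allowed only if it
                # improves on the best value found so far
                if is_tabu and val >= best_val:
                    continue
                if candidate is None or val < candidate[0]:
                    candidate = (val, l, c)
        if candidate is None:                     # every move is tabu
            break
        val, l, c = candidate
        tabu_until[(l, current[l])] = it + tenure  # forbid moving l back
        current[l] = c                            # accepted even if worse
        if val < best_val:
            best, best_val = list(current), val
    return best, best_val


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
best, val = tabu_search(pts, [0, 1, 0, 1], 2)
print(best, val)
```

The sketch recomputes the objective from scratch for every candidate move, which is only reasonable for very small examples; an efficient implementation would update the objective incrementally.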

Genetic algorithm

According to Goldberg (1989), “Genetic Algorithms are search techniques based on the mechanics of natural selection and genetics”. These techniques are probably the best-known and most widespread evolutionary algorithms. They were originally conceived by John Holland and described in the classic monograph “Adaptation in Natural and Artificial Systems” (Holland, 1975). This text has had a great influence on the later development of these techniques since the mechanisms described in it have long since

Memetic and hybrid algorithms

Memetic algorithms are also population-based methods and have been demonstrated to be faster than Genetic Algorithms for certain classes of problems (Moscato and Laguna, 1996). In brief, they combine local search procedures with crossover or mutation operators; due to their structure some researchers have called them hybrid genetic algorithms, parallel genetic algorithms (PGAs) or genetic local search methods. The method is gaining wide acceptance particularly for the well-known problems of
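The combination of genetic operators with local search described above can be sketched as a small steady-state loop. This is a hedged illustration under simplifying assumptions, not the algorithm of the paper: the names `sse`, `local_search` and `memetic`, the uniform crossover on label vectors, the mutation rate `p_mut`, and the replace-the-worst rule are all chosen here for brevity. Note that cluster labels are arbitrary, so a label-wise crossover may mix incompatible labelings; the local search step repairs the child afterwards.

```python
import random


def _groups(points, labels):
    """Map each used cluster label to the list of its points."""
    groups = {}
    for x, c in zip(points, labels):
        groups.setdefault(c, []).append(x)
    return groups


def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for g in _groups(points, labels).values():
        cent = [sum(col) / len(g) for col in zip(*g)]
        total += sum((a - b) ** 2 for x in g for a, b in zip(x, cent))
    return total


def local_search(points, labels, max_iter=50):
    """H-means-style improvement: recompute centroids, reassign points."""
    labels = list(labels)
    for _ in range(max_iter):
        cents = {c: [sum(col) / len(g) for col in zip(*g)]
                 for c, g in _groups(points, labels).items()}
        new = [min(cents, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(x, cents[c])))
               for x in points]
        if new == labels:
            break
        labels = new
    return labels


def memetic(points, m, pop_size=6, generations=10, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # initial population: random partitions improved by local search
    pop = [local_search(points, [rng.randrange(m) for _ in range(n)])
           for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)                          # two parents
        child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
        child = [rng.randrange(m) if rng.random() < p_mut else c
                 for c in child]                           # mutation
        child = local_search(points, child)                # local improvement
        worst = max(range(pop_size), key=lambda i: sse(points, pop[i]))
        if sse(points, child) < sse(points, pop[worst]):
            pop[worst] = child                             # replace the worst
    return min(pop, key=lambda s: sse(points, s))


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
best = memetic(pts, 2)
print(best, sse(pts, best))
```

Replacing the worst individual only when the child is better keeps the population quality from degrading; other replacement and selection schemes are equally possible.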

Computational results

Next, the results of a set of computational experiments using the proposed algorithms are shown. For each case, the TSPLIB library file (Reinelt, 1991) with N=1060 points is used with different numbers of clusters, m=10,20,30,…,150. These test data were previously used in Hansen and Mladenovic (2001), where the best solution known for every value of m (except m=40) is reported. These were obtained on a SUN Ultra I System workstation with 10 min computation time. All the tests in the current work

Conclusions

Methods that give adequate solutions to the cluster design problem in short computation times have been proposed. These methods have obtained solutions equal to the best-known solutions for small values of m, and only slightly worse for higher values of m.

In the former case (m⩽50), the new methods proposed obtain the best overall solutions from among those included in the comparison. In the latter case, they are only surpassed by the VNS algorithm proposed by Hansen and Mladenovic (2001), with

Acknowledgements

The authors are grateful to the editor and two anonymous reviewers for helpful comments.

References (28)

K.H. Al-Sultan, 1995. A tabu search approach to the clustering problem. Pattern Recognition.
T.A. Feo et al., 1989. A probabilistic heuristic for a computationally difficult set covering problem. Oper. Res. Lett.
P. Hansen et al., 2001. J-means: a new local search heuristic for minimum sum-of-squares clustering. Pattern Recognition.
R.W. Klein et al., 1989. Experiments in projection and clustering by simulated annealing. Pattern Recognition.
N. Mladenovic et al., 1997. Variable neighborhood search. Comput. Oper. Res.
G.P. Babu et al., 1993. A near-optimal initial seed value selection in K-means algorithm using genetic algorithms. Pattern Recognition Lett.
Beltrán, M., Pacheco, J., 2001. Nuevos métodos para el diseño de cluster no jerárquicos. Una aplicación a los...
Brucker, P., 1978. On the Complexity of Clustering Problems. Lecture Notes in Economics and Mathematical Systems, Vol....
Cano, F.J., 1999. Análisis de clusters o de conglomerados. I Jornadas de Matemáticas. Burgos, Octubre...
G. Diehr, 1985. Evaluation of a branch and bound algorithm for clustering. SIAM J. Sci. Statist. Comput.
O. du Merle et al., 2000. An interior point algorithm for minimum sum of squares clustering. SIAM J. Sci. Comput.
T.A. Feo et al., 1995. Greedy randomized adaptive search procedures. J. Global Optim.
F. Glover, 1989. Tabu search: Part I. ORSA J. Comput.
F. Glover, 1990. Tabu search: Part II. ORSA J. Comput.

