Design of hybrids for the minimum sum-of-squares clustering problem

https://doi.org/10.1016/S0167-9473(02)00224-4

Abstract

A series of metaheuristic algorithms is proposed and analyzed for the non-hierarchical clustering problem under the criterion of minimum sum-of-squares clustering. These algorithms incorporate genetic operators and local search and tabu search procedures. The aim is to obtain quality solutions with short computation times. A series of computational experiments has been performed. The proposed algorithms obtain better results than previously reported methods, especially with a small number of clusters.

Introduction

Consider a set $X=\{x_1,x_2,\ldots,x_N\}$ of $N$ points in $\mathbb{R}^q$ and let $m$ be a predetermined positive integer. The minimum sum-of-squares clustering (MSSC) problem is to find a partition of $X$ into $m$ disjoint subsets (clusters) so that the sum of squared distances from each point to the centroid of its cluster is minimum. Specifically, let $\mathcal{P}_m$ denote the set of all partitions of $X$ into $m$ sets, where each partition $P\in\mathcal{P}_m$ is defined as $P=(C_1,C_2,\ldots,C_m)$ and where $C_i$ denotes each of the clusters that form $P$. Thus, the problem can be expressed as

$$\min_{P\in\mathcal{P}_m}\ \sum_{i=1}^{m}\ \sum_{x_l\in C_i}\|x_l-\bar{x}_i\|^2,$$

where the centroid $\bar{x}_i$ is defined as

$$\bar{x}_i=\frac{1}{n_i}\sum_{x_l\in C_i}x_l\quad\text{with}\quad n_i=|C_i|.$$

Equivalently, the problem can be written as

$$\min\ \sum_{l=1}^{N}\|x_l-\bar{x}_{c(l)}\|^2,$$

where $c(l)$ is the cluster to which point $x_l$ belongs.
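As a minimal sketch of the second formulation, the objective can be evaluated directly from a list of points and a list of cluster labels c(l). The function names `centroids` and `mssc_objective` and the four-point data set are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def centroids(points, labels):
    """Return {cluster: centroid}, each centroid being the mean of its cluster's points."""
    groups = defaultdict(list)
    for x, c in zip(points, labels):
        groups[c].append(x)
    return {c: tuple(sum(col) / len(members) for col in zip(*members))
            for c, members in groups.items()}

def mssc_objective(points, labels):
    """Sum over all points of the squared distance to the centroid of cluster c(l)."""
    cents = centroids(points, labels)
    return sum(sum((a - b) ** 2 for a, b in zip(x, cents[c]))
               for x, c in zip(points, labels))

# Hypothetical toy data: two tight pairs of points, far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = mssc_objective(points, [0, 0, 1, 1])  # each pair forms its own cluster
bad = mssc_objective(points, [0, 1, 0, 1])   # each cluster mixes the two pairs
print(good, bad)
```

On this toy data the partition that keeps each pair together has a much smaller objective value than the one that splits the pairs, which is the behaviour the criterion is meant to reward.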

The design of clusters is a well-known exploratory data analysis issue called pattern recognition. The aim is to find whether a given set of cases X has some structure and, if so, to display it in the form of a partition. This problem belongs to the area of non-hierarchical cluster design, which has many applications in economics and in the social and natural sciences. It is known to be NP-hard (Brucker, 1978).

Various exact methods for MSSC can be found in the literature (see, for example, Koontz et al., 1975; Diehr, 1985), some of which, such as the method proposed by du Merle et al. (2000), have succeeded in solving problems with up to 150 points. For larger problems the use of heuristic algorithms is still necessary. The most popular are those based on local search methods, such as the well-known K-means (Jancey, 1966) and H-means (Howard, 1966) procedures. In a recent work, Hansen and Mladenovic (2001) propose a new local search procedure, J-means, along with the variants H-means+ and HK-means. In recent years algorithms using metaheuristic strategies have also been designed, such as simulated annealing (Klein and Dubes, 1989), tabu search (TS) (Al-Sultan, 1995), genetic algorithms (Babu and Murty, 1993) and, most recently, variable neighborhood search (VNS) (du Merle et al., 2000; Hansen and Mladenovic, 2001).

This paper proposes a series of algorithms that obtain good solutions to this problem in short computation times. Initially, a genetic algorithm is designed using local search methods, thus becoming a memetic algorithm. A simple procedure based on a TS method using binary trees is also suggested. This method demonstrates its capacity to improve solutions in very few iterations. The incorporation of this procedure into memetic algorithms yields hybrid algorithms. Finally, these memetic and hybrid algorithms are analyzed and compared with other techniques. In all cases, the proposed techniques give adequate solutions in reasonable time, especially for small values of m.

This paper is structured as follows: in the next section some of the existing local search algorithms are described. In Section 3, the TS method is presented. Section 4 considers the genetic algorithm in detail. In Section 5, memetic and hybrid algorithms are described. Section 6 presents the results of different computational experiments and compares the effectiveness and efficiency of the proposed algorithms with other recent techniques. Finally, Section 7 summarizes the contribution.

Section snippets

Principal local search algorithms

Some important local search algorithms are described. These are the well-known H-means and K-means, their variants H-means+ and HK-means, as well as the more recent J-means and J-means+. In all cases, we begin by obtaining an arbitrary initial solution (or partition), (C1,C2,…,Cm).

Algorithm 1

H-means

Repeat:

  • (a) Calculate centroids x̄i, for i=1,…,m;

  • (b) Reassign every point to its closest centroid (→ new clusters: each cluster is composed of the points assigned to the same centroid);

until convergence (that is, no
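Steps (a) and (b) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `sq_dist` and `h_means`, the `max_iter` cap, the rule of ignoring clusters that become empty, and the stopping test (no label changes between two passes) are all assumptions made for the example.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def h_means(points, labels, m, max_iter=100):
    """Alternate (a) centroid computation and (b) reassignment until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        # (a) centroids of the current clusters
        sums = [[0.0] * len(points[0]) for _ in range(m)]
        counts = [0] * m
        for x, c in zip(points, labels):
            counts[c] += 1
            for j, v in enumerate(x):
                sums[c][j] += v
        # Assumption: a cluster that has lost all its points is simply dropped.
        cents = {c: [s / counts[c] for s in sums[c]] for c in range(m) if counts[c] > 0}
        # (b) reassign every point to its closest centroid
        new_labels = [min(cents, key=lambda c: sq_dist(x, cents[c])) for x in points]
        if new_labels == labels:
            break  # assumed convergence test: no point changed cluster
        labels = new_labels
    return labels

# Hypothetical toy data and two arbitrary initial partitions.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
from_uneven_start = h_means(points, [0, 0, 0, 1], m=2)
from_mixed_start = h_means(points, [0, 1, 0, 1], m=2)
print(from_uneven_start, from_mixed_start)
```

In this toy run the first start reaches the partition that separates the two pairs, while the second start is already a fixed point of the procedure and is returned unchanged. The outcome of a local search of this kind therefore depends on the initial partition.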

Description of a basic algorithm

TS is a strategy proposed by Glover (1989, 1990). “Tabu Search is dramatically changing our possibilities of solving a host of combinatorial problems in different areas” (Glover and Laguna, 2002). This is a procedure that explores the solution space beyond the local optimum. Once a local optimum is reached, upward moves, that is, moves that worsen the solution, are allowed. Simultaneously, the most recent moves are marked as tabu during the following iterations to avoid cycling. Recent and
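The general mechanism described above can be illustrated with a generic tabu search over single-point reassignments. This is a sketch under stated assumptions and is not the binary-tree procedure of the paper: the move type, the `tenure` and `iterations` parameters, the aspiration rule, the non-empty-cluster rule, and the names `objective` and `tabu_search` are all hypothetical choices for the example.

```python
def objective(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for c in set(labels):
        members = [x for x, l in zip(points, labels) if l == c]
        cent = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(x, cent)) for x in members)
    return total

def tabu_search(points, labels, m, tenure=2, iterations=20):
    current = list(labels)
    best, best_val = list(current), objective(points, current)
    tabu_until = {}  # (point index, cluster) -> last iteration at which that move is forbidden
    for it in range(iterations):
        candidate = None  # (value, point index, new cluster) of the best admissible move
        for i, old in enumerate(current):
            if current.count(old) == 1:
                continue  # assumption: never empty a cluster
            for c in range(m):
                if c == old:
                    continue
                current[i] = c
                val = objective(points, current)
                current[i] = old
                is_tabu = tabu_until.get((i, c), -1) >= it
                # Aspiration: a tabu move is accepted only if it beats the best solution so far.
                if is_tabu and val >= best_val:
                    continue
                if candidate is None or val < candidate[0]:
                    candidate = (val, i, c)
        if candidate is None:
            break
        val, i, c = candidate  # taken even when it worsens the current solution
        tabu_until[(i, current[i])] = it + tenure  # forbid moving point i straight back
        current[i] = c
        if val < best_val:
            best, best_val = list(current), val
    return best, best_val

# Hypothetical toy data, started from a partition that mixes the two pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
best, best_val = tabu_search(points, [0, 1, 0, 1], m=2)
print(best, best_val)
```

The best admissible move is applied at every iteration whether or not it improves the current solution, and the best partition seen so far is stored separately, so later worsening moves do not lose it.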

Genetic algorithm

According to Goldberg (1989) “Genetic Algorithms are search techniques based on the mechanics of natural selection and genetics”. These techniques are probably the best-known and widespread evolutionary algorithms. They were originally conceived by John Holland and described in the classic monograph “Adaptation in Natural and Artificial Systems” (Holland, 1975). This text has had a great influence on the later development of these techniques since the mechanisms described in it have long since

Memetic and hybrid algorithms

Memetic algorithms are also population-based methods and have been demonstrated to be faster than genetic algorithms for certain classes of problems (Moscato and Laguna, 1996). In brief, they combine local search procedures with crossover or mutation operators; because of this structure, some researchers have called them hybrid genetic algorithms, parallel genetic algorithms (PGAs) or genetic local search methods. The method is gaining wide acceptance particularly for the well-known problems of

Computational results

Next, the results of a set of computational experiments using the proposed algorithms are shown. In every case, the TSPLIB library file (Reinelt, 1991) with N=1060 points is used with different numbers of clusters, m=10,20,30,…,150. These test data were previously used in Hansen and Mladenovic (2001), where the best solution known for every value of m (except m=40) is reported. These were obtained on a SUN Ultra I System workstation with 10 min computation time. All the tests in the current work

Conclusions

Methods that give adequate solutions to the cluster design problem in short time have been proposed. The methods proposed have obtained solutions equal to the best-known solutions for small values of m, and only slightly worse for higher values of m.

In the former case, (m⩽50), the new methods proposed obtain the best overall solutions from among those included in the comparison. In the latter case, they are only surpassed by the VNS algorithm proposed by Hansen and Mladenovic (2001), with

Acknowledgements

The authors are grateful to the editor and two anonymous reviewers for helpful comments.

References (28)

  • O. du Merle et al. An interior point algorithm for minimum sum of squares clustering. SIAM J. Sci. Comput. (2000)
  • T.A. Feo et al. Greedy randomized adaptive search procedures. J. Global Optim. (1995)
  • F. Glover. Tabu search: Part I. ORSA J. Comput. (1989)
  • F. Glover. Tabu search: Part II. ORSA J. Comput. (1990)