Elsevier

Expert Systems with Applications

Volume 55, 15 August 2016, Pages 184-193

Multiobjective clustering analysis using particle swarm optimization

https://doi.org/10.1016/j.eswa.2016.02.009

Highlights

  • A multiobjective clustering method based on particle swarm optimization is proposed.

  • Two objective functions are used to measure cohesion and connectivity of clusters.

  • Able to adaptively find the optimal number of clusters.

  • Tested on 27 benchmark datasets in terms of accuracy and robustness.

  • The system outperformed four state-of-the-art clustering algorithms in most cases.

Abstract

Clustering is a significant data mining task that partitions a dataset according to similarities among its data. This technique plays a very important role in the rapidly growing field known as exploratory data analysis. A key difficulty of effective clustering is to define proper grouping criteria that reflect fundamentally different aspects of a good clustering solution, such as compactness and separation of clusters. Moreover, conventional clustering algorithms consider only a single criterion, which may not conform to the diverse and complex shapes of the underlying clusters. In this study, partitional clustering is defined as a multiobjective optimization problem. The aim is to obtain well-separated, connected, and compact clusters, and for this purpose two objective functions have been defined based on the concepts of data connectivity and cohesion. These functions are the core of an efficient multiobjective particle swarm optimization algorithm, which has been devised for and applied to the automatic grouping of large unlabeled datasets. A comprehensive experimental study is conducted and the obtained results are compared with those of four other state-of-the-art clustering techniques. It is shown that the proposed algorithm can achieve the optimal number of clusters, is robust, and outperforms, in most cases, the other methods on the selected benchmark datasets.

Introduction

It is well known that huge amounts of data are currently being collected and stored in databases, and that this quantity continues to grow rapidly. Valuable information, still hidden in these data, should be revealed to improve decision-making in organizations. Data mining encompasses the methodologies that apply data analysis techniques to discover previously unknown, valid patterns and relationships in large datasets. These methods include a number of technical approaches, such as classification, data summarization, dependency network finding, regression, anomaly detection, and clustering (Han & Kamber, 2000). Clustering is the process of partitioning data into groups such that data in the same group are similar, while data from different groups are dissimilar. Data clustering has roots in several areas, including data mining, machine learning, biology, and statistics (Cheng, Yang, & Cao, 2013; Kao, Zahara, & Kao, 2008; Leung, Zhang, & Xu, 2000; Nguyen & Cios, 2008; Qiu, Xu, Gao, Li, & Chi, 2016; Saha, Alok, & Ekbal, 2016; Sahoo, Zuo, & Tiwari, 2012; Thong et al., 2015).

Generally speaking, most existing clustering methods fall into two families: hierarchical and partitional. Hierarchical clustering produces a tree in which each internal node embodies other nodes (i.e., its children), down to the leaves (Leung et al., 2000). Hierarchical clustering algorithms do not need to know the number of clusters in advance and are independent of the initial conditions. On the other hand, they are typically “greedy”: objects assigned to a cluster cannot be reassigned to other clusters later in the clustering process. Moreover, lacking information about the global shape or size of the clusters, these algorithms may be unable to separate overlapping clusters (Jain, Murty, & Flynn, 1999). Partitional clustering, in contrast, typically decomposes a dataset into a set of disjoint clusters. Many partitional clustering algorithms try to minimize some measure of dissimilarity among objects in the same cluster while maximizing the dissimilarity among objects in different clusters. Summarizing, the main drawbacks of hierarchical algorithms usually become advantages of partitional algorithms, and vice versa (Frigui & Krishnapuram, 1999).

Swarm intelligence (SI) is an innovative subcategory of artificial intelligence, inspired by the intelligent behavior of insect or animal groups in nature, including ant colonies, bird flocks, fish schools, bee colonies, and bacterial swarms (Kennedy & Eberhart, 2001). In recent years, SI methods such as swarm-based clustering algorithms have been successfully used to deal with clustering problems (Abraham, Das, & Roy, 2008; Bharne, Gulhane, & Yewale, 2011; Das, Abraham, & Konar, 2008; Grosan, Abraham, & Chis, 2006; Jiang, Li, Yi, Wang, & Hu, 2011; Omran, Salman, & Engelbrecht, 2006). For this reason, the research community has recently given them special attention, mainly because swarm-based approaches are particularly suited to exploratory analysis and because many issues in this field are still open (Abraham et al., 2008).

In this paper, we confine ourselves to the application of particle swarm optimization (PSO) to clustering. Like other SI methods, PSO is inspired by a phenomenon that occurs in nature, namely the social behavior of bird flocking or fish schooling (Poli, Kennedy, & Blackwell, 2007). Two PSO-based clustering methods are reported in Rana, Jasola, and Kumar (2011): the first finds the centroids for a user-specified number of clusters, and the second extends PSO with K-means (used to seed the initial swarm). The latter algorithm is shown to have better convergence than the classical version of K-means. Yang et al. propose a hybrid clustering algorithm (PSOKHM) based on PSO and K-harmonic means (KHM) (Yang, Sun, & Zhang, 2009). They show that PSOKHM increases the convergence speed of PSO, is capable of escaping from local optima, and performs better than both PSO and KHM clustering on seven datasets. A multiobjective PSO and simulated annealing clustering algorithm (MOPSOSA) is proposed in Abubaker, Baharum, and Alrefaei (2015). This method simultaneously optimizes three different objective functions, used as cluster validity indexes for finding the proper number of clusters (and the clusters themselves) for the given dataset; Euclidean distance, point symmetry, and short distances are the validity indexes considered in MOPSOSA. The method obtains more promising results than other conventional clustering algorithms. Several other PSO-based clustering algorithms have been proposed in the literature (for a comprehensive review of PSO-based clustering, the interested reader may consult Cura, 2012; Izakian & Abraham, 2011; Kalyani & Swarup, 2011; Sarkar, Roy, & Purkayastha, 2013; Tsai & Kao, 2011). However, they mostly consider a single function as the objective of the clustering problem and, to the best of our knowledge, recent works on multiobjective clustering do not apply the concept of Pareto-optimal solutions (Kasprzak & Lewis, 2001).
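For reference, the canonical PSO position and velocity update that the methods above build on can be sketched as follows. This is a generic minimal sketch; the inertia weight `w` and acceleration coefficients `c1`, `c2` are illustrative defaults, not the parameter settings of any cited algorithm:

```python
import random

def pso_step(position, velocity, personal_best, global_best,
             w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO update (Kennedy & Eberhart): each particle is
    pulled toward its own best-known position and the swarm's best."""
    new_velocity = []
    new_position = []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        r1, r2 = random.random(), random.random()  # stochastic weights
        nv = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        new_velocity.append(nv)
        new_position.append(x + nv)
    return new_position, new_velocity
```

In a clustering setting, a particle's position typically encodes a candidate set of cluster centroids, and the "best" positions are those with the best objective values found so far.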

In this paper, a multiobjective clustering particle swarm optimization (MCPSO, hereinafter) framework is proposed, which obtains well-separated, connected, and compact clusters, regardless of the expected optimal number of clusters and their characteristics. MCPSO is also able to automatically determine the optimal number of clusters. To achieve these goals, two conflicting objective functions are defined, based on the concepts of connectivity and cohesion, and MCPSO uses them to find a set of non-dominated clustering solutions, called the Pareto front. A simple decision maker is then used to select the best solution among the Pareto solutions. The performance of MCPSO has also been compared against that of four state-of-the-art clustering algorithms. As the selected datasets are in fact labeled, we have been able to measure the average “accuracy” of the clusters, assuming that each cluster actually accounts for a unique label. The accuracy measured on the clustering results, together with the required computational time, is used as the performance metric in the comparative analysis.
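The accuracy computation itself is not reproduced in this excerpt. Under the stated assumption that each cluster accounts for a unique label, a common instantiation maps each cluster to its majority ground-truth label (i.e., cluster purity); the following is a sketch of that assumed metric, not necessarily the paper's exact formula:

```python
from collections import Counter

def clustering_accuracy(cluster_ids, true_labels):
    """Purity-style accuracy: map each cluster to its majority
    ground-truth label, then count how many points agree with the
    label of their own cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(y)
    # For each cluster, the count of its most frequent true label.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_cluster.values())
    return correct / len(true_labels)
```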

The rest of this paper is organized as follows. Section 2 defines swarm intelligence and multiobjective optimization. The proposed MCPSO algorithm and the clustering objective functions are described in detail in Section 3. A comprehensive set of experimental results is provided in Section 4. Section 5 draws conclusions.

Section snippets

Multiobjective optimization and swarm intelligence

In the area of metaheuristics, swarm intelligence (SI) belongs to the group of approaches that exploit the self-organized and decentralized characteristics of natural or artificial phenomena to deal with complex optimization problems. In particular, the behavior of natural individuals that interact with each other and with their environment plays a significant role in the design of SI algorithms. Many such algorithms have been introduced in recent years and have been successfully applied to different …
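The multiobjective machinery referred to throughout rests on Pareto dominance: one solution dominates another if it is no worse in every objective and strictly better in at least one, and the non-dominated solutions form the Pareto front. A minimal sketch, assuming minimization of all objectives (generic code, not the paper's implementation):

```python
def dominates(f1, f2):
    """True if f1 Pareto-dominates f2 under minimization:
    no worse in every objective, strictly better in at least one."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))

def pareto_front(solutions):
    """Filter a list of objective-value tuples down to the
    non-dominated set (the Pareto front)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]
```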

Multiobjective clustering with particle swarm optimization

In this section, we describe the MCPSO method. As already pointed out, it is based on the particle swarm optimization algorithm (Kennedy & Eberhart, 2001), in a multiobjective setting. MCPSO consists of two main phases: optimization and decision making. Two conflicting objective functions are defined, based on connectivity and cohesion, with the aim of obtaining well-separated, compact, and connected clusters. The optimization phase results in a set of optimal solutions for the given clustering …
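The exact definitions of the two objective functions appear in Section 3 and are not included in this snippet. A plausible sketch, assuming cohesion is measured as overall intra-cluster distance and connectivity follows the common nearest-neighbor formulation (both minimized; the pair conflicts because cohesion favors many tight clusters while connectivity penalizes splitting neighboring points apart):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cohesion(data, assign, centroids):
    """Overall intra-cluster distance: sum of each point's distance
    to its cluster centroid (lower = more compact clusters)."""
    return sum(euclidean(p, centroids[c]) for p, c in zip(data, assign))

def connectivity(data, assign, L=3):
    """Nearest-neighbor connectivity: penalize each point whose L
    nearest neighbors fall in a different cluster, with closer
    neighbors weighted more heavily (lower = better connected)."""
    penalty = 0.0
    for i, p in enumerate(data):
        neighbours = sorted((j for j in range(len(data)) if j != i),
                            key=lambda j: euclidean(p, data[j]))[:L]
        penalty += sum(1.0 / (rank + 1)
                       for rank, j in enumerate(neighbours)
                       if assign[j] != assign[i])
    return penalty
```

On a dataset with two well-separated groups, a correct assignment yields zero connectivity penalty, while an assignment that splits neighbors across clusters is penalized.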

Experimental results and discussion

In this section, we empirically evaluate the performance of MCPSO. After a set of experiments aimed at finding a preliminary setting for the MCPSO parameters (by using some pilot datasets), the performance of the proposed algorithm has been compared with that of other clustering algorithms over a set of known benchmarks. MCPSO has been implemented in Python 2.7.6 on an Intel Core i7 (2.4 GHz, 8 GB RAM) under Ubuntu 14.04.

Conclusions

Clustering is one of the key tasks of exploratory data mining and a subject of active research in several fields, including finance, information retrieval, network management, biology, and medicine. These fields need accurate grouping of huge datasets that may come with a variety of features and/or data characteristics. Swarm intelligence (SI) is a relatively new interdisciplinary field of research, which has gained huge popularity in the data mining area. SI methodologies, such as …

References (60)

  • Kao, Y.-T., et al.

    A hybridized approach to data clustering

    Expert Systems with Applications

    (2008)
  • Nguyen, C.D., et al.

    GAKREM: A novel hybrid clustering algorithm

    Information Sciences

    (2008)
  • Omkar, S., et al.

    Artificial bee colony (ABC) for multi-objective design optimization of composite structures

    Applied Soft Computing

    (2011)
  • Qiu, H., et al.

    Multi-stage design space reduction and metamodeling optimization method based on self-organizing maps and fuzzy clustering

    Expert Systems with Applications

    (2016)
  • Rodger, J.A.

    Application of a fuzzy feasibility Bayesian probabilistic estimation of supply chain backorder aging, unfilled backorders, and customer wait time using stochastic simulation with Markov blankets

    Expert Systems with Applications

    (2014)
  • Saha, S., et al.

    Brain image segmentation using semi-supervised clustering

    Expert Systems with Applications

    (2016)
  • Sahoo, A.K., et al.

    A data clustering algorithm for stratified data partitioning in artificial neural network

    Expert Systems with Applications

    (2012)
  • Silva Filho, T.M., et al.

    Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization

    Expert Systems with Applications

    (2015)
  • Thong, N.T.

    HIFCF: An effective hybrid model between picture fuzzy clustering and intuitionistic fuzzy recommender systems for medical diagnosis

    Expert Systems with Applications

    (2015)
  • Tsai, C.-Y., et al.

    Particle swarm optimization with selective particle regeneration for data clustering

    Expert Systems with Applications

    (2011)
  • Yang, F., et al.

    An efficient hybrid data clustering method based on K-harmonic means and particle swarm optimization

    Expert Systems with Applications

    (2009)
  • Abraham, A., et al.

    Swarm intelligence algorithms for data clustering

    Proceedings of the soft computing for knowledge discovery and data mining

    (2008)
  • Abubaker, A., et al.

    Automatic clustering using multi-objective particle swarm and simulated annealing

    PLOS ONE

    (2015)
  • Angus, D., et al.

    Multiple objective ant colony optimisation

    Swarm Intelligence

    (2009)
  • Bache, K., et al.

    UCI machine learning repository

    (2013)
  • Bharne, P.K., et al.

    Data clustering algorithms based on swarm intelligence

    Proceedings of the 3rd international conference on electronics computer technology (ICECT)

    (2011)
  • Chou, C.-H., et al.

    A new cluster validity measure and its application to image compression

    Pattern Analysis and Applications

    (2004)
  • Coello, C.A.C., et al.

    Solving multiobjective optimization problems using an artificial immune system

    Genetic Programming and Evolvable Machines

    (2005)
  • Coello, C.A.C., et al.

    MOPSO: A proposal for multiple objective particle swarm optimization

    Proceedings of the congress on evolutionary computation, CEC’02

    (2002)
  • Das, S., et al.

    Automatic clustering using an improved differential evolution algorithm

    IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans

    (2008)