Elsevier

Expert Systems with Applications

Volume 117, 1 March 2019, Pages 243-266

A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm

https://doi.org/10.1016/j.eswa.2018.09.050

Highlights

  • Improving results by combining random and homogeneity-based merge-split methods.

  • Reducing the number of empty clusters by considering data density when selecting centers.

  • Avoiding local optima through the proposed assimilation step.

  • State-of-the-art accuracy in solving automatic clustering problems.

Abstract

Cluster analysis has wide application in many areas, including pattern recognition, information retrieval, and image processing. Most clustering algorithms require the number of clusters to be predetermined, yet in real-world problems this number is rarely known in advance. Automatic clustering is a promising solution to this challenge, as it automatically determines the number and structure of clusters in the data. In recent years, evolutionary algorithms have become popular for solving automatic clustering problems because of their global search mechanisms. The Imperialist Competitive Algorithm (ICA) is a successful evolutionary algorithm. In this paper, ICA is used for the first time to solve automatic clustering problems; the resulting method is called automatic clustering using ICA (AC-ICA). In the proposed algorithm, the movement of colonies toward the imperialist in the assimilation step is modified to increase exploration ability. A new method for changing the number of clusters is introduced, combining a random and a homogeneity-based merge-split approach, and an efficient density-based method is proposed for reinitializing empty cluster centers. To apply ICA to automatic clustering, the initialization and imperialist competition steps are also modified; based on the changes to these two steps, a framework is provided for converting different ICA variants into automatic clustering methods. The basic ICA and three of its recently developed variants were converted with this framework, and their performance in automatic clustering was compared with AC-ICA. Experiments were conducted on six synthetic and ten real-world data sets. Comparing the proposed algorithm with the basic ICA, its three recent variants, and several state-of-the-art automatic clustering methods shows the superiority of AC-ICA in terms of both convergence speed and solution quality. We also applied the algorithm to a real-world application (face recognition) and obtained acceptable results.

Introduction

Clustering, a popular technique in data analysis and data mining, partitions a set of unlabeled data into several groups (or clusters) so that data in the same cluster are as similar as possible to each other and as different as possible from data in other clusters (Armano and Farmani, 2016, Panagiotakis, 2015). Cluster analysis has wide application in many areas, including pattern recognition (Kalhori and Zarandi, 2015, Liu et al., 2016), image processing (Rodriguez and Laio, 2014, Saha et al., 2016, Thong, 2015), web mining (Forsati et al., 2015, Huang et al., 2014), compression (Hejrati, Fathi, & Abdali-Mohammadi, 2017), and information retrieval (Bordogna and Pasi, 2012, Chifu et al., 2015). Many clustering methods have been proposed, and useful reviews of these methods can be found in (Hruschka et al., 2009, Saxena et al., 2017, Xu and Wunsch, 2005). Existing algorithms are usually divided into two groups: hierarchical clustering and partitional clustering.

In hierarchical clustering, data are arranged in a hierarchical tree structure based on the similarity between data points. In these methods, once a data point has been assigned to a cluster in the initial steps of clustering, it cannot be reassigned to another cluster; the formation of clusters is therefore static, and the overall shape and size of clusters are ignored. Partitional clustering, on the other hand, attempts to divide the dataset directly into a set of disjoint clusters so that intra-cluster dissimilarity is small and inter-cluster dissimilarity is large. Partitional clustering algorithms assume that the number of clusters in a dataset is predetermined, whereas in many real-world problems the number of clusters is not known in advance. Under such conditions, automatically determining an appropriate number of clusters and producing a proper partition of the dataset is a key challenge in this area. Automatic clustering is a promising solution to this challenge, as it automatically determines the number and structure of clusters in a dataset (Kuo, Huang, Lin, Wu, & Zulvia, 2014). Automatic clustering is difficult for datasets of high dimensionality and massive volume, especially when clusters differ greatly in shape, size, and density, or when groups overlap (José-García & Gómez-Flores, 2016).

In recent years, many efforts have been made to develop automatic clustering methods; comprehensive reviews can be found in (Hancer and Karaboga, 2017, José-García and Gómez-Flores, 2016, Mirkin, 2011). Hancer and Karaboga (2017) divided automatic clustering methods into three groups: traditional, merge-split based, and evolutionary computation (EC) based approaches. In traditional approaches, a cluster validity index is chosen and a traditional clustering algorithm is run successively for every possible number of clusters in order to find the clustering with the best validity value. This is tedious and computationally expensive, and many validity indices work well only when their assumptions about cluster structure hold (Tan, Ting, & Teng, 2011). Merge-split based approaches merge and split clusters in a dataset according to predetermined criteria. EC-based approaches use evolutionary algorithms for automatic clustering.
In EC-based approaches, clustering is treated as an optimization problem that minimizes intra-cluster dissimilarity and maximizes inter-cluster dissimilarity (Kuo et al., 2014). EC-based approaches outperform traditional and merge-split based approaches because they obtain the correct number of clusters together with high-quality clusterings; their global search mechanisms allow them to find better solutions, making them more robust than the other two groups (Hancer & Karaboga, 2017). In recent years, EC-based approaches such as the genetic algorithm (GA) (Tseng & Yang, 2001), particle swarm optimization (PSO) (Omran, Salman, & Engelbrecht, 2006), differential evolution (DE) (Das & Konar, 2009), bee colony optimization (BCO) (Kuo et al., 2014), and improved versions of some of these algorithms (Ali, 2016, Das et al., 2008a, Ling et al., 2016, Ozturk et al., 2015, Sheng et al., 2016) have been used for automatic clustering.
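To make the EC-based formulation concrete, the following is a minimal sketch in Python (not the exact objective used in this paper) of how such methods typically score a candidate solution: the candidate encodes a set of cluster centers, and the fitness rewards compact, well-separated clusters.

import numpy as np

def fitness(data, centers):
    """Lower is better: mean intra-cluster distance divided by the minimum
    distance between distinct centers (a simple compactness/separation ratio)."""
    # Assign every point to its nearest candidate center.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    compactness = dists[np.arange(len(data)), labels].mean()
    # Separation: smallest pairwise distance between the candidate centers.
    separation = min(np.linalg.norm(a - b)
                     for i, a in enumerate(centers)
                     for b in centers[i + 1:])
    return compactness / separation

# Example: evaluate a random three-center candidate on toy two-dimensional data.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
candidate = rng.normal(size=(3, 2))
print(fitness(data, candidate))

An evolutionary algorithm then evolves a population of such candidates, possibly with varying numbers of centers, toward better fitness values.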

The imperialist competitive algorithm (ICA) is an evolutionary optimization algorithm that simulates the social and political behavior of imperialist countries attempting to dominate weaker countries. The algorithm was proposed by Atashpaz-Gargari and Lucas in 2007. In recent years, ICA and its improved variants have obtained successful results in solving practical and numerical optimization problems (Ardeh et al., 2017, Niknam et al., 2011, Xu et al., 2017, Aliniya and Keyvanpour, 2018b, Aliniya and Keyvanpour, 2018a). The most important advantage of ICA over other evolutionary optimization methods is its high convergence rate, which allows it to obtain significant results in a shorter time (Xu et al., 2017). An improved variant of ICA called hybrid K-MICA was presented by Niknam et al. in 2011 for solving clustering problems. Experiments showed that hybrid K-MICA is an efficient metaheuristic for finding optimal or near-optimal solutions to clustering problems: it is competitive with other evolutionary methods such as PSO, GA, and ACO in terms of solution quality and superior to them in terms of convergence rate. However, in this algorithm, the number of clusters must still be predetermined.

Motivated by the issues above, this paper proposes a new algorithm, automatic clustering using ICA (AC-ICA), for finding the optimal number of clusters. To the best of our knowledge, this is the first application of ICA to automatic clustering problems. AC-ICA simultaneously finds the number of clusters and the corresponding partition. In the proposed algorithm, the movement of colonies toward the imperialist in the assimilation step is modified, which appropriately reinforces the ability to explore the solution space. Furthermore, a new method for changing the number of centers in solutions is proposed, and an efficient method is introduced for reinitializing empty cluster centers. In addition, the initialization and imperialist competition steps are modified to adapt ICA to automatic clustering; based on the changes to these two steps, a framework is proposed for converting different ICA variants into efficient automatic clustering methods. With this framework, the basic ICA and three recently developed variants, EXPLICA (Ardeh et al., 2017), hybrid K-MICA (Niknam et al., 2011), and IICA-G (Xu et al., 2017), were converted and their performance in automatic clustering was compared with AC-ICA. The Taguchi design approach (Al Khaled & Hosseini, 2015) was used to calibrate the parameters of the proposed algorithm. Comparing the results of AC-ICA with the basic ICA, its three recent variants, and several state-of-the-art automatic clustering methods shows the success of the proposed algorithm. Finally, AC-ICA was applied to a real application (face recognition) and achieved acceptable results. The rest of the paper is organized as follows:

In the next section, automatic clustering techniques are briefly reviewed. In Section 3, automatic clustering approaches are compared. In Section 4, the basic ICA is described. The motivation and mathematical foundations of the proposed algorithm are provided in Section 5. Synthetic and real-world datasets, experimental setups, and experimental results are reported in Section 6, which also presents the results of AC-ICA on a face recognition task. Finally, Section 7 concludes the paper and outlines directions for future research.

Section snippets

Related work

As already mentioned, compared to traditional and merge-split based approaches, EC-based approaches generally perform well in obtaining the correct number of clusters and high clustering quality. In this section, some successful EC-based methods are reviewed.

In the late 1990s, the automatic clustering problem gave rise to a new era in cluster analysis with the application of nature-inspired metaheuristics. The earliest attempt at automatic clustering based on GA was done by Tseng and

Comparison of automatic clustering approaches

As already mentioned, Hancer and Karaboga (2017) divided automatic clustering methods into three groups: traditional, merge-split based, and evolutionary computation (EC) based approaches. Fig. 1 shows the overall structure of these three approaches. In traditional approaches, first an internal validity index is chosen and a range for the number of clusters [Kmin, Kmax] is defined. Then, the clustering algorithm is run for every number of clusters in this range. The index value of
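The truncated description above corresponds to the standard "run the clusterer for every K and keep the best index value" loop. A short sketch of this traditional scheme, using k-means and the silhouette index purely as stand-ins for whichever base algorithm and internal validity index are chosen:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def traditional_auto_clustering(data, k_min=2, k_max=10):
    """Run the base clustering algorithm for every K in [k_min, k_max] and
    return the partition with the best internal validity value."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        score = silhouette_score(data, labels)  # higher silhouette is better
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

Running the base algorithm once for every candidate K is what makes this scheme computationally expensive for large ranges and datasets.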

The basic ICA

Since the proposed method in this paper is based on ICA, in this section, the steps of ICA algorithm are briefly discussed. ICA is a population-based random search algorithm (Atashpaz-Gargari & Lucas, 2007). This algorithm has 7 steps as follows:

Step 1: Initialization of the empires

ICA is initialized with a set of Npop randomly generated solutions. Each solution is a 1 × D array called a "country". In Eq. (1), xi is the ith parameter of the solution and D is the number of parameters
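As a rough illustration of this step, the sketch below generates Npop random countries, promotes the best ones to imperialists, and distributes the remaining countries as colonies; colonies are assigned here by sampling in proportion to imperialist power, a simplification of the deterministic allocation in Atashpaz-Gargari and Lucas (2007).

import numpy as np

def init_empires(cost_fn, n_pop, n_imp, dim, lower, upper, seed=0):
    """Generate n_pop random countries (1 x dim arrays), promote the n_imp best
    to imperialists, and distribute the rest as colonies among the empires."""
    rng = np.random.default_rng(seed)
    countries = rng.uniform(lower, upper, size=(n_pop, dim))
    costs = np.array([cost_fn(c) for c in countries])
    order = np.argsort(costs)                      # best (lowest cost) first
    imperialists = countries[order[:n_imp]]
    colonies = countries[order[n_imp:]]
    # Normalized power: lower-cost imperialists receive more colonies.
    imp_costs = costs[order[:n_imp]]
    power = (imp_costs.max() - imp_costs) + 1e-12
    power /= power.sum()
    owner = rng.choice(n_imp, size=len(colonies), p=power)  # colony -> empire index
    return imperialists, colonies, owner

# Example: a simple sphere cost on a 5-dimensional search space.
imps, cols, owner = init_empires(lambda x: float((x ** 2).sum()),
                                 n_pop=30, n_imp=5, dim=5,
                                 lower=-10.0, upper=10.0)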

The proposed method

In this section, our motivation and then the proposed algorithm for automatic clustering are explained. The goal of the proposed algorithm is to find the correct number of clusters as well as a high-quality clustering for an automatic clustering problem.
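The details of the method are truncated in this snippet. Purely as an illustrative, hypothetical sketch (not the paper's actual rule), the density-based reinitialization of empty cluster centers mentioned in the abstract could, for example, favor data points that lie in locally dense regions far from the current centers:

import numpy as np

def reinit_empty_center(data, centers, k_neighbors=5):
    """Hypothetical density-aware choice of a replacement for an empty center:
    pick the point maximizing (local density) x (distance to nearest center)."""
    # Local density: inverse of the mean distance to the k nearest neighbours.
    pairwise = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    knn_dist = np.sort(pairwise, axis=1)[:, 1:k_neighbors + 1].mean(axis=1)
    density = 1.0 / (knn_dist + 1e-12)
    # Distance of every point to its nearest existing (non-empty) center.
    to_centers = np.linalg.norm(
        data[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    return data[np.argmax(density * to_centers)]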

Experimental results

In this section, the performance of the proposed AC-ICA is compared with the basic ICA, its three improved variants, and other automatic clustering methods. Section 6.1 describes the synthetic and real-world datasets. Section 6.2 shows the application of the Taguchi method for adjusting parameters. Section 6.3 briefly reviews the criteria used for evaluating the results of the algorithms. In Section 6.4, the performance of AC-ICA on synthetic datasets is evaluated. In Section 6.5, the performance of

Conclusion and future works

In this paper, an automatic clustering algorithm based on ICA, called AC-ICA, is proposed. For this purpose, a new method for changing the number of centers in solutions during evolution is proposed, and an efficient method is also introduced for reinitializing empty cluster centers. In the proposed algorithm, by changing the movement of colonies toward the imperialist in the assimilation step, the ability to explore the solution space and maintain diversity among all solutions with

Author contribution statement

Zahra Aliniya: Formal analysis; Investigation; Methodology; Software; Validation; Visualization; Roles/Writing - original draft. Seyed Abolghasem Mirroshandel: Conceptualization; Methodology; Supervision; Validation; Writing - review & editing.

References (70)

  • E. Hancer et al.

    A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number

    Swarm and Evolutionary Computation

    (2017)
  • A. Hatamlou

    Black hole: A new heuristic optimization approach for data clustering

    Information Sciences

    (2013)
  • A. Hatamlou et al.

    A combined approach for clustering based on K-means and gravitational search algorithms

    Swarm and Evolutionary Computation

    (2012)
  • B. Hejrati et al.

    Efficient lossless multi-channel EEG compression based on channel clustering

    Biomedical Signal Processing and Control

    (2017)
  • I. Heloulou et al.

    Automatic multi-objective clustering based on game theory

    Expert Systems with Applications

    (2017)
  • S. Hosseini et al.

    A survey on the imperialist competitive algorithm metaheuristic: Implementation in engineering domain and directions for future research

    Applied Soft Computing

    (2014)
  • A. José-García et al.

    Automatic clustering using nature-inspired metaheuristics: A survey

    Applied Soft Computing

    (2016)
  • V. Kumar et al.

    Automatic cluster evolution using gravitational search algorithm and its application on image segmentation

    Engineering Applications of Artificial Intelligence

    (2014)
  • R. Kuo et al.

    Automatic kernel clustering with bee colony optimization algorithm

    Information Sciences

    (2014)
  • R. Kuo et al.

    Integration of particle swarm optimization and genetic algorithm for dynamic clustering

    Information Sciences

    (2012)
  • H.-L. Ling et al.

    How many clusters? A robust PSO-based local density model

    Neurocomputing

    (2016)
  • D. Liu et al.

    Analyzing documents with Quantum Clustering: A novel pattern recognition algorithm based on quantum mechanics

    Pattern Recognition Letters

    (2016)
  • R. Liu et al.

    Gene transposon based clone selection algorithm for automatic clustering

    Information Sciences

    (2012)
  • T. Niknam et al.

    An efficient hybrid algorithm based on modified imperialist competitive algorithm and K-means for data clustering

    Engineering Applications of Artificial Intelligence

    (2011)
  • C. Ozturk et al.

    A novel binary artificial bee colony algorithm based on genetic operators

    Information Sciences

    (2015)
  • M.K. Pakhira et al.

    Validity index for crisp and fuzzy clusters

    Pattern Recognition

    (2004)
  • H. Peng et al.

    An automatic clustering algorithm inspired by membrane computing

    Pattern Recognition Letters

    (2015)
  • S. Saha et al.

    Brain image segmentation using semi-supervised clustering

    Expert Systems with Applications

    (2016)
  • A. Saxena et al.

    A review of clustering techniques and developments

    Neurocomputing

    (2017)
  • N.T. Thong

    HIFCF: An effective hybrid model between picture fuzzy clustering and intuitionistic fuzzy recommender systems for medical diagnosis

    Expert Systems with Applications

    (2015)
  • L.Y. Tseng et al.

    A genetic approach to the automatic clustering problem

    Pattern Recognition

    (2001)
  • J. Xie et al.

    Density core-based clustering algorithm with dynamic scanning radius

    Knowledge-Based Systems

    (2018)
  • A. Al Khaled et al.

    Fuzzy adaptive imperialist competitive algorithm for global optimization

    Neural Computing and Applications

    (2015)
  • Y.M.B. Ali

    Unsupervised clustering based an adaptive particle swarm optimization algorithm

    Neural Processing Letters

    (2016)
  • Z. Aliniya et al.

    Solving constrained optimization problems using the improved imperialist competitive algorithm and Deb's technique

    Journal of Experimental & Theoretical Artificial Intelligence (TETA)

    (2018)