
Expert Systems with Applications

Volume 137, 15 December 2019, Pages 357-379

Automatic clustering by multi-objective genetic algorithm with numeric and categorical features

https://doi.org/10.1016/j.eswa.2019.06.056

Highlights

  • We have developed a clustering algorithm for an unknown number of clusters using MOGA.

  • It works with continuous and categorical featured data sets.

  • It can work with data sets having missing values.

  • The final solution is selected by majority voting over all non-dominated solutions.

  • Context-sensitive and cluster-oriented genetic operators are designed.

Abstract

Many clustering algorithms, categorized as K-clustering algorithms, require the user to supply the number of clusters (K) before clustering. Due to a lack of domain knowledge, an accurate value of K is difficult to predict. The problem becomes critical when the dimensionality of the data points is large; when clusters differ widely in shape, size, and density; and when clusters overlap. Determining a suitable K is an optimization problem, and automatic clustering algorithms can discover the optimal K. This paper presents an automatic clustering algorithm which is superior to K-clustering algorithms in that it can discover an optimal value of K. Iterative hill-climbing algorithms like K-Means work on a single solution and converge to a local optimum. Here, Genetic Algorithms (GAs) find near-globally-optimal solutions, i.e. the optimal K as well as the optimal cluster centroids. Single-objective clustering algorithms are adequate for efficiently grouping linearly separable clusters, but they are not so good for non-linearly separable clusters. So, for grouping non-linearly separable clusters, we apply a Multi-Objective Genetic Algorithm (MOGA) that minimizes the intra-cluster distance and maximizes the inter-cluster distance. Many existing MOGA based clustering algorithms are suitable for either numeric or categorical features. This paper pioneers employing MOGA for automatic clustering with mixed types of features. Statistical testing on experimental results on real-life benchmark data sets from the University of California at Irvine (UCI) machine learning repository demonstrates the superiority of the proposed algorithm.

Introduction

Clustering is a well-addressed problem, yet it is considered one of the most difficult and challenging tasks in data mining. Clustering techniques are mainly hierarchical or partitional (Jain, 2010). Hierarchical clustering algorithms build a hierarchy of clusters and fall into two categories: agglomerative and divisive (Rokach & Maimon, 2005). In this work, we use partitional clustering. Partitional clustering algorithms are divided into two categories: hard (or crisp) and soft (or fuzzy). Here, we implement hard clustering for its simplicity. Well-known partitional clustering algorithms like K-Means (Lloyd, 1982, MacQueen, 1967) handle only numerical features, K-Modes (Huang, 1998) is a popular method for categorical features, and K-Prototypes (Huang, 1997) is widely applied to mixed data types. However, the final solutions given by K-Means/K-Modes/K-Prototypes depend on the initialization of clusters and may converge to a locally optimal solution. We use a GA here as it finds near-globally-optimal solutions when clustering the data points of the data set.
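For concreteness, the following is a minimal sketch of a mixed-type dissimilarity in the spirit of K-Prototypes (Huang, 1997): a squared Euclidean term over numeric features plus a γ-weighted simple-matching (mismatch count) term over categorical features. The function name, the toy values, and the γ value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kprototypes_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    # Squared Euclidean distance on the numeric part of the record.
    numeric_part = np.sum((x_num - c_num) ** 2)
    # Simple matching on the categorical part: count of mismatched values.
    categorical_part = np.sum(x_cat != c_cat)
    # gamma balances the influence of categorical against numeric features.
    return numeric_part + gamma * categorical_part

# Toy usage: a record and a prototype with 2 numeric and 2 categorical features.
x_num, x_cat = np.array([1.0, 2.0]), np.array(["red", "small"])
c_num, c_cat = np.array([1.5, 2.5]), np.array(["red", "large"])
print(kprototypes_distance(x_num, x_cat, c_num, c_cat, gamma=0.5))  # 1.0
```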

Many evolutionary clustering algorithms (Bandyopadhyay, Maulik, 2002, Bandyopadhyay, Maulik, Mukhopadhyay, 2007, Chang, Zhang, Zheng, 2009, Chang, Zhang, Zheng, Zhang, 2010, Chen, Wang, 2005, Dutta, Dutta, Sil, 2013a, Dutta, Sil, 2012, Fränti, Kivijärvi, Kaukoranta, Nevalainen, 1997, Garai, Chaudhuri, 2004, Hall, Ozyurt, Bezdek, et al., 1999, Handl, Knowles, 2007, He, Tan, 2012, Kirkland, Rayward-Smith, de la Iglesia, 2011, Korkmaz, Du, Alhajj, Barker, 2006, Lai, 2005, Laszlo, Mukherjee, 2007, Liu, Yu, 2005, Liu, Wu, Shen, 2011, Matake, Hiroyasu, Miki, Senda, 2007, Maulik, Bandyopadhyay, 2000, Merz, Zell, 2002, Mukhopadhyay, Maulik, 2009, Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Özyer, Zhang, Alhajj, 2011, Pan, Zhu, Han, 2003, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Siddique, 2009, Scheunders, 1997, Sheng, Swift, Zhang, Liu, 2005, Shirakawa, Nagao, 2009, Tseng, Yang, 2001, Xia, Zhuang, Yu, 2013, Xiao, Yan, Zhang, Tang, 2010) deal with numerical features only. Some of them deal with categorical features only (Demir, Uyar, Ögüdücü, 2007, Deng, He, Xu, 2010, Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012d, Gan, Wu, Yang, 2009, Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009). There are very few works (Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2013b, Dutta, Dutta, Sil, 2014, Jie, Xinbo, Li-cheng, Sep. 2003, Rahman, Islam, 2014, Zheng, Gong, Ma, Jiao, Wu, 2010) reported so far that deal with both continuous and categorical features. The proposed algorithm can deal with mixed data sets having numerical and categorical features. GA is known for its global search ability. To exploit the local search ability of K-Means/K-Modes/K-Prototypes, it is combined with GA in (Fränti, Kivijärvi, Kaukoranta, Nevalainen, 1997, He, Tan, 2012, Liu, Yu, 2005, Merz, Zell, 2002, Rahman, Islam, 2014, Scheunders, 1997, Sheng, Swift, Zhang, Liu, 2005, Xia, Zhuang, Yu, 2013). So, in the proposed method we combine K-Prototypes and MOGA to utilize the local search ability of K-Prototypes and the global search ability of MOGA.

In many GA based clustering algorithms, users supply the value of K. Few of them are capable of finding out the optimal K; these can be categorized as automatic clustering algorithms. An algorithm called “Clustering Records Following User Defined Attribute Weights” (CRUDAW) (Rahman & Islam, 2012) handles the problem of unknown K by choosing the K initial centroids using a deterministic process based on a user-defined radius rx. By doing this, it avoids the disadvantages of random centroid selection. CRUDAW does not require K from the user, but the user has to specify the cluster radius rx, which is very difficult to predict, and the value of rx largely determines the quality of clusters. In this work, the user does not have to give the value of K or rx. We often prefer clustering algorithms that optimize K and the corresponding partitions (Hruschka, Campello, Freitas et al., 2009). Therefore, we perform automatic clustering here. Generally, a GA uses a random number (K) of clusters (not user specified) ranging between 2 and round(√m) (m is the number of data points in the data set) (Pal & Bezdek, 1995) and thereby forms an initial clustering solution or chromosome having K centroids (called genes) (Hruschka, Campello, Freitas, et al., 2009, Liu, Wu, Shen, 2011, Rahman, Islam, 2014, Xiao, Yan, Zhang, Tang, 2010). A random number of cluster centroids in the range [Kmin, Kmax] are generated in (Bandyopadhyay, Maulik, 2002, Ripon, Siddique, 2009). In this work, we generate some chromosomes of the population randomly, where K varies in the range of 2 to round(√m).
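As an illustration of this random initialization step, the following minimal sketch (assuming a purely numeric data array for brevity) draws K uniformly from [2, round(√m)] and uses K randomly chosen records as the prototypes of one variable-length chromosome; the names and the toy data are hypothetical.

```python
import numpy as np

def random_chromosome(data, rng):
    # Draw K uniformly from [2, round(sqrt(m))] and use K randomly chosen
    # records of the data set as the centroids (genes) of one chromosome.
    m = len(data)
    k_max = max(2, round(np.sqrt(m)))
    k = rng.integers(2, k_max + 1)
    return data[rng.choice(m, size=k, replace=False)]  # shape (K, n)

rng = np.random.default_rng(0)
data = rng.random((100, 4))                # hypothetical data set: m=100, n=4
print(random_chromosome(data, rng).shape)  # e.g. (7, 4) -- K varies per call
```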

A set of alike entities forms a cluster, and entities from different clusters are not alike (Everitt, Landau, & Leese, 2009). So there exist many criteria for good clustering. Many researchers have used the Single-Objective GA (SOGA) and tried to discover clustering solutions by optimizing different objectives of clustering or cluster validity indices. In many cases, validity indices are combinations of different clustering objectives. For example, in Sheng et al. (2005), researchers use a fitness function for the GA that is the weighted sum of six normalized validity indices. An effective fitness function is crucial for the success of a SOGA. We can group SOGA based clustering methods into four categories:

  • SOGA based K-crisp-clustering  (Babu, Murty, 1993, Cao, Liang, Bai, 2009, Chang, Zhang, Zheng, 2009, Deng, He, Xu, 2010, Fränti, Kivijärvi, Kaukoranta, Nevalainen, 1997, Krishna, Murty, 1999, Kuncheva, Bezdek, 1997, Laszlo, Mukherjee, 2007, Lu, Lu, Fotouhi, Deng, Brown, 2004, Lucasius, Dane, Kateman, 1993, Maulik, Bandyopadhyay, 2000, Merz, Zell, 2002, Murthy, Chowdhury, 1996, Pan, Zhu, Han, 2003, Raghavan, Birchard, 1979, Scheunders, 1997, Sheng, Liu, 2004, Zheng, Gong, Ma, Jiao, Wu, 2010).

  • SOGA based automatic-crisp-clustering  (Alves, Campello, Hruschka, 2006, Bandyopadhyay, Maulik, 2002, Casillas, De Lena, Martínez, 2003, Chang, Zhang, Zheng, Zhang, 2010, Cole, 1998, Cowgill, Harvey, Watson, 1999, Garai, Chaudhuri, 2004, He, Tan, 2012, Hruschka, Campello, De Castro, 2006, Hruschka, Ebecken, 2003, Lai, 2005, Liu, Wu, Shen, 2011, Ma, Chan, Yao, Chiu, 2006, Pan, Cheng, 2007, Rahman, Islam, 2014, Sheng, Swift, Zhang, Liu, 2005, Tseng, Yang, 2001, Xiao, Yan, Zhang, Tang, 2010).

  • SOGA based K-fuzzy-clustering (Bezdek, Hathaway, 1994, Cao, Liang, Bai, 2009, Gan, Wu, Yang, 2009, Ghosh, Mishra, Ghosh, 2011, Hall, Ozyurt, Bezdek, et al., 1999, Jie, Xinbo, Li-cheng, Sep. 2003, Klawonn, Keller, 1998).

  • SOGA based automatic-fuzzy-clustering (Alves, Campello, Hruschka, 2007, Campello, Hruschka, Alves, 2009, Liu, Li, Chapman, 2003, Maulik, Bandyopadhyay, 2003, Mukhopadhyay, Maulik, 2009, Pakhira, Bandyopadhyay, Maulik, 2005, Park, Yoo, Cho, 2005).

However, different validity measures suit different kinds of data sets. There are numerous conflicting measurements for evaluating an effective clustering solution; therefore, we have to simultaneously optimize many such measures to capture different characteristics of the data sets. So, clustering is a Multi-Objective Optimization Problem (MOOP), as suggested by Ferligoj and Batagelj (1992). Multi-Objective Evolutionary Algorithms (MOEAs) like MOGAs are suitable for MOOPs like clustering. An ideal single-objective clustering algorithm identifies a globally optimal solution, while multi-objective algorithms extract a set of solutions lying on the Pareto-optimal front. Therefore, multi-objective algorithms always find a solution equal to or better than those of the single-objective algorithms (Handl & Knowles, 2007). So, we use MOGA to solve clustering. Many multi-objective evolutionary clustering algorithms are available. In a survey paper (Mukhopadhyay, Maulik, Bandyopadhyay, & Coello, 2014), researchers categorized them according to the type of MOEA, the chromosome encoding schemes, the objective measures optimized, the evolutionary operators used and the procedure used to select the final solution from the non-dominated front.
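Pareto dominance is the comparison underlying all such MOEAs: one solution dominates another if it is at least as good in every objective and strictly better in at least one. A minimal sketch follows (the function name and toy objective vectors are hypothetical; both objectives are treated as minimized, so a maximized objective such as separateness would be negated):

```python
def dominates(a, b):
    # Pareto dominance for minimization: a dominates b if a is no worse in
    # every objective and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Toy objective vectors: (intra-cluster distance, negated inter-cluster distance).
print(dominates((0.4, -2.0), (0.5, -1.5)))  # True: better in both objectives
print(dominates((0.4, -1.0), (0.5, -1.5)))  # False: the two are incomparable
```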

Researchers use mainly four MOEAs for multi-objective clustering as the underlying optimization tool.

  • 1.

    Pareto Envelope-based Selection Algorithm-II (PESAII) (Corne, Jerram, Knowles, & Oates, 2001),

  • 2.

    Nondominated Sorting Genetic Algorithm-II (NSGAII) (Deb, Pratap, Agarwal, & Meyarivan, 2002),

  • 3.

    Strength Pareto Evolutionary Algorithm-2 (SPEA2) (Zitzler, Laumanns, & Thiele, 2001) and

  • 4.

    Niched Pareto Genetic Algorithm (NPGA) (Horn, Nafpliotis, & Goldberg, 1994).

Multi-Objective Clustering with automatic K determination (MOCK) (Handl & Knowles, 2007) uses PESAII. MOEA (dynamic) (Chen & Wang, 2005), Multi-Objective Genetic Clustering (Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2013), Multi-Objective Clustering Algorithms (MOCA) (Kirkland et al., 2011), Variable-length Real Jumping Genes Genetic Algorithms (VRJGGA) (Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006) and (Özyer et al., 2011) use NSGAII; SPEA2 is used in Demir et al. (2007), Matake et al. (2007) and Shirakawa and Nagao (2009); and NPGA is applied in Korkmaz et al. (2006). Due to the limitations of existing algorithms pointed out below, we developed our own MOGA for clustering.

Similar to SOGA based clustering, MOGA based clustering methods can be grouped into four categories.

  • MOGA based K-crisp-clustering (Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d)

  • MOGA based K-fuzzy-clustering (Bandyopadhyay, Maulik, Mukhopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009)

  • MOGA based automatic-crisp-clustering (Chen, Wang, 2005, Handl, Knowles, 2007, Kirkland, Rayward-Smith, de la Iglesia, 2011, Korkmaz, Du, Alhajj, Barker, 2006, Matake, Hiroyasu, Miki, Senda, 2007, Özyer, Zhang, Alhajj, 2011, Praditwong, Harman, Yao, 2011, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006, Shirakawa, Nagao, 2009)

  • MOGA based automatic-fuzzy-clustering (Demir, Uyar, Ögüdücü, 2007, Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Ripon, Siddique, 2009)

This work falls into the third category, i.e. MOGA based automatic-crisp-clustering.

Prototype-predicated and point-predicated are the two approaches to chromosome representation, and the choice is directly related to the objective function to be optimized. In the prototype-predicated approach, a chromosome encodes cluster representatives (centroids) as its genes (Bandyopadhyay, Maulik, 2002, Bandyopadhyay, Maulik, Mukhopadhyay, 2007, Chang, Zhang, Zheng, 2009, Chang, Zhang, Zheng, Zhang, 2010, Chen, Wang, 2005, Deng, He, Xu, 2010, Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d, Dutta, Dutta, Sil, 2013a, Dutta, Dutta, Sil, 2013b, Dutta, Dutta, Sil, 2014, Dutta, Sil, 2012, Fränti, Kivijärvi, Kaukoranta, Nevalainen, 1997, Hall, Ozyurt, Bezdek, et al., 1999, He, Tan, 2012, Jie, Xinbo, Li-cheng, Sep. 2003, Kirkland, Rayward-Smith, de la Iglesia, 2011, Lai, 2005, Laszlo, Mukherjee, 2007, Maulik, Bandyopadhyay, 2000, Merz, Zell, 2002, Mukhopadhyay, Maulik, 2009, Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Pan, Zhu, Han, 2003, Rahman, Islam, 2014, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006, Scheunders, 1997, Sheng, Swift, Zhang, Liu, 2005, Tseng, Yang, 2001, Xia, Zhuang, Yu, 2013, Xiao, Yan, Zhang, Tang, 2010, Zheng, Gong, Ma, Jiao, Wu, 2010). In prototype-predicated encoding, the length of the chromosomes is short and the genetic operators have lower time complexity. It is also good for capturing overlapping clusters. Therefore, we follow this approach. However, these algorithms have a tendency to find clusters with a spherical shape. Chromosomes are made up of genes. We can categorize this as real encoding, as each gene is an (n−1)-dimensional real vector, where (n−1) is the number of features in the data set excluding the class label, representing the coordinates of a cluster prototype. A chromosome encodes K such prototypes or genes. To form clusters, we assign data points to the nearest prototype. These algorithms are subdivided into fixed-length chromosome encoding algorithms (Bandyopadhyay, Maulik, 2002, Bandyopadhyay, Maulik, Mukhopadhyay, 2007, Chang, Zhang, Zheng, 2009, Chang, Zhang, Zheng, Zhang, 2010, Deng, He, Xu, 2010, Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d, Dutta, Dutta, Sil, 2013a, Dutta, Dutta, Sil, 2013b, Dutta, Dutta, Sil, 2014, Dutta, Sil, 2012, Fränti, Kivijärvi, Kaukoranta, Nevalainen, 1997, Hall, Ozyurt, Bezdek, et al., 1999, Jie, Xinbo, Li-cheng, Sep. 2003, Laszlo, Mukherjee, 2007, Maulik, Bandyopadhyay, 2000, Merz, Zell, 2002, Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Pan, Zhu, Han, 2003, Xia, Zhuang, Yu, 2013, Zheng, Gong, Ma, Jiao, Wu, 2010), which use a fixed-length string to describe the cluster centroids where the number of clusters is specified a priori, and variable-length chromosome encoding algorithms (Chen, Wang, 2005, He, Tan, 2012, Kirkland, Rayward-Smith, de la Iglesia, 2011, Lai, 2005, Mukhopadhyay, Maulik, 2009, Mukhopadhyay, Maulik, 2011, Rahman, Islam, 2014, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006, Scheunders, 1997, Sheng, Swift, Zhang, Liu, 2005, Tseng, Yang, 2001, Xiao, Yan, Zhang, Tang, 2010), which use a variable-length string to describe the cluster centroids and where K is automatically evolved. The MOGA developed here uses a variable-length string, and K is not specified beforehand.
For variable-length chromosomes, special evolutionary operators are required. Moreover, the genes may be very large in prototype-predicated encoding if the number of features is large; therefore, this encoding strategy is not suitable for high-dimensional data sets. In (Dutta, Dutta, Sil, 2012d, Dutta, Dutta, Sil, 2013a, Dutta, Dutta, Sil, 2013b, Dutta, Dutta, Sil, 2014), we have done simultaneous feature selection and clustering to deal with this type of problem; this is known as subspace clustering. In (Xia et al., 2013), researchers discuss soft subspace clustering with a MOEA for high-dimensional data.
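To make the nearest-prototype decoding step concrete, here is a minimal sketch of turning a prototype-encoded chromosome into a crisp clustering. Euclidean distance stands in for the paper's mixed-type distance, and all names and data are illustrative assumptions.

```python
import numpy as np

def decode(chromosome, data):
    # chromosome: (K, n) array of prototypes; data: (m, n) array of records.
    # Distance from every record to every prototype, then nearest assignment.
    dists = np.linalg.norm(data[:, None, :] - chromosome[None, :, :], axis=2)
    return dists.argmin(axis=1)  # crisp label vector of length m

rng = np.random.default_rng(1)
data = rng.random((50, 3))                           # hypothetical 50x3 data set
chromosome = data[rng.choice(50, 4, replace=False)]  # K=4 prototypes
print(np.bincount(decode(chromosome, data)))         # resulting cluster sizes
```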

Point-predicated encoding encodes the complete clustering solution over the data points of the data set. Under this category, there are three main approaches: the cluster label-predicated approach, binary encoding, and locus-predicated adjacency representation. In the cluster label-predicated approach (Liu, Yu, 2005, Özyer, Zhang, Alhajj, 2011, Praditwong, Harman, Yao, 2011), the chromosome length is equal to m (the number of data points in the data set), and the value at each position in the chromosome represents the cluster label of the corresponding data point.

In binary encoding, a chromosome is a string of ‘0’s and ‘1’s of length m. If position j of the chromosome holds ‘1’, then the jth record of the data set is a centroid of a cluster. According to Jain, Murty, and Flynn (1999), perhaps the oldest paper on the use of GAs for clustering is by Raghavan and Birchard (1979). Researchers in Ripon and Siddique (2009) follow this approach. In Babu and Murty (1993), Garai and Chaudhuri (2004), Hall et al. (1999), Kuncheva and Bezdek (1997) and Lai (2005), researchers use binary encoding, but their approaches differ. Among these, Hall et al. (1999) and Lai (2005) report both prototype-predicated and binary encoding.

Binary encoding has many drawbacks (Hruschka, Campello, Freitas, et al., 2009, Michalewicz, Hartley, 1996), so researchers adopted integer encoding. It is of two types: label based encoding and medoid based encoding. In label based encoding, each chromosome has a length of m, where m is the number of data points of the data set. The ith position (1 ≤ i ≤ m) of a chromosome may hold a value k (1 ≤ k ≤ K), which indicates that the ith data point belongs to the kth cluster (Krishna, Murty, 1999, Lu, Lu, Fotouhi, Deng, Brown, 2004, Murthy, Chowdhury, 1996). The label based encoding scheme is inherently redundant, as K! unique chromosomes can be constructed for a single clustering solution. In medoid based encoding, each chromosome has length K, and the ith position (1 ≤ i ≤ K) holds a value between 1 and m. The data points encoded in a chromosome are a set of cluster centers, thus giving a clustering solution; data points are assigned to the nearest prototype to form clusters. This suffers from the same problem, i.e. redundancy, which can be avoided by a renumbering procedure (Falkenauer, 1998). Renumbering is also applicable to label based encoding. Menendez, Barrero, and Camacho (2014) use both label based and medoid based encoding.

MOCAs such as graph-predicated sequence clustering (GraSC) (Demir et al., 2007) use another type of integer encoding known as the locus-predicated encoding strategy. MOCK (Handl, Knowles, 2007, Matake, Hiroyasu, Miki, Senda, 2007, Shirakawa, Nagao, 2009) uses a variant of the locus-predicated encoding strategy based on Minimum Spanning Trees (MST). Here, each chromosome consists of m genes (m is the number of data points in the data set) and each gene takes integer values in {1, ..., m}. If gene i is assigned a value j, it represents a link between data points i and j in the resulting clustering solution, and these two data points belong to the same cluster. Although integer encoding techniques are not biased towards convex-shaped clusters, they suffer from the large length of chromosomes when the number of data points m is large. In Korkmaz et al. (2006), researchers use linkage-based integer encoding. Algorithms that use this encoding approach take longer to converge, as the encoding may assign different labeling representations to the same clustering solution, resulting in redundancy. It may not be suitable for detecting overlapping clusters. Unlike prototype-predicated encoding, here the chromosome length is independent of the encoded number of clusters.
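The following is a minimal sketch of how a locus-predicated chromosome decodes into clusters: gene i links point i to point genes[i], and the connected components of the resulting graph are the clusters. Indices are 0-based here for Python, and the helper name, traversal strategy, and toy chromosome are assumptions rather than MOCK's actual implementation.

```python
def decode_locus_based(genes):
    # genes[i] = j means points i and j are linked and share a cluster.
    # Clusters are the connected components of the undirected link graph.
    m = len(genes)
    labels = [-1] * m
    cluster = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        stack, component = [start], set()
        while stack:
            i = stack.pop()
            if i in component:
                continue
            component.add(i)
            stack.append(genes[i])                              # outgoing link
            stack.extend(j for j in range(m) if genes[j] == i)  # incoming links
        for i in component:
            labels[i] = cluster
        cluster += 1
    return labels

# Toy chromosome over 6 points: two components {0,1,2} and {3,4,5}.
print(decode_locus_based([1, 2, 0, 4, 5, 3]))  # [0, 0, 0, 1, 1, 1]
```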

Considering the pros and cons of the different chromosome encoding strategies, we use a prototype-predicated strategy with variable length chromosomes in this work.

Many MOCAs use two objectives (Bandyopadhyay, Maulik, Mukhopadhyay, 2007, Chen, Wang, 2005, Demir, Uyar, Ögüdücü, 2007, Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d, Dutta, Dutta, Sil, 2013a, Dutta, Dutta, Sil, 2013b, Dutta, Dutta, Sil, 2014, Dutta, Sil, 2012, Handl, Knowles, 2007, Korkmaz, Du, Alhajj, Barker, 2006, Liu, Yu, 2005, Matake, Hiroyasu, Miki, Senda, 2007, Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Siddique, 2009, Ripon, Tsang, Kwong, 2006, Shirakawa, Nagao, 2009, Xia, Zhuang, Yu, 2013), although the objective measures used differ between algorithms. Chen and Wang (2005) minimize cluster compactness and maximize connectedness, which is conceptually similar to the criterion of nearest-neighbor consistency introduced by Ding and He (2004). Ripon et al. maximize intra-cluster entropy and inter-cluster distance (Ripon, Tsang, Kwong, Ip, 2006, Ripon, Siddique, 2009, Ripon, Tsang, Kwong, 2006). Handl and Knowles (2007) use two optimization measures: one measures cluster compactness, i.e. the partitioning, and the other reflects cluster connectedness, i.e. connectivity, which evaluates the degree to which neighboring data points are placed in the same cluster. Matake et al. (2007) and Demir et al. (2007) also use the same objectives. The global fuzzy compactness of the clusters and the fuzzy separation are optimized simultaneously in Mukhopadhyay and Maulik (2011). Shirakawa and Nagao (2009) use deviation and edge value as objectives. Korkmaz et al. (2006) apply two objectives: minimizing the Total Within Cluster Variation (TWCV), which is similar to cluster compactness, and minimizing the number of clusters (K). In many cases, the objective measures are clustering indices. The performance of popular MOEAs such as NSGAII, PAES and PESAII degrades with the increase of the number of objectives (Saxena, Duro, Tiwari, Deb, & Zhang, 2013). That may be why only two objectives are typically used. There are also a few multi-objective automatic clustering techniques that use more than two objective measures. Three objectives are simultaneously optimized in Kirkland et al. (2011) and four in Özyer et al. (2011). Kirkland et al. (2011) minimize the compactness, maximize the separateness and maximize the connectivity of clusters. Özyer et al. (2011) minimize the sum of the intra-cluster distances, maximize the sum of inter-cluster distances, minimize incomparable regions between clusters and minimize the cluster radius. Praditwong et al. (2011) consider five objectives for software module clustering. They maximize the sum of intra-edges, minimize the sum of inter-edges, maximize the number of clusters, maximize the sum of the ratio of intra-edges and inter-edges in each cluster (MQ) and minimize the number of isolated clusters. It should be noted that the choice of a suitable set of objective functions is not a trivial problem and the clustering output may heavily depend on an appropriate choice (Handl & Knowles, 2012). In view of this, an interactive MOCA is proposed in Mukhopadhyay et al. (2013).

As the performance of MOGA degrades with the increase in the number of objectives, we take two objectives for optimization. The two objectives of MOGA are minimization of compactness and maximization of separateness. A further important aspect in the choice of these objectives is their potential to balance each other’s tendency to increase or decrease the value of K, enabling the use of chromosomes in the GA with different K. While the objective value associated with overall compactness necessarily improves with an increasing K, the opposite is the case for separateness. Other combinations of objectives that do not have this property cannot be considered for automatic clustering.
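As an illustration, the sketch below computes hypothetical versions of the two measures: compactness as the total distance from records to their own centroid (minimized), and separateness as the minimum pairwise centroid distance (maximized, so negated here to give a pair of minimization objectives). These are illustrative Euclidean formulations, not the paper's exact definitions.

```python
import numpy as np

def objectives(chromosome, labels, data):
    # Compactness: sum of distances from each record to its own centroid.
    compactness = sum(
        np.linalg.norm(data[labels == k] - c, axis=1).sum()
        for k, c in enumerate(chromosome)
    )
    # Separateness: smallest distance between any two centroids.
    K = len(chromosome)
    separateness = min(
        np.linalg.norm(chromosome[i] - chromosome[j])
        for i in range(K) for j in range(i + 1, K)
    )
    # Negate separateness so that both objectives are minimized.
    return compactness, -separateness

rng = np.random.default_rng(4)
data = rng.random((60, 2))
chromosome = data[rng.choice(60, 3, replace=False)]               # K=3 prototypes
labels = np.linalg.norm(data[:, None] - chromosome, axis=2).argmin(1)
print(objectives(chromosome, labels, data))
```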

The final solution depends on the initialization of clusters; experiments have verified that the initial values of the cluster centers greatly affect the quality of partitional clustering algorithms like K-Means. Generally, there are four categories of cluster initialization processes for GA based clustering (He & Tan, 2012):

  • 1.

    Random sampling methods,

  • 2.

    Distance optimization methods,

  • 3.

    Density estimation methods and

  • 4.

    Attribute feature methods.

Following integer encoding, algorithms in Krishna and Murty (1999), Lu et al. (2004) and Murthy and Chowdhury (1996) generate the chromosomes of the Initial Population (IP) by randomly assigning data points to clusters. Lucasius et al. (1993) select a set of data points as medoids to build the IP. Similarly, data points are chosen randomly as initial prototypes in Bandyopadhyay and Maulik (2002), Bandyopadhyay et al. (2007), Chang et al. (2009), Chen and Wang (2005), Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d, Garai and Chaudhuri (2004), Hall et al. (1999), Kuncheva and Bezdek (1997), Kirkland et al. (2011), Liu and Yu (2005), Liu et al. (2011), Lai (2005), Laszlo and Mukherjee (2007), Maulik and Bandyopadhyay (2000), Merz and Zell (2002), Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Mukhopadhyay, Maulik, 2009, Mukhopadhyay, Maulik, 2011, Özyer et al. (2011), Pan et al. (2003), Praditwong et al. (2011), Ripon, Tsang, Kwong, Ip (2006), Ripon, Tsang, Kwong (2006), Ripon and Siddique (2009), Scheunders (1997), Sheng and Liu (2004), Tseng and Yang (2001), Xia et al. (2013) and Zheng et al. (2010). Researchers use random population initialization in a quantum inspired GA (Xiao et al., 2010). We take 10% of the data set size as the IP size (Dutta, Dutta, Sil, 2012a, Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d), as the IP size guides the searching power of the MOGA and should increase with the data set size. In prototype-predicated chromosome representation, some randomly selected data points are the prototypes in the initial chromosomes. On the other hand, for point-predicated encoding, random strings initialize the cluster label vectors so that each point gets a random cluster label. Some algorithms utilize specialized initialization. Researchers initialize the chromosomes by MST and K-Means clustering algorithms in Demir et al. (2007), Handl and Knowles (2007), Matake et al. (2007) and Shirakawa and Nagao (2009). Chromosomes of the IP are generated respecting the constraints of the linear linkage encoding scheme in Korkmaz et al. (2006). The algorithm described in Cao et al. (2009) selects initial centroids deterministically based on the frequency scores of the data points. As these methods build chromosomes as clustering solutions, they are potentially better than building chromosomes by randomly selecting records as cluster centroids. These are fine-tuned in the subsequent generations of GAs by means of different genetic operators and have a better chance to converge, as they start from better positions. However, the time complexity of such algorithms is much greater than that of algorithms doing random initialization (Mukhopadhyay, Maulik, & Bandyopadhyay, 2015). Randomly chosen chromosomes can explore regions of the search space that the deterministic process would not reach. So, in the proposed work, we use a combination of randomly chosen and deterministically chosen chromosomes in the IP. This helps the MOGA to obtain a better final chromosome. For this, gene rearrangement and twin removal are crucial (Rahman & Islam, 2014). In He and Tan (2012), researchers use the maximum attribute range partition method for IP building.
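A minimal sketch of such a mixed initial population follows. Half the chromosomes pick prototypes at random; the other half use a deterministic stand-in (records farthest from the data mean) for the paper's actual deterministic procedure, which follows Rahman and Islam (2014). All names, the 50/50 split, and the seeding rule are assumptions for illustration.

```python
import numpy as np

def initial_population(data, rng, frac=0.10):
    # IP size = 10% of the data set size, as in the text above.
    m = len(data)
    size = max(2, int(frac * m))
    k_max = max(2, round(np.sqrt(m)))
    pop = []
    for i in range(size):
        k = rng.integers(2, k_max + 1)
        if i % 2 == 0:
            # Random half: K records chosen uniformly as prototypes.
            idx = rng.choice(m, size=k, replace=False)
        else:
            # Deterministic half (illustrative stand-in): the K records
            # farthest from the overall data mean become prototypes.
            idx = np.argsort(np.linalg.norm(data - data.mean(0), axis=1))[-k:]
        pop.append(data[idx])
    return pop

rng = np.random.default_rng(3)
pop = initial_population(rng.random((200, 4)), rng)
print(len(pop), [len(c) for c in pop[:5]])  # 20 chromosomes, varying K
```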

GAs use different types of genetic operators. In this regard, Falkenauer (1998) introduces context insensitivity as follows: “the schemata defined on the genes of the simple chromosome do not convey useful information that could be exploited by the implicit sampling process carried out by a clustering GA”. It is easily detected in single-point crossover with binary encoding, label based integer encoding and real encoding, whereas in integer medoid based encoding context insensitivity is not easily detectable. The two recombination operators used by Merz and Zell (2002) also suffer from this problem. Another feature of genetic operators is cluster-orientation. In this context, Hruschka et al. (2009) write: “Cluster-oriented operators mean the operators that are task dependent, such as operators that copy, split, merge, and eliminate clusters of data points, in contrast to conventional evolutionary operators that just exchange or switch bits without any regard to their task-dependent meaning”. The single-point crossover used in Murthy and Chowdhury (1996) is neither cluster-oriented nor context-sensitive. Traditional genetic operators usually just manipulate gene values without taking into account their connections with other genes, which does not help in obtaining an optimized solution for grouping problems like clustering. Kuncheva and Bezdek use a context-sensitive uniform crossover for binary coded chromosomes (Kuncheva & Bezdek, 1997). Pan et al. (2003) use a uniform crossover. In Lucasius et al. (1993) and Sheng and Liu (2004), the researchers use a cluster-oriented crossover operator. The genetic operators used in Maulik and Bandyopadhyay (2000) and Scheunders (1997) exchange information contained in the centroids of the clusters; these are context-insensitive. In Maulik and Bandyopadhyay (2000), parts of cluster centers may get modified due to the selection of crossover points within cluster centers; therefore, this approach is not cluster-oriented. Chang et al. apply two crossover operators: the path-based crossover and the heuristic crossover (Chang et al., 2009). They perform gene rearrangement before crossover to produce better quality chromosomes after crossover. Fränti et al. (1997) propose three new crossover methods: pairwise, largest partitions, and pairwise nearest neighbor. Laszlo and Mukherjee use a crossover exchanging neighboring centers (Laszlo & Mukherjee, 2007). Krishna and Murty do not use any crossover operator (Krishna & Murty, 1999). Zheng et al. (2010) apply a Simulated Binary Crossover (SBX) (Agrawal, Deb, & Agrawal, 1995) on continuous features and a single-point crossover on categorical features. A conventional single-point crossover is used in Bandyopadhyay et al. (2007) and Mukhopadhyay, Maulik, and Bandyopadhyay (2009) by considering cluster centers as atomic units. Mukhopadhyay et al. (2007) use a conventional uniform crossover with a random mask. We use a single-point crossover, considering cluster centers (Dutta et al., 2012a) as atomic units; hence our approach is cluster-oriented. We implement the pairwise crossover (Fränti et al., 1997) in Dutta, Dutta, Sil, 2012b, Dutta, Dutta, Sil, 2012d considering context-sensitivity. The prototype-predicated representation in Korkmaz et al. (2006), Mukhopadhyay and Maulik (2011), Mukhopadhyay et al. (2013) and Praditwong et al. (2011) uses a single-point crossover, which is context-insensitive.
In most works, gene rearrangement is not done, which may result in useless offspring chromosomes. Chang et al. (2009) perform gene rearrangement before crossover to produce better quality chromosomes after crossover, but the gene rearrangement operations used in Chang et al. (2009) require chromosomes of the same size. In Chen and Wang (2005), researchers use a two-point crossover. Ripon, Tsang, Kwong, Ip (2006), Ripon, Tsang, Kwong (2006) and Ripon and Siddique (2009) employ a jumping gene crossover in their MOCAs. Researchers use a uniform crossover in most cases where algorithms are based on the point-predicated encoding policy (Demir, Uyar, Ögüdücü, 2007, Handl, Knowles, 2007, Matake, Hiroyasu, Miki, Senda, 2007, Shirakawa, Nagao, 2009), which is hence cluster-oriented. Özyer et al. (2011) use a two-point crossover. The crossover operation in Kirkland et al. (2011) exchanges one cluster in one chromosome with the corresponding smaller clusters in the other chromosome. In this work, we apply a single-point crossover after gene rearrangement with unequal sized chromosomes, considering context-sensitivity. To preserve context-sensitivity and cluster-orientation, we perform the crossover at the boundary of the genes, i.e. here the genes are atomic; a sketch follows.
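Below is a minimal sketch of a single-point crossover that cuts only at gene (centroid) boundaries, so centroids stay atomic and parents of unequal length can mate. The gene rearrangement step described above is omitted, and all names are illustrative.

```python
import numpy as np

def crossover(parent1, parent2, rng):
    # Cut points fall between genes (rows), never inside a centroid,
    # leaving at least one gene on each side of each cut.
    cut1 = rng.integers(1, len(parent1))
    cut2 = rng.integers(1, len(parent2))
    child1 = np.vstack([parent1[:cut1], parent2[cut2:]])
    child2 = np.vstack([parent2[:cut2], parent1[cut1:]])
    return child1, child2

rng = np.random.default_rng(2)
p1 = rng.random((3, 4))    # parent with K=3 prototypes
p2 = rng.random((5, 4))    # parent with K=5 prototypes
c1, c2 = crossover(p1, p2, rng)
print(c1.shape, c2.shape)  # offspring K may differ from both parents'
```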

After crossover, various types of mutation operators are also employed. The mutation operator in Kuncheva and Bezdek (1997) for binary coded chromosomes is cluster-oriented, as it adds or deletes a prototype and the corresponding cluster by altering bits of the binary coded chromosomes. Murthy and Chowdhury (1996) implement a mutation operator that randomly changes the values of some randomly selected genes, and it is hence object-oriented. Some researchers (Krishna, Murty, 1999, Lu, Lu, Fotouhi, Deng, Brown, 2004) describe a mutation operator that changes a gene value based on the distances of the cluster centroids from the corresponding data point. Another group of scientists (Lucasius, Dane, Kateman, 1993, Sheng, Liu, 2004) develop a cluster-oriented GA operator that randomly selects a medoid that can be replaced with a data point from the data set according to a predetermined probability. The mutation operators described in Chang et al. (2009), Bandyopadhyay et al. (2007), Maulik and Bandyopadhyay (2000) and Scheunders (1997) slightly change (perturb) the centroids encoded in chromosomes. Contrary to this, Merz and Zell (2002) propose a mutation operator that replaces cluster prototypes with data points from the data set, and it is hence cluster-oriented. Some researchers apply a polynomial mutation (Hubert & Arabie, 1985) for continuous features and a uniform mutation for categorical features (Zheng et al., 2010). In Mukhopadhyay, Maulik, Bandyopadhyay, 2007, Mukhopadhyay, Maulik, Bandyopadhyay, 2009, a position of a chromosome is selected for mutation; the categorical value at that position is then replaced by another random value chosen from the corresponding categorical domain. By mutation, we randomly replace cluster centers (Dutta et al., 2012a), cluster modes (Dutta et al., 2012b) or cluster centroids (Dutta et al., 2012d) with data points from the data set; hence this approach is cluster-oriented and object-oriented. In prototype-predicated encoding, centroid perturbation is found to be the predominant mutation operator (Chen, Wang, 2005, Mukhopadhyay, Maulik, 2011, Mukhopadhyay, Maulik, Bandyopadhyay, 2013, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006). Centroid perturbation shifts a randomly selected centroid slightly from its current position. These operators neither create new clusters nor eliminate existing ones. Korkmaz et al. (2006) propose a grafting mutation that changes the membership of a set of data points rather than just a single data point encoded in chromosomes. To deal with a large chromosome length, directed neighborhood-biased mutation, a special mutation operator, is proposed by Handl and Knowles (2007). This nearest-neighbor-based mutation operator cannot be categorized as cluster-oriented or object-oriented. It is also used by other researchers (Demir, Uyar, Ögüdücü, 2007, Matake, Hiroyasu, Miki, Senda, 2007, Shirakawa, Nagao, 2009). In Kirkland et al. (2011), a mutation operator is employed in which, with equal probability, either random cluster centers of the chromosome are perturbed or cluster centers are added to/deleted from the chromosome. For cluster label-based encoding, a single-point mutation replaces the class label of the selected point by an arbitrary class label (Praditwong et al., 2011). In Özyer et al. (2011), the mutation operator replaces each gene value with respect to a probability distribution. In this work, our mutation operator replaces centroids randomly, so it can be considered cluster-oriented; a sketch follows.
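A minimal sketch of such a replacement mutation, under the same array-based representation assumed in the earlier sketches (names and the mutation probability are illustrative):

```python
import numpy as np

def mutate(chromosome, data, rng, p_mut=0.1):
    # With probability p_mut, replace one randomly chosen centroid (gene)
    # with a randomly chosen record from the data set.
    child = chromosome.copy()
    if rng.random() < p_mut:
        gene = rng.integers(len(child))
        child[gene] = data[rng.integers(len(data))]
    return child

rng = np.random.default_rng(5)
data = rng.random((40, 3))
parent = data[rng.choice(40, 3, replace=False)]
child = mutate(parent, data, rng, p_mut=1.0)  # force a mutation for the demo
# Almost surely True: a centroid was replaced (unless the same record is redrawn).
print(np.any(parent != child))
```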

At each generation, the GA creates a new set of solutions by selecting individual potential solutions (chromosomes) according to their fitness in the problem domain. For SOGA, as there is a single objective, the GA selects a set of solutions based on the values of the objective or fitness function. In the case of MOGA, there may exist solutions that are superior to the rest when some objectives are considered but inferior to other solutions in one or more other objectives. So, the selection process is more complicated for MOGA than for SOGA. Chen and Wang (2005) use two selection strategies: binary tournament selection for crossover and mutation, and another based on NSGA-II (Deb et al., 2002). The selection strategy of Korkmaz et al. (2006) is based on Pareto dominance. A non-dominated sorting strategy employing the crowding-distance assignment is performed to achieve elitism for the next generation in Ripon, Tsang, Kwong, Ip (2006), Ripon, Tsang, Kwong (2006). Shirakawa and Nagao (2009) select non-dominated solutions from the combined population of two generations. Özyer et al. (2011) use a tournament selection based on Pareto dominance. A crowded binary tournament selection method is used in Mukhopadhyay and Maulik (2011) and Mukhopadhyay et al. (2013). In this work, we select the non-dominated chromosomes from the combined population, thus preserving elitism; see the sketch below.
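A minimal sketch of this elitist step, keeping only the non-dominated members of a combined parent-plus-offspring population (all objectives treated as minimized; names and toy values are assumptions):

```python
def dominates(a, b):
    # Pareto dominance for minimization over objective tuples a and b.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(population, scores):
    # Keep every individual not dominated by any other individual.
    return [
        ind for ind, s in zip(population, scores)
        if not any(dominates(t, s) for t in scores if t != s)
    ]

# Toy combined population: ids with (compactness, -separateness) scores.
population = ["p1", "p2", "o1", "o2"]
scores = [(0.9, -1.0), (0.5, -2.0), (0.4, -1.5), (0.8, -1.2)]
print(nondominated(population, scores))  # ['p2', 'o1'] survive
```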

The last generation of a MOEA based clustering algorithm gives a set of non-dominated solutions. MOEA based clustering algorithms differ from SOEA based ones in the method for obtaining the final solution from the non-dominated set of solutions yielded by the MOEA. These methods can be broadly classified into three categories:

  • 1.

    The independent objective-based approach,

  • 2.

    The knee-based approach and

  • 3.

    The cluster ensemble-based approach.

In the independent objective-based approach, an independent cluster validity index, different from the optimization measures used by the MOEA based clustering algorithm, is used to select a single solution from the non-dominated front. Many of the currently available multi-objective clustering techniques (Chen, Wang, 2005, Kirkland, Rayward-Smith, de la Iglesia, 2011, Korkmaz, Du, Alhajj, Barker, 2006, Mukhopadhyay, Maulik, 2011, Özyer, Zhang, Alhajj, 2011, Ripon, Tsang, Kwong, Ip, 2006, Ripon, Tsang, Kwong, 2006) use this approach because of its simplicity. One may criticize this approach by asking why the independent validity measure is not optimized directly; this question does not have a very convincing answer.

The second approach is the knee-based approach, where the objective is to select a knee solution from the non-dominated solutions. A knee solution is an interesting solution where a change in one objective value induces the maximum change in the other. Handl and Knowles use the knee-based approach in their MOCK algorithm (Handl & Knowles, 2007), and Shirakawa and Nagao (2009) also use this approach. It is not well explained why the user should be interested in this solution. Another major problem is the high time complexity associated with choosing the knee solution (Matake et al., 2007). For that reason, Matake et al. (2007) show a simpler way of choosing the knee solution.

The third approach is the cluster ensemble-based approach (Mukhopadhyay et al., 2013), where it is assumed that all the non-dominated solutions contain some information about the clustering structure of the data set. Therefore, the motivation is to combine this information to obtain a single clustering solution. Selecting the best solution from the solutions lying on the Pareto-optimal front is an important open problem (José-García & Gómez-Flores, 2016). In this work, we propose a novel technique to select a solution from the non-dominated solutions using nine cluster validity indices.
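A minimal sketch of such an index-based majority vote: each validity index votes for the front member it ranks best, and the member with the most votes is returned. The paper uses nine specific indices; here three arbitrary scoring callables (higher assumed better) stand in for them, and all names are hypothetical.

```python
from collections import Counter

def majority_vote(front, index_functions):
    # Each validity index casts one vote for the solution it scores highest;
    # the solution with the most votes is selected from the front.
    votes = Counter(
        max(range(len(front)), key=lambda i: index(front[i]))
        for index in index_functions
    )
    best, _ = votes.most_common(1)[0]
    return front[best]

# Toy usage: three "solutions" scored by three hypothetical indices.
front = [{"score": 0.4}, {"score": 0.9}, {"score": 0.7}]
indices = [lambda s: s["score"], lambda s: s["score"] ** 2, lambda s: -s["score"]]
print(majority_vote(front, indices))  # two of three indices vote for {"score": 0.9}
```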

Research in the area of automatic clustering by MOGA became popular after the MOCK algorithm of Handl and Knowles (2007). However, Matake et al. (2007) have shown that MOCK's computational cost is too high for large data sets and propose an improvement that allows MOCK to be applied to them. Recently, NSGAII has been adopted as the basis for the Multi-Objective Evolutionary Approach based on Soft Subspace Clustering (MOEASSC) (Xia et al., 2013).

Our proposed clustering algorithm deals with mixed data by combining the local search ability of the K-Prototypes clustering algorithm with the global search ability of MOGA to find K. In Fränti et al. (1997), He and Tan (2012), Liu and Yu (2005), Merz and Zell (2002), Rahman and Islam (2014), Scheunders (1997), Sheng et al. (2005) and Xia et al. (2013), K-Means is combined with GA. Minimization of the compactness of clusters and maximization of their separateness are the two objectives of the MOGA. Cluster centroids encoded in chromosomes are indivisible, and we create an IP by a combination of random selection and deterministic selection of genes or cluster centers, following the work in Rahman and Islam (2014). The rest of the algorithm is quite different from their work; the most important difference is that theirs (Rahman & Islam, 2014) is single-objective while the proposed one is multi-objective. In this work, we use prototype-predicated encoding and a special crossover operator to deal with variable length chromosomes. We devise a novel majority voting technique based on several cluster validity indices to choose a solution from the set of solutions lying on the non-dominated front. The proposed MOGA based clustering algorithm has the following merits over many other clustering algorithms.

  • 1.

    It does not require a user-defined K or radius of a cluster.

  • 2.

    It avoids dependency of final clustering solutions on the initial cluster centroids selected.

  • 3.

    It avoids sticking in local minima and has a higher chance of reaching global optima.

  • 4.

    It works with continuous and categorical featured data sets.

  • 5.

    It can work with data sets having missing feature values.

  • 6.

    It combines the advantages of random and deterministic IP creation.

  • 7.

    It optimizes two measures simultaneously to obtain a set of high-quality non-dominated solutions.

  • 8.

    The final solution is selected by majority voting on all non-dominated solutions based on some clustering validity indices not used as optimization measures.

  • 9.

    It combines the local search ability of K-Prototypes and the global search ability of GA.

  • 10.

    Instead of SOGA, it uses MOGA.

  • 11.

    Genetic operators are designed by taking care of context-sensitivity and cluster-orientation.

Here we have reviewed more than 50 GA based clustering algorithms. As it is a mature field, interested readers may consult the following survey papers (Fahad, Alshatri, Tari, Alamri, Khalil, Zomaya, Foufou, Bouras, 2014, Hruschka, Campello, Freitas, et al., 2009, Jain, 2010, José-García, Gómez-Flores, 2016, Mukhopadhyay, Maulik, Bandyopadhyay, 2015, Mukhopadhyay, Maulik, Bandyopadhyay, Coello, 2014) for further reading. To the best of our knowledge, no MOGA based automatic clustering algorithm handling both numeric and categorical features is yet available, although at the preprocessing phase one can convert numeric features to categorical features or vice versa before applying a clustering algorithm, which results in loss of information. However, the coexistence of both types of features in real-life problems is very common, which motivates us to devise the proposed MOGA based clustering method that works with both types of features.

We organize the remaining part of the paper as follows. Section 2 explains some basic concepts relevant to this work. Section 3 describes the proposed algorithm. Section 4 presents the process of selecting the best chromosome from a set of chromosomes on the non-dominated front. Section 5 tabulates the test results and Section 6 concludes the paper.


Basic preliminaries

Data set D, shown in Table 1, is a set of m records/data points R = {R1, R2, …, Rm} with n attributes/features A = {A1, A2, …, An}. An attribute Ai can have numerical or categorical values.

A record Rj (1 ≤ j ≤ m) is an n-dimensional feature vector, i.e. an ordered list of n values, Rj = [Vj1, Vj2, …, Vjn], where Vji ∈ dom(Ai), with 1 ≤ i ≤ n. dom(Ai) is the domain of the ith attribute, and Vji is the ith feature value of the jth record Rj. Each record Rj belongs to a predefined class, represented by Vjn, where Vjn ∈ dom(An).

Proposed algorithm

Fig. 1 presents the flowchart of the proposed MOGA based automatic clustering algorithm.

Steps of the proposed algorithm are described below.

Selection of the best chromosome

The last generation of MOGA selects some chromosomes lying on the Pareto-optimal front. Each chromosome provides a clustering solution, so the next job is to select the best solution from this set of solutions. We do this by using the nine clustering indices discussed below. Clustering indices measure the goodness of clustering solutions and can rank them in terms of their goodness (Xie & Beni, 1991). Eqs. (2), (3) and (4) are used to calculate distances for the clustering indices.

Testing

We use 25 benchmark data sets from the University of California at Irvine (UCI) machine learning repository (Asuncion & Newman, 2007) for testing the algorithm’s performance. For the statistical test, the minimum required sample size is 20 (url: http://www.socscistatistics.com/tests/signedranks/Default2.aspx). Researchers have used 14 data sets to statistically compare the performance of different classifiers (Demšar, 2006, Garcia, Herrera, 2008). In Rahman and Islam (2014), 20 data sets are used to compare

Conclusions

The proposed algorithm (MOGAKP) can deal with different types of features, i.e. continuous and categorical, and with missing feature values, as some of the data sets used for testing have mixed features and missing feature values. To find clustering solutions, the user does not have to specify K or the radii of clusters; the algorithm thus performs automatic clustering. The statistical test shows the superiority of MOGAKP over SOGA and SOGAKP. It supports the effectiveness of hybridization of global and local

Funding

This research did not receive any specific grant from funding agencies in the public, commercial or not-for-profit sectors.

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Dipankar Dutta: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Software, Visualization, Writing - original draft, Writing - review & editing. Jaya Sil: Funding acquisition, Project administration, Resources, Software, Supervision, Validation, Writing - original draft, Writing - review & editing. Paramartha Dutta: Project administration, Resources, Software, Supervision, Validation, Writing - original draft, Writing -

References (127)

  • H. He et al.

    A two-stage genetic algorithm for automatic clustering

    Neurocomputing

    (2012)
  • E.R. Hruschka et al.

    Evolving clusters in gene-expression data

    Information Sciences

    (2006)
  • A.K. Jain

    Data clustering: 50 years beyond K-means

    Pattern Recognition Letters

    (2010)
  • A. José-García et al.

    Automatic clustering using nature-inspired metaheuristics: A survey

    Applied Soft Computing

    (2016)
  • M. Laszlo et al.

    A genetic algorithm that exchanges neighboring centers for k-means clustering

    Pattern Recognition Letters

    (2007)
  • Y. Liu et al.

    Automatic clustering using genetic algorithms

    Applied Mathematics and Computation

    (2011)
  • C.B. Lucasius et al.

    On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison

    Analytica Chimica Acta

    (1993)
  • U. Maulik et al.

    Genetic algorithm-based clustering technique

    Pattern Recognition

    (2000)
  • A. Mukhopadhyay et al.

    Towards improving fuzzy clustering using support vector machine: Application to gene expression data

    Pattern Recognition

    (2009)
  • A. Mukhopadhyay et al.

    A multiobjective approach to MR brain image segmentation

    Applied Soft Computing

    (2011)
  • M.K. Pakhira et al.

    A study of some fuzzy cluster validity indices, genetic clustering and application to pixel classification

    Fuzzy Sets and Systems

    (2005)
  • R.B. Agrawal et al.

    Simulated binary crossover for continuous search space

    Complex Systems

    (1995)
  • V.S. Alves et al.

    Towards a fast evolutionary algorithm for clustering

    2006 IEEE international conference on evolutionary computation

    (2006)
  • V.S. Alves et al.

    A fuzzy variant of an evolutionary algorithm for clustering

    2007 IEEE international fuzzy systems conference

    (2007)
  • R. Assareh et al.

    A novel many-objective clustering algorithm in mobile ad hoc networks

    Wireless Personal Communications

    (2017)
  • A. Asuncion et al.

    UCI machine learning repository

    (2007)
  • G.P. Babu et al.

    A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm

    Pattern Recognition Letters

    (1993)
  • F.B. Baker et al.

    A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering

    Journal of the American Statistical Association

    (1976)
  • S. Bandyopadhyay et al.

    Multiobjective genetic clustering for pixel classification in remote sensing imagery

    IEEE Transactions on Geoscience and Remote Sensing

    (2007)
  • J.C. Bezdek et al.

    Optimization of fuzzy clustering criteria using genetic algorithms

    Proceedings of the first IEEE conference on evolutionary computation

    (1994)
  • R.J. Campello et al.

    On the efficiency of evolutionary fuzzy clustering

    Journal of Heuristics

    (2009)
  • A. Casillas et al.

    Document clustering into an unknown number of clusters using a genetic algorithm

    International conference on text, speech and dialogue

    (2003)
  • E. Chen et al.

    Dynamic clustering using multi-objective evolutionary algorithm

    International conference on computational and information science

    (2005)
  • Y. Cheng

    Mean shift, mode seeking, and clustering

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1995)
  • R.M. Cole

    Clustering with genetic algorithms

    (1998)
  • D.W. Corne et al.

    PESA-II: Region-based selection in evolutionary multiobjective optimization

    Proceedings of the 3rd annual conference on genetic and evolutionary computation

    (2001)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1979)
  • K. Deb et al.

    An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: Solving problems with box constraints

    IEEE Transactions on Evolutionary Computation

    (2014)
  • K. Deb et al.

    A fast and elitist multiobjective genetic algorithm: NSGA-II

    IEEE Transactions on Evolutionary Computation

    (2002)
  • G.N. Demir et al.

    Graph-based sequence clustering through multiobjective evolutionary algorithms for web recommender systems

    Proceedings of the 9th annual conference on genetic and evolutionary computation

    (2007)
  • J. Demšar

    Statistical comparisons of classifiers over multiple data sets

    Journal of Machine Learning Research

    (2006)
  • C. Ding et al.

    K-nearest-neighbor consistency in data clustering: incorporating local information into global optimization

    Proceedings of the 2004 ACM symposium on applied computing

    (2004)
  • J.C. Dunn

    Well-separated clusters and optimal fuzzy partitions

    Journal of Cybernetics

    (1974)
  • D. Dutta et al.

    Clustering by multi objective genetic algorithm

    Proceedings of the 1st IEEE international conference on recent advances in information technology

    (2012)
  • D. Dutta et al.

    Clustering data set with categorical feature using multi objective genetic algorithm

    Proceedings of the IEEE international conference on data science engineering

    (2012)
  • D. Dutta et al.

    Data clustering with mixed features by multi objective genetic algorithm

    Proceedings of the 12th IEEE international conference on hybrid intelligent systems

    (2012)
  • D. Dutta et al.

    Categorical feature reduction using multi objective genetic algorithm in cluster analysis

    Transactions on computational science XXI

    (2013)
  • D. Dutta et al.

    Simultaneous continuous feature selection and K clustering by multi objective genetic algorithm

    Proceedings of the 3rd IEEE international advance computing conference

    (2013)
  • D. Dutta et al.

    Simultaneous feature selection and clustering with mixed features by multi objective genetic algorithm

    International Journal on Hybrid Intelligence Systems (IJHIS)

    (2014)
  • D. Dutta et al.

    Evolution of genetic algorithms in classification rule mining

    Handbook of Research on Computational Intelligence for Engineering, Science, and Business, vol. 1

    (2012)