Elsevier

Neurocomputing

Volume 330, 22 February 2019, Pages 116-126

Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data

https://doi.org/10.1016/j.neucom.2018.11.016

Highlights

  • Employ intuitionistic fuzzy set theory in fuzzy clustering for categorical attributes.

  • Use a new similarity measure for categorical data, based on the frequency probability-based distance metric, to calculate the dissimilarity between objects.

  • Weight each categorical attribute according to its importance by iteratively updating the attribute weights during the clustering process.

  • Search for the global optimal solution using a genetic algorithm (GA).

  • Provide an unsupervised feature selection process that removes redundant features from the original dataset before the GA process is performed.

Abstract

Clustering of data with categorical attributes is widely used in real-world applications. Most existing clustering algorithms for categorical data have two major drawbacks: they may terminate at a local optimal solution, and they treat all attributes as equally important. This study therefore proposes a novel clustering method, the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm, based on the conventional fuzzy k-modes algorithm and the genetic algorithm (GA). The study first introduces the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the intuitionistic fuzzy set in the clustering process together with a new similarity measure for categorical data based on the frequency probability-based distance metric. The GIWFKM algorithm then integrates the IWFKM algorithm with the GA to search for the global optimal solution. Moreover, the GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, which can both improve the clustering performance and reduce the computational time. To evaluate the clustering results, a series of experiments on different categorical datasets compares the proposed algorithms with benchmark algorithms including fuzzy k-modes, weighted fuzzy k-modes, genetic fuzzy k-modes, space structure-based clustering, and many-objective fuzzy centroids clustering. The experimental results on datasets collected from the UCI machine learning repository show that the GIWFKM algorithm outperforms the benchmark algorithms in terms of the Adjusted Rand Index (ARI) and clustering accuracy (CA).
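As an illustration of the first evaluation measure named above, the following is a minimal sketch of the standard pair-counting definition of the adjusted Rand index (ARI). It is not taken from the paper; the function name `adjusted_rand_index` and the handling of degenerate inputs are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index between two labelings (illustrative sketch)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same objects")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two objects counts as perfect agreement

    # Pairs of objects that are grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together within each labeling on its own.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # assumption: both labelings are trivial and identical in structure
    return (sum_cells - expected) / (maximum - expected)


# Identical partitions with swapped label names still score 1.0.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every true cluster scores below zero.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how the clusters are named, and close to 0 (possibly negative) when the agreement is no better than chance.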

Introduction

Data clustering is an unsupervised learning technique that partitions a given dataset into multiple clusters such that objects in a cluster are similar to each other and distinct from the objects that belong to other clusters [1]. The clustering process aims to reveal the hidden structure of unlabeled data instances in various applications, such as pattern recognition, market research, decision making, and medical applications. Most clustering algorithms are designed for numerical data, for which a standard distance measure gives the distance between any pair of data instances directly. Clustering of categorical data has received less attention than clustering of numerical data because of the difficulty inherent in the nature of the data. Categorical attributes lack an inherent order, which makes it difficult to define a proximity measure between two data objects [2].

The classic approach to categorical data clustering is to extend an existing clustering algorithm for numerical data with a dissimilarity measure suited to categorical attributes. For instance, the first conventional algorithm for categorical data, the k-modes algorithm proposed by Huang [3], is an extension of the k-means algorithm that uses the Hamming distance and the cluster mode as the cluster center instead of the Euclidean distance and the cluster mean. Similarly, the fuzzy k-modes algorithm [4] is an extension of the fuzzy k-means algorithm for categorical data. Since then, clustering algorithms for categorical data have attracted increasing attention because of the variety of categorical data in real-world problems. These algorithms include both single-objective and multi-objective methods, such as ROCK [5], CACTUS [6], COOLCAT [7], LIMBO [8], wk-modes [9], MOGA [10], NSGA-FMC [11], SBC [12], and MOFC [13]. However, most of the existing algorithms have one of two major drawbacks that can reduce the clustering performance: some algorithms treat all attributes equally when calculating the dissimilarity between two objects, while others may terminate at a local optimal solution.
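The two building blocks of k-modes described above, the Hamming (simple matching) dissimilarity and the cluster mode, can be sketched as follows. This is a minimal illustration rather than the implementation from [3]; the function names and the toy cluster are assumptions made for this example.

```python
from collections import Counter


def hamming_dissimilarity(x, y):
    """Count the attributes on which two categorical objects disagree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))


def cluster_mode(objects):
    """Return the attribute-wise most frequent category of a cluster."""
    # zip(*objects) iterates over attributes (columns) instead of objects (rows).
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


# Hypothetical toy cluster with three categorical attributes.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)
distances = [hamming_dissimilarity(obj, mode) for obj in cluster]
```

Note that every mismatching attribute contributes exactly 1 to the dissimilarity, which is the equal treatment of attributes that the paragraph above identifies as a drawback.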

Recently, the intuitionistic fuzzy set (IFS), which was first introduced by Atanassov [14] based on the concept of fuzzy set theory, has been used in data clustering to enhance the clustering performance. The IFS is a generalization of the fuzzy set and is commonly used for handling uncertainty. An IFS is described by three parameters: the membership, non-membership, and hesitation degrees. Xu et al. [15] reported a clustering algorithm for IFSs that classifies the IFSs by constructing the association matrix and the equivalent association matrix. Xu [16] applied the IFS to hierarchical clustering to deal with uncertain data, based on the distance measure between IFSs and the intuitionistic fuzzy aggregation operator. Similarly, some studies developed clustering techniques by combining the IFS with the fuzzy c-means algorithm, such as the intuitionistic fuzzy c-means algorithm [15] and the intuitionistic fuzzy possibilistic c-means clustering algorithm [17]. In addition, Xu et al. [18] integrated the IFS with spectral clustering to improve the clustering performance and to obtain the global optimal solution. The existing methods are generally based on either distance measures or intuitionistic fuzzy information; however, some of them cannot guarantee the global optimal solution [18]. Moreover, they are all designed for numerical datasets.
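The three IFS parameters mentioned above are linked by the standard constraint that the membership and non-membership degrees sum to at most 1, with the hesitation degree as the remainder. The sketch below illustrates that relation only; the class name `IntuitionisticValue` and the numerical tolerance are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntuitionisticValue:
    """One element's membership and non-membership degrees in an IFS."""

    membership: float
    non_membership: float

    def __post_init__(self):
        for degree in (self.membership, self.non_membership):
            if not 0.0 <= degree <= 1.0:
                raise ValueError("degrees must lie in [0, 1]")
        # Small tolerance (an assumption) to absorb floating-point rounding.
        if self.membership + self.non_membership > 1.0 + 1e-12:
            raise ValueError("membership + non-membership must not exceed 1")

    @property
    def hesitation(self):
        """Hesitation degree: the part explained by neither of the two degrees."""
        return 1.0 - self.membership - self.non_membership


value = IntuitionisticValue(membership=0.5, non_membership=0.25)
crisp = IntuitionisticValue(membership=0.7, non_membership=0.3)
```

An ordinary fuzzy set is the special case in which the hesitation degree is zero for every element, as for `crisp` in the example.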

To overcome the aforementioned drawbacks of the existing algorithms and to exploit the potential of the IFS for improving clustering performance, this study proposes a novel clustering algorithm for categorical data, the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm. This algorithm combines the conventional fuzzy k-modes algorithm [4] with the IFS. We first introduce the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the IFS in the clustering process. The IWFKM algorithm assigns a different importance to each attribute by updating the weight vector for the categorical attributes in each iteration. In addition, the IWFKM algorithm replaces the Hamming distance with a new similarity measure, the frequency probability-based distance metric, which has been shown to improve the clustering result [19]. The proposed GIWFKM algorithm then integrates the IWFKM algorithm with the genetic algorithm (GA) to search for the global optimal solution. The GA is chosen because it is a search and optimization technique that has been applied across a wide range of problem domains [20]. Moreover, the GA has been applied in many clustering approaches for both numerical and categorical data to improve the clustering performance, e.g., the genetic k-means algorithm [21], genetic fuzzy c-means [22], and genetic fuzzy k-modes (GFKM) [23]. Finally, the proposed GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, thereby improving the clustering performance and reducing the computational time.
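To give a rough idea of how attribute weights and fuzzy memberships interact, the sketch below combines a weighted mismatch dissimilarity with the standard fuzzy membership update used by fuzzy k-means and fuzzy k-modes. It rests on several assumptions: the weighted Hamming dissimilarity is a stand-in for the frequency probability-based metric, the weights are fixed by hand rather than updated by the IWFKM rule, and neither the IFS step nor the GA is included. All function names are hypothetical.

```python
def weighted_dissimilarity(x, mode, weights):
    """Sum the weights of the attributes on which x and the mode disagree."""
    # Stand-in measure (assumption): the paper uses a frequency probability-based metric.
    return sum(w for a, b, w in zip(x, mode, weights) if a != b)


def fuzzy_memberships(distances, m=2.0):
    """Memberships of one object in each cluster, given its distance to each mode."""
    if m <= 1.0:
        raise ValueError("the fuzziness exponent m must be greater than 1")
    zero = [d == 0 for d in distances]
    if any(zero):
        # The object coincides with at least one mode; sharing the membership
        # equally among coinciding modes is a tie-breaking choice made here.
        share = 1.0 / sum(zero)
        return [share if is_zero else 0.0 for is_zero in zero]
    exponent = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_l / d_h) ** exponent for d_h in distances)
        for d_l in distances
    ]


# Hypothetical object, modes, and hand-picked attribute weights.
obj = ("red", "small", "round")
modes = [("red", "large", "square"), ("blue", "small", "round")]
weights = (0.5, 0.3, 0.2)

dists = [weighted_dissimilarity(obj, mode, weights) for mode in modes]
memberships = fuzzy_memberships(dists, m=2.0)
```

With equal distances to both modes, as in this toy case, the object belongs to each cluster with membership 0.5; changing the weights shifts the distances and therefore the memberships, which is the effect an attribute-weighting scheme relies on.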

The rest of this paper is organized as follows. Section 2 reviews the related literature, including the fuzzy k-modes algorithm, the weighted fuzzy k-modes algorithm, and the IFS theory. The proposed algorithms are introduced in Section 3, while Section 4 presents a series of experiments and their results. Finally, the conclusion and future research directions are summarized in Section 5.

Section snippets

Literature review

This section first reviews the fuzzy k-modes and weighted fuzzy k-modes algorithms. It then describes the IFS theory with two generating functions.

Proposed genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed algorithm, GIWFKM, is described in this section. We first introduce the IWFKM algorithm, which integrates the IFS with the WFKM algorithm. Moreover, the IWFKM algorithm uses the frequency probability-based distance metric instead of the Hamming distance to calculate the dissimilarity between data instances. The proposed GIWFKM algorithm, which combines the IWFKM algorithm and the GA, is then expected to find the global optimal solution of the clustering process. In the proposed

Datasets and parameter setting

In this study, the experimental datasets are collected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Twelve categorical datasets with a variety of dimensionalities are selected. For instance, the Lung dataset has the largest dimensionality, with 56 attributes, while the two smallest ones, the Breast Cancer and Tic-tac-toe datasets, have only 9 attributes each. Table 1 provides a brief description of the datasets used in this study.

In addition, several benchmark

Conclusion

First, the proposed IWFKM algorithm, which integrates the IFS and WFKM algorithm, is investigated experimentally in this study. The proposed IWFKM algorithm provides some novel enhancements, for instance, employing the IFS to improve clustering result, considering each categorical attribute differently according to the weight vector, and using the frequency probability-based distance metric to estimate the distance between data instances instead of using the Hamming distance. The results

Acknowledgment

This study was financially supported by the Ministry of Science and Technology of the Taiwanese Government, under contracts MOST 105-2410-H-011-017-MY3 and MOST 106-2811-H-011-002. This support is gratefully acknowledged.

References (31)

  • I. Heloulou et al. A multi-act sequential game-based multi-objective clustering approach for categorical data. Neurocomputing (2017)

  • M. Hoffman et al. A note on using the adjusted Rand index for link prediction in networks. Soc. Netw. (2015)

  • P.-N. Tan et al. Introduction to Data Mining (2006)

  • S. Boriah et al. Similarity measures for categorical data: a comparative evaluation

  • Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. (1998)

    R.J. Kuo received the M.S. degree in Industrial and Manufacturing Systems Engineering from Iowa State University, Ames, IA, in 1990 and the Ph.D. degree in Industrial and Management Systems Engineering from the Pennsylvania State University, University Park, PA, in 1994.

    Currently, he is the Distinguished Professor in the Department of Industrial Management at National Taiwan University of Science and Technology, Taiwan. He has published almost 100 papers in international journals, such as Information Sciences, Neural Networks, Decision Support Systems, European Journal of Operational Research, and Applied Soft Computing. His research interests include architecture issues of computational intelligence and their applications in data mining, electronic business, production management, supply chain management, and decision support systems.

    Thi Phuong Quyen Nguyen received the B.S. degree in industrial systems engineering from the Ho Chi Minh City University of Technology, Vietnam, in 2008, and the M.S. and Ph.D. degrees in industrial management from the National Taiwan University of Science and Technology, Taiwan, in 2013 and 2016, respectively.

    She is currently a Postdoctoral Research Fellow with the Department of Industrial Management, National Taiwan University of Science and Technology. Her research interests include data mining, machine learning, and meta-heuristic approaches.
