Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data

doi:10.1016/j.neucom.2018.11.016

Neurocomputing

Volume 330, 22 February 2019, Pages 116-126

https://doi.org/10.1016/j.neucom.2018.11.016

Highlights

• Employ the intuitionistic fuzzy set theory in fuzzy clustering for categorical attributes.
• Use the new similarity measure for categorical data, which is based on the frequency probability-based distance metric, to calculate the dissimilarity measure.
• Consider the importance of each categorical attribute differently by updating the weight for each categorical attribute in the clustering process iteratively.
• Exploit the global optimal solution by genetic algorithm (GA).
• Provide the unsupervised feature selection process to remove the redundant features of the original dataset prior to performing GA process.

Abstract

Clustering of data with categorical attributes is widely used in real-world applications. Most existing clustering algorithms for categorical data have two major drawbacks: they may terminate at a local optimal solution, and they treat all attributes as equally important. This study therefore proposes a novel clustering method, the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm, based on the conventional fuzzy k-modes algorithm and the genetic algorithm (GA). The study first introduces the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the intuitionistic fuzzy set in the clustering process together with a new similarity measure for categorical data based on the frequency probability-based distance metric. The GIWFKM algorithm then integrates the IWFKM algorithm with the GA to search for the global optimal solution. Moreover, the GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, which can both improve the clustering performance and reduce the computational time. To evaluate the clustering results, a series of experiments on different categorical datasets compares the proposed algorithms with benchmark algorithms, including the fuzzy k-modes, weighted fuzzy k-modes, genetic fuzzy k-modes, space structure-based clustering, and many-objective fuzzy centroids clustering algorithms. The experimental results on datasets collected from the UCI machine learning repository show that the GIWFKM algorithm outperforms the benchmark algorithms in terms of the Adjusted Rand Index (ARI) and clustering accuracy (CA).

Introduction

Data clustering is an unsupervised learning technique that partitions a given dataset into multiple clusters such that objects in a cluster are similar to each other and distinct from objects in other clusters [1]. The clustering process aims to reveal the hidden structure of unlabeled data instances in various applications, such as pattern recognition, market research, decision making, and medical applications. Most clustering algorithms are designed for numerical data, for which a standard distance measure gives the distance between any pair of data instances directly. Clustering of categorical data has received less attention than clustering of numerical data because of the difficulty inherent in the nature of the data. Categorical attributes have no inherent order, which makes it difficult to define a proximity measure between two data objects [2].

The classic approach to categorical data clustering is to extend an existing clustering algorithm for numerical data with a dissimilarity measure suited to categorical attributes. For instance, the first conventional algorithm for categorical data, the k-modes algorithm proposed by Huang [3], is an extended version of the k-means algorithm that uses the Hamming distance instead of the Euclidean distance and the cluster mode instead of the cluster mean to represent the cluster center. Similarly, the fuzzy k-modes algorithm [4] is an extended version of the fuzzy k-means algorithm for categorical data. Since then, clustering algorithms for categorical data have received progressively more attention because of the variety of categorical data in real-world problems. These algorithms include both single-objective and multi-objective methods, such as ROCK [5], CACTUS [6], COOLCAT [7], LIMBO [8], wk-modes [9], MOGA [10], NSGA-FMC [11], SBC [12], and MOFC [13]. However, most of the existing algorithms face two major drawbacks that can reduce the clustering performance: some algorithms consider all attributes equally when calculating the dissimilarity between two objects, while some may terminate at a local optimal solution.
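The two ingredients that k-modes substitutes for the Euclidean distance and the cluster mean can be illustrated with a minimal Python sketch. The function names `hamming` and `cluster_mode` and the toy data are hypothetical and chosen only for illustration; this is not the implementation used in the paper.

```python
from collections import Counter

def hamming(x, z):
    """Simple-matching (Hamming) dissimilarity: number of attributes on which x and z differ."""
    return sum(a != b for a, b in zip(x, z))

def cluster_mode(objects):
    """Attribute-wise most frequent category among the objects of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

# Toy cluster of three objects with three categorical attributes (illustrative data only).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)
print(mode)
print(hamming(("blue", "large", "square"), mode))
```

Because every mismatch counts as 1 regardless of the attribute, this dissimilarity treats all attributes equally, which is the first of the two drawbacks noted above.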

Recently, the intuitionistic fuzzy set (IFS), which was first introduced by Atanassov [14] based on the concept of fuzzy set theory, has been used in data clustering to enhance the clustering performance. The IFS is a generalization of the fuzzy set and is commonly used for handling uncertainty. An IFS is described by three parameters: the membership, non-membership, and hesitation degrees. Xu et al. [15] reported a clustering algorithm for IFSs that classifies the IFSs by constructing the association matrix and the equivalent association matrix. Xu [16] applied the IFS to hierarchical clustering to deal with uncertain data, based on the distance measure between IFSs and the intuitionistic fuzzy aggregation operator. Similarly, some studies developed clustering techniques by combining the IFS with the fuzzy c-means algorithm, such as the intuitionistic fuzzy c-means algorithm [15] and the intuitionistic fuzzy possibilistic c-means clustering algorithm [17]. Xu et al. [18] also integrated the IFS with spectral clustering to improve the clustering performance and to obtain the global optimal solution. The existing methods are generally based on either distance measures or intuitionistic fuzzy information; however, some of them cannot guarantee the global optimal solution [18]. Moreover, they are all designed for numerical datasets.
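The relation among the three IFS parameters can be made concrete with a short Python sketch. In Atanassov's definition the membership degree mu and the non-membership degree nu satisfy mu + nu <= 1, and the hesitation degree is pi = 1 - mu - nu. Deriving nu from mu with a Sugeno-type complement and the parameter `lam` are assumptions made here for illustration; the text above does not state which generating functions the paper uses.

```python
def hesitation(mu, nu):
    """Hesitation degree pi = 1 - mu - nu of an intuitionistic fuzzy element."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0 and mu + nu <= 1.0 + 1e-12):
        raise ValueError("an IFS requires 0 <= mu, nu <= 1 and mu + nu <= 1")
    return max(0.0, 1.0 - mu - nu)

def sugeno_non_membership(mu, lam=2.0):
    """Assumed Sugeno-type complement: nu = (1 - mu) / (1 + lam * mu), with lam > 0."""
    return (1.0 - mu) / (1.0 + lam * mu)

mu = 0.6
nu = sugeno_non_membership(mu, lam=2.0)
pi = hesitation(mu, nu)
print(mu, nu, pi)  # the three degrees sum to 1
```

With lam > 0 the complement never exceeds 1 - mu, so the hesitation degree stays non-negative; with lam = 0 it reduces to the ordinary fuzzy complement and the hesitation vanishes.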

To overcome the aforementioned drawbacks of the existing algorithms and to exploit the potential of the IFS for improving the clustering performance, this study proposes a novel clustering algorithm for categorical data, the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm. This algorithm combines the conventional fuzzy k-modes algorithm [4] with the IFS. We first introduce the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the IFS in the clustering process. The IWFKM algorithm considers the importance of each attribute differently by updating the weight vector for the categorical attributes in each iteration. In addition, the IWFKM algorithm replaces the Hamming distance with a new similarity measure, the frequency probability-based distance metric, which has been shown to improve the clustering result [19]. Then, the proposed GIWFKM algorithm integrates the IWFKM algorithm with the genetic algorithm (GA) to search for the global optimal solution. The GA is chosen because it is a search and optimization technique that has been applied to a wide range of problem domains [20]. Moreover, the GA has been applied in many clustering approaches for both numerical and categorical data to improve the clustering performance, e.g., the genetic k-means algorithm [21], genetic fuzzy c-means [22], and genetic fuzzy k-modes (GFKM) [23]. In addition, the proposed GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, thereby improving the clustering performance and reducing the computational time.
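How an attribute weight vector changes the dissimilarity can be sketched in Python as follows. This is only an illustration of the weighting idea on top of simple matching: the function name `weighted_mismatch`, the exponent `beta`, and the example weights are hypothetical, and neither the paper's weight update rule nor its frequency probability-based distance metric is reproduced here.

```python
def weighted_mismatch(x, z, weights, beta=2.0):
    """Sum of weights[j] ** beta over the attributes j on which x and z differ.

    weights: one non-negative weight per attribute (assumed here to sum to 1).
    beta: assumed exponent controlling how strongly the weights act.
    """
    return sum(w ** beta for a, b, w in zip(x, z, weights) if a != b)

weights = [0.5, 0.3, 0.2]  # illustrative weights, not learned from data
x = ("a", "b", "c")
z = ("a", "x", "y")
print(weighted_mismatch(x, z, weights))  # mismatches on attributes 2 and 3 only
```

A mismatch on a heavily weighted attribute now contributes more than a mismatch on a lightly weighted one, in contrast to the plain Hamming distance, where each mismatch counts as 1.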

The rest of this paper is organized as follows. Section 2 reviews the related literature, including the fuzzy k-modes algorithm, the weighted fuzzy k-modes algorithm, and the IFS theory. The proposed algorithms are introduced in Section 3, while Section 4 presents a series of experiments and their results. Finally, the conclusions and future research directions are summarized in Section 5.

Section snippets

Literature review

This section firstly reviews fuzzy k-modes and weighted fuzzy k-modes algorithms. Then the IFS theory with two generating functions is also described.
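As a reminder of the standard fuzzy k-modes step, the following Python sketch computes the memberships of one object from its dissimilarities to the k cluster modes, using the usual update u_l = 1 / sum_h (d_l / d_h)^(1/(m-1)) with fuzzifier m > 1. The function name `fuzzy_memberships` is hypothetical, and sharing the membership equally among several zero-distance modes is a simplifying assumption made here, not something taken from the paper.

```python
def fuzzy_memberships(distances, m=2.0):
    """Fuzzy k-modes memberships of one object, given its distances to the k modes."""
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    zero = [i for i, d in enumerate(distances) if d == 0]
    if zero:
        # The object coincides with at least one mode: all membership goes there
        # (split equally on ties, an assumption of this sketch).
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(distances))]
    exponent = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_l / d_h) ** exponent for d_h in distances)
        for d_l in distances
    ]

u = fuzzy_memberships([1, 2], m=2.0)
print(u)  # the closer mode receives the larger membership; the values sum to 1
```

The memberships of an object always sum to 1, and a smaller dissimilarity to a mode yields a larger membership in that cluster.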

Proposed genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed GIWFKM algorithm is described in this section. We first introduce the IWFKM algorithm, which integrates the IFS with the WFKM algorithm. Moreover, the IWFKM algorithm uses the frequency probability-based distance metric instead of the Hamming distance to calculate the dissimilarity between data instances. The proposed GIWFKM algorithm, which combines the IWFKM algorithm with the GA, is then expected to find the global optimal solution of the clustering process. In the proposed

Datasets and parameter setting

In this study, the experimental datasets are collected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Twelve categorical datasets with a variety of dimensionalities are selected. For instance, the Lung dataset has the largest dimensionality, with 56 attributes, while the two smallest ones, the Breast Cancer and Tic-tac-toe datasets, have only 9 attributes. Table 1 provides a brief description of the datasets used in this study.

In addition, several benchmark

Conclusion

First, the proposed IWFKM algorithm, which integrates the IFS with the WFKM algorithm, is investigated experimentally in this study. The proposed IWFKM algorithm provides several novel enhancements: it employs the IFS to improve the clustering result, treats each categorical attribute differently according to the weight vector, and uses the frequency probability-based distance metric instead of the Hamming distance to estimate the distance between data instances. The results

Acknowledgment

This study was financially supported by the Ministry of Science and Technology of the Taiwanese Government, under contracts MOST 105-2410-H-011-017-MY3 and MOST 106-2811-H-011-002. This support is gratefully appreciated.

R.J. Kuo received the M.S. degree in Industrial and Manufacturing Systems Engineering from Iowa State University, Ames, IA, in 1990 and the Ph.D. degree in Industrial and Management Systems Engineering from the Pennsylvania State University, University Park, PA, in 1994.

References (31)

S. Guha et al.
ROCK: A robust clustering algorithm for categorical attributes
Inf. Syst
(2000)
F. Cao et al.
A weighting k-modes algorithm for subspace clustering of categorical data
Neurocomputing
(2013)
C.-L. Yang et al.
Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering
Appl. Soft Comput.
(2015)
S. Zhu et al.
Many-objective fuzzy centroids clustering algorithm for categorical data
Expert Syst. Appl.
(2018)
K.T. Atanassov
Intuitionistic fuzzy sets
Fuzzy Sets Syst
(1986)
Z. Xu et al.
Clustering algorithm for intuitionistic fuzzy sets
Inform. Sci.
(2008)
D. Xu et al.
A spectral clustering algorithm based on intuitionistic fuzzy information
Knowl. Based Syst
(2013)
G. Gan et al.
A genetic fuzzy k-Modes algorithm for clustering categorical data
Expert Syst. Appl.
(2009)
A. Saha et al.
Categorical fuzzy k-modes clustering with automated feature weight learning
Neurocomputing
(2015)
A. Ahmad et al.
A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set
Pattern Recogn. Lett.
(2007)
I. Heloulou et al.
A multi-act sequential game-based multi-objective clustering approach for categorical data
Neurocomputing
(2017)
M. Hoffman et al.
A note on using the adjusted Rand index for link prediction in networks
Soc. Netw.
(2015)
P.-N. Tan et al.
Introduction to Data Mining
(2006)
S. Boriah et al.
Similarity measures for categorical data: a comparative evaluation
Z. Huang
Extensions to the k-means algorithm for clustering large data sets with categorical values
Data Min. Knowl. Discov.
(1998)


Currently, he is the Distinguished Professor in the Department of Industrial Management at National Taiwan University of Science and Technology, Taiwan. He has published almost 100 papers in international journals, such as Information Sciences, Neural Networks, Decision Support Systems, European Journal of Operational Research, and Applied Soft Computing. His research interests include architecture issues of computational intelligence and their applications in data mining, electronic business, production management, supply chain management, and decision support systems.

Thi Phuong Quyen Nguyen received the B.S. degree in industrial systems engineering from the Ho Chi Minh City University of Technology, Vietnam, in 2008, the M.S. and Ph.D. degrees in industrial management from the National Taiwan University of Science and Technology, Taiwan, in 2013 and 2016, respectively.

She is currently a Postdoctoral Research Fellow with the Department of Industrial Management, National Taiwan University of Science and Technology. Her research interests include data mining, machine learning, and meta-heuristic approaches.
