Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data
Introduction
Data clustering is an unsupervised learning technique that partitions a given dataset into multiple clusters such that objects in a cluster are similar to each other and distinct from the objects that belong to other clusters [1]. The clustering process aims to reveal the hidden structure of unlabeled data instances in various applications, such as pattern recognition, market research, decision making, and medical applications. Most clustering algorithms are designed for numerical data, for which a standard distance measure gives the distance between any pair of data instances directly. Clustering of categorical data has received less attention than clustering of numerical data because of the difficulty inherent in the nature of the data. Categorical attributes lack an inherent order, which makes it difficult to define a proximity measure between two data objects [2].
The classic approach to categorical data clustering is to extend an existing clustering algorithm for numerical data with a dissimilarity measure suited to categorical attributes. For instance, the first conventional algorithm for categorical data, the k-modes algorithm proposed by Huang [3], is an extension of the k-means algorithm that uses the Hamming distance instead of the Euclidean distance and the cluster mode instead of the cluster mean to represent the cluster center. Similarly, the fuzzy k-modes algorithm [4] extends the fuzzy k-means algorithm to categorical data. Since then, clustering algorithms for categorical data have received progressively more attention owing to the variety of categorical data in real-world problems. These algorithms include both single-objective and multi-objective approaches, such as ROCK [5], CACTUS [6], COOLCAT [7], LIMBO [8], wk-modes [9], MOGA [10], NSGA-FMC [11], SBC [12], and MOFC [13]. However, most of the existing algorithms suffer from two major drawbacks that can reduce clustering performance: some algorithms weight all attributes equally when calculating the dissimilarity between two objects, and some may terminate at a local optimum.
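The two ingredients that distinguish k-modes from k-means, the Hamming (simple matching) distance and the cluster mode, can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and chosen only for illustration; they are not taken from Huang [3].

```python
from collections import Counter


def hamming(x, y):
    """Simple matching dissimilarity: number of attributes on which x and y differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))


def cluster_mode(objects):
    """Attribute-wise most frequent category of a non-empty cluster.

    Ties are resolved in favour of the category seen first, so the mode
    is not unique in general.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


# Toy cluster with three categorical attributes (illustrative data only).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)        # ("red", "small", "round")
distance = hamming(cluster[1], mode)  # differs only on the second attribute
```

Every attribute contributes equally to this distance, which is the first of the two drawbacks noted above.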
Recently, the intuitionistic fuzzy set (IFS), first introduced by Atanassov [14] as an extension of fuzzy set theory, has been used in data clustering to enhance clustering performance. The IFS is a generalization of the fuzzy set and is commonly used for handling uncertainty. An IFS is described by three parameters: the membership, non-membership, and hesitation degrees. Xu et al. [15] reported a clustering algorithm for IFSs that classifies them by constructing the association matrix and the equivalent association matrix. Xu [16] applied the IFS to hierarchical clustering to deal with uncertain data, based on a distance measure between IFSs and the intuitionistic fuzzy aggregation operator. Similarly, some studies developed clustering techniques by combining the IFS with the fuzzy c-means algorithm, such as the intuitionistic fuzzy c-means algorithm [15] and the intuitionistic fuzzy possibilistic c-means clustering algorithm [17]. In addition, Xu et al. [18] integrated the IFS with spectral clustering to improve clustering performance and to obtain the global optimal solution. The existing methods are generally based on either distance measures or intuitionistic fuzzy information; however, some of them cannot guarantee the global optimal solution [18]. Moreover, they are all designed for numerical datasets.
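As a small illustration of the three IFS parameters, the hesitation degree is what remains after the membership and non-membership degrees are accounted for, i.e. hesitation = 1 - membership - non-membership, with membership + non-membership at most 1. The sketch below encodes only this definition; the function name and the numerical tolerance are assumptions made for the example.

```python
def hesitation_degree(membership, non_membership, tol=1e-12):
    """Hesitation degree of an intuitionistic fuzzy set element.

    Requires both degrees to lie in [0, 1] and their sum not to exceed 1
    (up to a small floating-point tolerance `tol`, an assumption of this sketch).
    """
    if not (0.0 <= membership <= 1.0 and 0.0 <= non_membership <= 1.0):
        raise ValueError("degrees must lie in [0, 1]")
    if membership + non_membership > 1.0 + tol:
        raise ValueError("membership + non-membership must not exceed 1")
    # Clamp at zero to absorb rounding error when the degrees sum to 1.
    return max(0.0, 1.0 - membership - non_membership)


# An element that belongs with degree 0.6 and does not belong with degree 0.3
# leaves roughly 0.1 of hesitation.
pi = hesitation_degree(0.6, 0.3)
```

An ordinary fuzzy set is the special case in which the non-membership degree equals one minus the membership degree, so the hesitation degree is zero.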
To overcome the aforementioned drawbacks of the existing algorithms and to exploit the potential of the IFS for improving clustering performance, this study proposes a novel clustering algorithm for categorical data, the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm. This algorithm combines the conventional fuzzy k-modes algorithm [4] with the IFS. We first introduce the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the IFS in the clustering process. The IWFKM algorithm accounts for the differing importance of each attribute by updating a weight vector for the categorical attributes in each iteration. In addition, the IWFKM algorithm replaces the Hamming distance with a different measure, the frequency probability-based distance metric, which has been shown to improve the clustering result [19]. The proposed GIWFKM algorithm then integrates the IWFKM algorithm with a genetic algorithm (GA) to search for the global optimal solution. The GA is chosen because it is a search and optimization technique that has been applied to a wide range of problem domains [20]. Moreover, the GA has been applied in many clustering approaches for both numerical and categorical data to improve clustering performance, e.g., the genetic k-means algorithm [21], genetic fuzzy c-means [22], and genetic fuzzy k-modes (GFKM) [23]. Finally, the proposed GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, thereby improving clustering performance and reducing computational time.
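To make the role of the attribute weight vector more concrete, the following Python sketch shows a generic weighted fuzzy k-modes membership update. It is not the IWFKM or GIWFKM algorithm: it uses a weighted mismatch count rather than the frequency probability-based distance metric, omits the IFS and GA components, and treats the weights as given rather than learned. The function names, the fuzziness exponent `alpha`, and the toy data are assumptions made for illustration.

```python
def weighted_mismatch(x, mode, weights):
    """Sum of the weights of the attributes on which x and the mode differ."""
    return sum(w for a, b, w in zip(x, mode, weights) if a != b)


def fuzzy_memberships(x, modes, weights, alpha=2.0):
    """Membership degrees of object x in each cluster (standard fuzzy update rule).

    u_i = 1 / sum_l (d_i / d_l) ** (1 / (alpha - 1)), with alpha > 1.
    If x coincides with a mode, it gets full membership in that cluster.
    """
    dists = [weighted_mismatch(x, z, weights) for z in modes]
    zero = [i for i, d in enumerate(dists) if d == 0]
    if zero:
        return [1.0 if i == zero[0] else 0.0 for i in range(len(modes))]
    exponent = 1.0 / (alpha - 1.0)
    return [1.0 / sum((d_i / d_l) ** exponent for d_l in dists) for d_i in dists]


# Illustrative data: the first attribute is weighted twice as heavily as the others.
weights = [0.5, 0.25, 0.25]
modes = [("red", "large", "round"), ("blue", "large", "square")]
x = ("red", "small", "round")
u = fuzzy_memberships(x, modes, weights)  # distances 0.25 and 1.0 give 0.8 and 0.2
```

With equal weights the dissimilarity reduces to a scaled Hamming distance, so the weight vector is what lets some attributes influence the memberships more than others.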
The rest of this paper is organized as follows. Section 2 reviews related literature, including the fuzzy k-modes algorithm, the weighted fuzzy k-modes algorithm, and the IFS theory. The proposed algorithms are introduced in Section 3, while Section 4 presents a series of experiments and their results. Finally, the conclusions and future research directions are summarized in Section 5.
Section snippets
Literature review
This section first reviews the fuzzy k-modes and weighted fuzzy k-modes algorithms. The IFS theory with two generating functions is then described.
Proposed genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm
The proposed algorithm, GIWFKM, is described in this section. We first introduce the IWFKM algorithm, which integrates the IFS with the WFKM algorithm. The IWFKM algorithm uses the frequency probability-based distance metric instead of the Hamming distance to calculate the dissimilarity between data instances. The proposed GIWFKM algorithm, which combines the IWFKM algorithm with a GA, is then expected to find the global optimal solution of the clustering process. In the proposed
Datasets and parameter setting
In this study, the experimental datasets are collected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Twelve categorical datasets with a variety of dimensionalities are selected. For instance, the Lung dataset has the largest dimensionality, with 56 attributes, while the two smallest ones, the Breast Cancer and Tic-tac-toe datasets, have only 9 attributes each. Table 1 provides a brief description of the datasets used in this study.
In addition, several benchmark
Conclusion
First, the proposed IWFKM algorithm, which integrates the IFS and the WFKM algorithm, is investigated experimentally in this study. The proposed IWFKM algorithm provides several enhancements: it employs the IFS to improve the clustering result, treats each categorical attribute differently according to the weight vector, and uses the frequency probability-based distance metric instead of the Hamming distance to estimate the distance between data instances. The results
Acknowledgment
This study was financially supported by the Ministry of Science and Technology of the Taiwanese Government, under contracts MOST 105-2410-H-011-017-MY3 and MOST 106-2811-H-011-002. This support is gratefully acknowledged.
References (31)
- et al. ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst. (2000)
- et al. A weighting k-modes algorithm for subspace clustering of categorical data. Neurocomputing (2013)
- et al. Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering. Appl. Soft Comput. (2015)
- et al. Many-objective fuzzy centroids clustering algorithm for categorical data. Expert Syst. Appl. (2018)
- Intuitionistic fuzzy sets. Fuzzy Sets Syst. (1986)
- et al. Clustering algorithm for intuitionistic fuzzy sets. Inform. Sci. (2008)
- et al. A spectral clustering algorithm based on intuitionistic fuzzy information. Knowl. Based Syst. (2013)
- et al. A genetic fuzzy k-Modes algorithm for clustering categorical data. Expert Syst. Appl. (2009)
- et al. Categorical fuzzy k-modes clustering with automated feature weight learning. Neurocomputing (2015)
- et al. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn. Lett. (2007)
- A multi-act sequential game-based multi-objective clustering approach for categorical data. Neurocomputing
- A note on using the adjusted Rand index for link prediction in networks. Soc. Netw.
- Introduction to Data Mining
- Similarity measures for categorical data: a comparative evaluation
- Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov.
Cited by (20)
DP-k-modes: A self-tuning k-modes clustering algorithm
2022, Pattern Recognition Letters
FKMAWCW: Categorical fuzzy k-modes clustering with automated attribute-weight and cluster-weight learning
2021, Chaos, Solitons and Fractals
Citation Excerpt: The fuzzy k-modes (FKM) clustering algorithm [13] is one of the most popular clustering algorithms that is applied for clustering categorical data [14]. FKM has shown successful results in various applications such as [15–18]. In this method, a sample can be assigned to several clusters with different degrees of membership.
Feature weighting methods: A review
2021, Expert Systems with Applications
Citation Excerpt: In the latter case, the final result is highly dependent on the initialisation of the ML clustering algorithm (Gan & Ng, 2015). In order to circumvent this problem, some FW methods integrate an evolutionary algorithm for the optimisation process to be able to efficiently explore the solution space and not converge to a local optimum (Gançarski & Blansché, 2008; Kuo & Nguyen, 2019). The second level of the proposed taxonomy discriminates between the way the weights are calculated, i.e., globally or locally.
An LSH-based k-representatives clustering method for large categorical data
2021, Neurocomputing
Citation Excerpt: With these definitions of the dissimilarity measure and modes, the k-modes algorithm can be easily stated in a similar fashion to the k-means algorithm with the replacement of the Euclidean distance with the dissimilarity measure (7) and means with modes. Recently, several extensions of k-modes algorithm have been developed to enhance the clustering performance for categorical data such as in [16,36,37]. It is worth noting here that, by definition, the mode of a cluster is not unique in general and the clustering result strongly depends on the selection of modes during the clustering process.
A Mutual Information Based on Ant Colony Optimization Method to Feature Selection for Categorical Data Clustering
2023, Iranian Journal of Science
R.J. Kuo received the M.S. degree in Industrial and Manufacturing Systems Engineering from Iowa State University, Ames, IA, in 1990 and the Ph.D. degree in Industrial and Management Systems Engineering from the Pennsylvania State University, University Park, PA, in 1994.
Currently, he is the Distinguished Professor in the Department of Industrial Management at National Taiwan University of Science and Technology, Taiwan. He has published almost 100 papers in international journals, such as Information Sciences, Neural Networks, Decision Support Systems, European Journal of Operational Research, and Applied Soft Computing. His research interests include architecture issues of computational intelligence and their applications in data mining, electronic business, production management, supply chain management, and decision support systems.
Thi Phuong Quyen Nguyen received the B.S. degree in industrial systems engineering from the Ho Chi Minh City University of Technology, Vietnam, in 2008, and the M.S. and Ph.D. degrees in industrial management from the National Taiwan University of Science and Technology, Taiwan, in 2013 and 2016, respectively.
She is currently a Postdoctoral Research Fellow with the Department of Industrial Management, National Taiwan University of Science and Technology. Her research interests include data mining, machine learning, and meta-heuristic approaches.