k-mw-modes: An algorithm for clustering categorical matrix-object data

doi:10.1016/j.asoc.2017.04.019

Applied Soft Computing

Volume 57, August 2017, Pages 605-614

https://doi.org/10.1016/j.asoc.2017.04.019

Highlights

• We propose a k-multi-weighted-modes (abbr. k-mw-modes) algorithm for clustering categorical matrix-object data; the k-modes algorithm is a special case of it.
• We give a heuristic method for choosing the locally optimal multi-weighted-modes in each iteration of the k-mw-modes algorithm; the update process of the k-modes algorithm is a special case of it.
• Experimental results on five real data sets from different applications show the effectiveness of the k-mw-modes algorithm.

Abstract

In data mining, the input of most algorithms is a set of n objects, each described by a single feature vector. However, in many real database applications, an object is described by more than one feature vector. In this paper, we call an object described by more than one feature vector a matrix-object, and a data set consisting of matrix-objects a matrix-object data set. We propose a k-multi-weighted-modes (abbr. k-mw-modes) algorithm for clustering categorical matrix-object data. In this algorithm, we define the distance between two categorical matrix-objects and propose a multi-weighted-modes representation of cluster prototypes. We give a heuristic method for choosing the locally optimal multi-weighted-modes in each iteration of the k-mw-modes algorithm. We validate the effectiveness and benefits of the k-mw-modes algorithm on five real data sets from different applications.

Introduction

In data mining, the input of an algorithm is in most cases a data set X, also called a table or matrix. The data set consists of n objects {x₁, x₂, …, x_n}, and each object is described by m attributes {A₁, A₂, …, A_m} [1]. Most importantly, each object in X corresponds to exactly one feature vector (x_i1; x_i2; …; x_im), i ∈ {1, 2, …, n}. However, in many real applications, a database contains multiple tables, and two tables may be linked by a one-to-one, one-to-many, or many-to-many relationship. Thus, an object usually corresponds to more than one transactional record. A real database application example from http://www.taobao.com is described in Table 1.

Table 1 has two parts. The left half describes the basic information of users, and the right half records which brands each user visited at different time points, where the attribute Visited_Times is the number of times a user visited one brand on the same day. In database terms, we call the left part the master table and the right part the detail table. The two parts of Table 1 therefore form a typical one-to-many relationship. The data in Table 1 have the following characteristics:

• Correlation: Data from the master table and the detail table may be correlated. Users of different sex or age may have different preferences. For example, the 24-year-old female user in Table 1 visited commodities that are typically used by female users, such as JOSINY and WETHERM. However, the 40-year-old female user visited commodities used by both men and women, perhaps because she needs to take care of her family.
• One-to-many: Each user in the master table corresponds to more than one record in the detail table. Moreover, the number of brands visited often differs from user to user in Table 1. For example, user 10944750 has 11 records while user 8149250 has 4 records.
• Mixed: In most cases, an object is described by categorical and numerical attributes together. For example, in the detail table, Brand_Name is a categorical attribute while Visited_Times is a numerical attribute.
• Evolution: Some attribute values change as time goes on. For example, a user may visit one brand repeatedly in this month but not visit it at all in the next month. In other words, a user's behavior evolves dynamically over time.
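
The one-to-many relationship between the master table and the detail table can be sketched in Python as follows. This is a minimal illustration only: the rows, the user IDs `u1` and `u2`, the brand values, and the field order `(user_id, brand, visited_times)` are hypothetical and are not taken from Table 1.

```python
from collections import defaultdict

# Hypothetical detail-table rows: (user_id, brand, visited_times).
# These are toy values, not rows from Table 1.
detail_rows = [
    ("u1", "brand_a", 3),
    ("u1", "brand_b", 1),
    ("u1", "brand_a", 2),
    ("u2", "brand_c", 1),
]


def group_by_user(rows):
    """Collect all detail records of each user into one list (one-to-many)."""
    grouped = defaultdict(list)
    for user_id, brand, visited_times in rows:
        grouped[user_id].append((brand, visited_times))
    return dict(grouped)


# Each user now owns several records instead of a single feature vector.
matrix_objects = group_by_user(detail_rows)
record_counts = {user: len(records) for user, records in matrix_objects.items()}
```

After grouping, different users generally own different numbers of records, which is the property that prevents the detail table from being treated as an ordinary object-by-attribute matrix.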

The detail table shows that every user visited at least one brand and that a brand may be browsed by many users. A brand may be visited several times by a user in one day, and it may also be visited many times over several days. If a user visits a brand many times, he or she is probably interested in it. For example, user 10944750 visited JOSINY in four consecutive months, several times in each month, so we can predict that this user is very fond of JOSINY. In contrast, the same user visited SEMIR only once in this data set, which suggests less interest in SEMIR than in JOSINY. The data representation shown in Table 1 is widespread in banking, insurance, telecommunication, retail, and medical databases. Therefore, it is necessary to develop a method that discovers user groups with different behavior patterns from the detail table instead of the master table, because behavior analysis can help managers obtain more valuable information for decision making.

Clustering is a widely used method for finding different user groups in real applications [2], and the master table tends to be taken as its input. However, the information in the master table cannot adequately reflect the behavior characteristics of a user. More importantly, in traditional clustering algorithms, the dissimilarity measure between two objects is based on the value difference of two feature vectors. In the detail table, each user has more than one transactional record. In other words, each user is described by multiple feature vectors. Therefore, classical dissimilarity measures such as the Euclidean, Manhattan, and Hamming distances cannot be applied to this kind of data directly.

In the detail table, each user has multiple feature vectors, each of which is in most cases described by numerical and categorical attributes together. How to define a dissimilarity measure between two users is a crucial problem, because it directly affects the clustering results. For simplicity, in this paper we only investigate clustering for detail tables in which every record is described by categorical attributes. Unlike the k-means algorithm [4], the k-modes algorithm [3] can cluster categorical data sets, but it still has shortcomings: it can only cluster data sets in which each object contains exactly one record. To apply the k-modes algorithm to the problem above, the data must first be compressed into the form the algorithm requires by keeping, for each attribute, only the value with the highest frequency. A great deal of information is lost in this compression, so the clustering results are unreliable.
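
The information loss caused by this compression can be illustrated with a short Python sketch. The helper name `compress_to_single_record` and the two toy users are assumptions made for illustration; the sketch only shows the "keep the most frequent value" step described above, not any part of the k-mw-modes algorithm.

```python
from collections import Counter


def compress_to_single_record(matrix_object):
    """Keep only the most frequent value of each attribute.

    `matrix_object` is a non-empty list of records; each record is a tuple
    of categorical values, one per attribute.
    """
    columns = zip(*matrix_object)
    return tuple(Counter(column).most_common(1)[0][0] for column in columns)


# Two hypothetical users described by one categorical attribute.
user_a = [("x",), ("x",), ("y",), ("z",)]  # visits three different values
user_b = [("x",), ("x",), ("x",), ("x",)]  # visits a single value only

compressed_a = compress_to_single_record(user_a)
compressed_b = compress_to_single_record(user_b)
```

Both users collapse to the same single record even though their original records differ, so any algorithm that sees only the compressed form cannot tell them apart.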

Without loss of generality, the detail information in Table 1 can be described as follows. Suppose that X = {X₁, X₂, …, X_n} is a set of n objects described by m attributes {A₁, A₂, …, A_m}, where X_i = (X_i1; X_i2; …; X_im) and $X_{is} = [v_{i1s}, v_{i2s}, \dots, v_{i r_{i} s}]^{'}$.

Here r_i is the number of records in X_i and $v_{ijs}$ denotes the jth value of X_i on A_s. We call X_i a matrix-object and X a matrix-object data set. Let V^s denote the domain of the attribute A_s in X, and let $V_{X_{i}}^{A_{s}}$ denote the set of values that X_i takes on the attribute A_s. Obviously, $\bigcup_{i=1}^{n} V_{X_{i}}^{A_{s}} = V^{s}$. In the traditional data representation, an object is described by a single feature vector (record), whereas a matrix-object is usually represented by multiple feature vectors (records). Therefore, a matrix-object is a generalization of a traditional object.
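
The notation above can be made concrete with a small Python sketch. The representation (a matrix-object as a list of records, each record a tuple with one value per attribute), the function names `value_set` and `domain`, and the toy data are all assumptions chosen for illustration.

```python
def value_set(matrix_object, s):
    """Return the set of distinct values of attribute s in one matrix-object."""
    return {record[s] for record in matrix_object}


def domain(data_set, s):
    """Return V^s as the union of the value sets over all matrix-objects."""
    result = set()
    for matrix_object in data_set:
        result |= value_set(matrix_object, s)
    return result


# Hypothetical data set with n = 2 matrix-objects and m = 2 attributes.
X = [
    [("a", "p"), ("b", "p"), ("a", "q")],  # X_1 with r_1 = 3 records
    [("c", "q")],                          # X_2 with r_2 = 1 record
]

# r_i is simply the number of records in each matrix-object.
r = [len(X_i) for X_i in X]
```

A traditional data set is the special case in which every r_i equals 1, which matches the statement that a matrix-object generalizes a traditional object.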

In this paper, we propose a new clustering algorithm, the k-mw-modes algorithm, to cluster categorical matrix-object data. The main contributions are summarized as follows:

• We define a new dissimilarity measure to calculate the distance between two categorical matrix-objects.
• We give a new representation of cluster centers and a new method for updating them to optimize the clustering process.
• We give a heuristic method for choosing the cluster center of a set.
• We propose the k-mw-modes clustering algorithm to cluster categorical matrix-object data.
• Experimental results on real data sets show the effectiveness of the k-mw-modes algorithm.

The rest of this paper is organized as follows. In Section 2, we propose the k-mw-modes algorithm. In Section 3, we give a heuristic method for choosing the locally optimal multi-weighted-modes in the k-mw-modes algorithm. In Section 4, we show experimental results on five real data sets from different applications. In Section 5, we review related work. We give conclusions and future work in Section 6.

Section snippets

k-multi-weighted-modes clustering

The k-modes clustering algorithm consists of three components: (1) representation of cluster centroids, (2) allocation of objects into clusters and (3) updates of cluster centroids. In this section, we present the k-mw-modes algorithm, which uses the k-modes clustering process to cluster categorical matrix-object data. In this algorithm, we define a dissimilarity measure to calculate the distance between two matrix-objects and give a representation of cluster centers together with a method for updating them.
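
For orientation, the three components can be sketched for the plain k-modes algorithm, in which every object is a single categorical record. This is a minimal sketch of standard k-modes with the simple matching dissimilarity, not of k-mw-modes: the matrix-object distance and the multi-weighted-modes update are not reproduced in this snippet. The function names, the caller-supplied `initial_modes`, and the `max_iter` default are assumptions.

```python
from collections import Counter


def simple_matching(x, y):
    """Number of attributes on which two categorical records differ."""
    return sum(a != b for a, b in zip(x, y))


def update_mode(records):
    """Attribute-wise most frequent value of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=100):
    """Plain k-modes on single-record objects (assumed sketch)."""
    modes = list(initial_modes)  # (1) centroids are represented by modes
    labels = None
    for _ in range(max_iter):
        # (2) Allocation: assign every record to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda c: simple_matching(x, modes[c]))
            for x in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # (3) Update: recompute the mode of each non-empty cluster.
        for c in range(len(modes)):
            members = [x for x, label in zip(records, labels) if label == c]
            if members:
                modes[c] = update_mode(members)
    return labels, modes


# Hypothetical toy records with two categorical attributes.
records = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
labels, modes = k_modes(records, initial_modes=[records[0], records[2]])
```

Passing the initial modes explicitly keeps the sketch deterministic; in practice they are usually chosen at random or by a dedicated initialization method.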

A heuristic method for updating cluster centers

The GAFMWM for finding cluster centers is not efficient if the number of domain values is very large. In this section, we give a heuristic method for updating cluster centers in the k-mw-modes clustering process. For X_i, X_j ∈ X, we have either $V_{X_{i}}^{A_{s}} = V_{X_{j}}^{A_{s}}$ or $V_{X_{i}}^{A_{s}} \neq V_{X_{j}}^{A_{s}}$ on the attribute A_s. Even if $V_{X_{i}}^{A_{s}} = V_{X_{j}}^{A_{s}}$, the frequency of the same attribute value may differ between X_i and X_j, because a value may appear more than once in a given matrix-object. The higher the frequency of a value in a given
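
The observation that two matrix-objects can share the same value set while differing in value frequencies can be checked with a short Python sketch. It only illustrates that observation and is not the heuristic itself, whose details are not included in this snippet; the function name `value_frequencies` and the toy matrix-objects are assumptions.

```python
from collections import Counter


def value_frequencies(matrix_object, s):
    """Relative frequency of each value of attribute s in one matrix-object."""
    counts = Counter(record[s] for record in matrix_object)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Two hypothetical matrix-objects with the same value set {"a", "b"}.
X_i = [("a",), ("a",), ("a",), ("b",)]
X_j = [("a",), ("b",), ("b",), ("b",)]

freq_i = value_frequencies(X_i, 0)
freq_j = value_frequencies(X_j, 0)
```

Here the value sets coincide, yet "a" dominates X_i while "b" dominates X_j, which is why frequencies carry information that the value sets alone do not.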

Experiments on real data

In this section, we conduct experiments on five real data sets (Microsoft Web data, Market Basket data, Alibaba data, Musk data and Movielens data) to evaluate the effectiveness of the proposed algorithm. We first describe the preprocessing of the five data sets. Then we introduce five evaluation indexes. Finally, we compare the k-mw-modes algorithm with other algorithms and discuss the impact of the parameter ɛ on the clustering performance.

Related work

In real applications, categorical data are widespread. The k-modes algorithm [3] extends the k-means algorithm [4] by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering objective function. These extensions have removed the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to efficiently

Conclusions

In many database applications, the behavioral traits of a customer are carried in a detail table instead of a master table. To find customer groups with different behavioral traits, a k-mw-modes algorithm was proposed for clustering categorical matrix-object data. In the proposed algorithm, the distance between two matrix-objects was defined, and the representation of cluster centers and the method for updating them were developed further. The convergence of the proposed algorithm was proved and the

Acknowledgements

This work was supported by the National Natural Science Foundation of China (under grants 61573229, 61473194, 61432011 and U1435212), the Natural Science Foundation of Shanxi Province (under grant 2015011048), the Shanxi Scholarship Council of China (under grant 2016-003) and the National Key Basic Research and Development Program of China (973) (under grant 2013CB329404).

References (22)

F. Cao et al., Trend analysis of categorical data streams with a concept change method, Inform. Sci. (2014)
F. Cao et al., A dissimilarity measure for the k-modes clustering algorithm, Knowl.-Based Syst. (2012)
F. Cao et al., A weighting k-modes algorithm for subspace clustering of categorical data, Neurocomputing (2013)
D.-W. Kim et al., Fuzzy clustering of categorical data using fuzzy centroids, Pattern Recogn. Lett. (2004)
J. Han et al., Data Mining: Concepts and Techniques (2011)
R. Xu et al., Clustering, vol. 10 (2008)
Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining Knowl. Discov. (1998)
J. MacQueen, Some methods for classification and analysis of multivariate observations
S. Schiffman et al., Introduction to Multidimensional Scaling: Theory, Methods, and Applications (1981)
K. Bache et al., UCI Machine Learning Repository (2014)
J. Liang et al., The k-means type algorithms versus imbalanced data distributions, IEEE Trans. Fuzzy Syst. (2012)

Cited by (10)

Weighted matrix-object data clustering guided by matrix-object distributions
2022, Engineering Applications of Artificial Intelligence
In data mining, the input of most algorithms is a data set in which each example is a feature vector. However, in many real applications an example usually contains multiple feature vectors and its observed classification is the responsibility of all feature vectors. We call this example kind matrix-object. Some existing clustering algorithms for matrix-object data fail to consider contributions of attributes to clusters, which may degrade clustering solutions due to less discriminative attributes. Some existing clustering algorithms for the data in which each example is a vector consider the contributions but encounter difficulties in handling matrix-object data. For matrix-object data, ordered and cross matrix-object distributions may exist in a cluster and cause different ways of measuring qualities of clusters. In this paper, we propose a weighted matrix-object data clustering algorithm guided by matrix-object distributions. We define cluster and matrix-object compactness respectively for the two distributions to measure qualities of clusters. The bigger the compactness is, the higher the quality is. So the proposed algorithm utilizes the compactness to assign a weight to each attribute for each cluster and maximizes weighted cluster and matrix-object compactness to find the optimal weight and the final clustering partition. Furthermore, a regular term about weight is added to the objective function to make more higher discriminative attributes participate in the optimization. Experimental results on real data have shown the effectiveness of the proposed algorithm. Compared with previous clustering algorithms, the proposed algorithm improves the clustering performance and enhances the interpretability of clustering results.
An outlier detection algorithm for categorical matrix-object data
2021, Applied Soft Computing
Citation Excerpt :
Using the concept of clustering to understand is objects in the same cluster are closer than objects in different clusters. Therefore, we can use the distance formula [17] to consider the coupling degree. The distance between two matrix-objects is defined as follows.
Outlier detection is a significant problem in data mining and machine learning which aims to discover objects in a data set that do not conform to well-defined notions of expected behavior. Generally, the input of the existing outlier detection algorithms is a collection of $n$ objects and each object is described by a feature vector. However, in many real world applications, an object is not only described by one feature vector, but a number of feature vectors. In this paper, we define an object described by more than one feature vector as a matrix-object. Inspired by the concepts of cohesion and coupling in software engineering, we define the coupling of a matrix-object based on the average distance between it and other matrix-objects, and define its cohesion based on information entropy and mutual information. On this basis, the outlier factor of a matrix-object is given, and an outlier detection algorithm for categorical matrix-object data is proposed. The experimental results on real and synthetic data sets have shown that the proposed outlier detection algorithm can effectively detect outliers for the matrix-object data set compared with other algorithms.
k-Mnv-Rep: A k-type clustering algorithm for matrix-object data
2021, Information Sciences
Citation Excerpt :
In real world, it is common that numeric and categorical attributes are mixed in many data sets. In this section, with the k-prototypes algorithm as a reference, we extend the k-Mnv-Rep to domains with mixed numeric and categorical values by combining the k-mw-modes algorithm [5]. The new algorithm is also described from two processes: dissimilarity measure and updating cluster centers.
In matrix-object data, an object (or a sample) is described by more than one feature vector (record) and all of those feature vectors are responsible for the observed classification of the object. A task for matrix-object data is to cluster it into a set of groups by analyzing and utilizing the information of feature vectors. Matrix-object data are widespread in many real applications. Previous studies typically address data sets that an object is generally represented by a feature vector, which may be violated in many real-world tasks. In this paper, we propose a k-multi-numeric-values-representatives (abbr. k-Mnv-Rep) algorithm to cluster numeric matrix-object data. In this algorithm, a new dissimilarity measure between two numeric matrix-objects is defined and a new heuristic method of updating cluster centers is given. Furthermore, we also propose a k-multi-values-representatives (abbr. k-Mv-Rep) algorithm to cluster hybrid matrix-object data. The two proposed algorithms break the limitations of the previous studies, and can be applied to address matrix-object data sets that exist widely in many real-world tasks. The benefits and effectiveness of the two algorithms are shown by some experiments on real and synthetic data sets.
Combining attribute content and label information for categorical data ensemble clustering
2020, Applied Mathematics and Computation
Ensemble clustering has been attracting increasing attention in recent years, because it is able to combine multiple base clusterings (ensemble members) into a more robust clustering. It mainly consists of two parts, generating multiple ensemble members and finding a final partition. The construction of the information matrix plays an important role for finding a final partition. In general categorical data ensemble clustering framework, most existing information matrices are constructed only relying on label information of ensemble members without considering original information of data sets. To solve this problem, a new ensemble clustering framework for categorical data is proposed, in which the information matrix considers label information and original data information together, and is instantiated into the ALM matrix in this paper. The ALM matrix takes account of not only the distribution of attribute content in each ensemble member, but also the relationship among ensemble members based on the distribution. To simplicity, the k-means technique is used to cluster the ALM matrix and form a new ensemble clustering algorithm. The experimental results have shown the benefits of the ALM matrix by comparing the proposed algorithm with other ensemble clustering algorithms.
Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering
2019, Pattern Recognition
Citation Excerpt :
Since categorical data are ubiquitous in the real world, clustering data with categorical attributes has a broad range of practical applications. A number of clustering algorithms for categorical datasets have been studied and practiced in literature, such as the k-means-based algorithm [40], the k-modes algorithms [24], the ROCK algorithm [17], the CACTUS algorithm [15], the RST-based algorithms [8,38], the k-populations algorithm [31], the soft feature-selection scheme [11], the Clustering ensemble selection algorithm [51], the k-multi-weighted-modes algorithm [9], and the clustering methods based on k-nearest-neighbor graph [36,39]. Among them, the k-modes categorical clustering algorithm is the most well-known algorithm that can cluster large-sized categorical datasets into a given number of clusters represented by the most-frequent attribute values (i.e., the modes) in a fast manner.
The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.
Outlier Detection Based on Maximal-Entropy Random Walk for Numeric Matrix-Object Datasets
2022, SSRN
