k-mw-modes: An algorithm for clustering categorical matrix-object data
Introduction
In data mining, the input of an algorithm is in most cases a data set X, also called a table or matrix. The data set consists of n objects {x1, x2, …, xn}, and each object is described by m attributes {A1, A2, …, Am} [1]. Most importantly, each object in X corresponds to exactly one feature vector (xi1;xi2;…;xim), i ∈ {1, 2, …, n}. However, in many real applications, a database contains multiple tables, with one-one, one-many or many-many relationships between pairs of tables. Thus, an object usually corresponds to more than one transactional record. A real database application example from http://www.taobao.com is described in Table 1.
Table 1 has two parts. The left half describes the basic information of users, and the right half records the brands that each user visited at different points in time, where the attribute Visited_Times gives the number of times a user visited one brand on the same day. In database terms, we call the left part a master table and the right part a detail table. The two parts of Table 1 therefore form a typical one-many relationship. The data in Table 1 have the following characteristics:
- Correlation: Data in the master table and the detail table may be correlated. Users of different sex or age may have different preferences. For example, the 24-year-old female user in Table 1 visited commodities that are mostly used by female users, such as JOSINY and WETHERM. However, the 40-year-old female user visited commodities used by both men and women, perhaps because she needs to look after her family.
- One-many: Each user in the master table corresponds to more than one record in the detail table. Moreover, the number of brands visited often differs from user to user in Table 1. For example, the user 10944750 has 11 records while the user 8149250 has 4 records.
- Mixed: In most cases, an object is described by categorical and numerical attributes together. For example, in the detail table, Brand_Name is a categorical attribute while Visited_Times is a numerical attribute.
- Evolution: Some attribute values change over time. For example, a user may visit one brand repeatedly this month but not visit it at all next month. In other words, a user's behavior evolves dynamically with time.
From the detail table, we can see clearly that every user visited at least one brand and that a brand may be browsed by many users. Besides, a brand may be visited several times by a user in one day, and it may also be visited many times by a user over several days. If a user visits a brand many times, he or she is probably interested in it. For example, the user 10944750 visited JOSINY in four consecutive months, several times in each month, so we can predict that this user is very fond of JOSINY. In contrast, the same user visited SEMIR only once in this data set, which suggests that the user likes it less than JOSINY. The data representation shown in Table 1 is widespread in banking, insurance, telecommunication, retail, and medical databases. Therefore, it is necessary to develop a method that can discover user groups with different behavior patterns from the detail table instead of the master table, because behavior analysis can help managers obtain more valuable information for decision making.
Clustering is a widely used method for finding different user groups in real applications [2], and the master table tends to be taken as its input. However, the information in the master table cannot sufficiently reflect the behavior characteristics of a user. More importantly, in traditional clustering algorithms, the dissimilarity measure between two objects is based on the value difference between two feature vectors. In the detail table, each user has more than one transactional record. In other words, each user is described by multiple feature vectors. Therefore, classical dissimilarity measures, such as Euclidean distance, Manhattan distance and Hamming distance, cannot be applied to this kind of data directly.
In the detail table, each user has multiple feature vectors, each of which is in most cases described by numerical and categorical attributes together. How to define a dissimilarity measure between two users is a crucial problem, because it directly affects the clustering results. For simplicity, in this paper we only investigate clustering of detail tables in which every record is described by categorical attributes. Unlike the k-means algorithm [4], the k-modes algorithm [3] can cluster categorical data sets, but it still has shortcomings. It can only cluster data sets in which each object consists of a single record. To apply the k-modes algorithm to the problem above, each object must first be compressed into the required form by selecting, for each attribute, the value with the highest frequency. Much information is lost in this compression, so the clustering results become unreliable.
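The information loss caused by this compression can be illustrated with a minimal Python sketch. The function name `compress_to_record`, the second attribute (a product category) and the record values are hypothetical and chosen only for illustration; they are not taken from Table 1.

```python
from collections import Counter

def compress_to_record(records):
    """Collapse a multi-record object into one record by keeping,
    for each attribute, its most frequent value."""
    columns = zip(*records)  # one tuple of values per attribute
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)

# Hypothetical detail records (brand, category) for two users.
user_a = [("JOSINY", "dress"), ("JOSINY", "dress"), ("SEMIR", "coat")]
user_b = [("JOSINY", "dress"), ("JOSINY", "dress"), ("JOSINY", "dress")]

rec_a = compress_to_record(user_a)
rec_b = compress_to_record(user_b)
print(rec_a, rec_b)
```

Both users are reduced to the same single record, although only one of them ever visited SEMIR. A dissimilarity computed on the compressed records would therefore be zero for two users whose behavior differs.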
Without loss of generality, the detail information in Table 1 can be described as follows. Suppose that X = {X1, X2, …, Xn} is a set of n objects described by m attributes {A1, A2, …, Am}, where Xi = (Xi1;Xi2;…;Xim) and $X_{is} = (x_{is}^{1}; x_{is}^{2}; \ldots; x_{is}^{r_i})$.
ri represents the number of records in Xi and $x_{is}^{j}$ denotes the jth value of Xi on As. We call Xi a matrix-object and X a matrix-object data set. Suppose that Vs represents the domain of the attribute As in X and $V_{is}$ denotes the set of values on the attribute As for Xi. Obviously, $V_{is} \subseteq V_s$. In traditional data representation, an object is described by a single feature vector or record, whereas a matrix-object is usually represented by multiple feature vectors or records. Therefore, a matrix-object generalizes the traditional object.
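As a rough illustration of this representation, the following Python sketch stores a matrix-object data set as a mapping from object identifiers to lists of records. The identifiers, the attribute Weekday, the record values and the helper names `values_on` and `domain` are assumptions made for the example.

```python
# Attribute names; Brand_Name appears in Table 1, Weekday is hypothetical.
attributes = ("Brand_Name", "Weekday")

# A matrix-object data set: object id -> list of records (one tuple per record).
X = {
    "u1": [("JOSINY", "Mon"), ("JOSINY", "Tue"), ("SEMIR", "Mon")],
    "u2": [("WETHERM", "Sun")],
}

def values_on(obj, s):
    """Set of values that one matrix-object takes on attribute index s."""
    return {record[s] for record in obj}

def domain(data, s):
    """Domain of attribute s: union of the value sets of all matrix-objects."""
    return set().union(*(values_on(obj, s) for obj in data.values()))

# Number of records per matrix-object (r_i in the text).
r = {name: len(obj) for name, obj in X.items()}
V_brand = domain(X, 0)
print(r, V_brand)
```

By construction, the value set of any single matrix-object on an attribute is a subset of that attribute's domain, and a traditional object is the special case of a matrix-object with exactly one record, such as `u2` here.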
In this paper, we propose a new clustering algorithm, the k-mw-modes algorithm, to cluster categorical matrix-object data. The main contributions are summarized as follows:
- We define a new dissimilarity measure to calculate the distance between two categorical matrix-objects.
- We give a new representation of cluster centers and a new way of updating them to optimize the clustering process.
- We give a heuristic method to choose the cluster center of a set.
- We propose the k-mw-modes clustering algorithm to cluster categorical matrix-object data.
- Experimental results on real data sets show the effectiveness of the k-mw-modes algorithm.
The rest of this paper is organized as follows. In Section 2, we propose the k-mw-modes algorithm. In Section 3, we give a heuristic method to choose the locally optimal multi-weighted-modes for the k-mw-modes algorithm. In Section 4, we show experimental results on five real data sets from different applications. In Section 5, we review some related work. We give conclusions and future work in Section 6.
Section snippets
k-multi-weighted-modes clustering
The k-modes clustering algorithm consists of three components: (1) representation of cluster centroids, (2) allocation of objects into clusters and (3) updates of cluster centroids. In this section, we present the k-mw-modes algorithm, which uses the k-modes clustering process to cluster categorical matrix-object data. For this algorithm, we define a dissimilarity measure to calculate the distance between two matrix-objects, and we give a representation of cluster centers together with a way of updating them.
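For orientation, the three components can be sketched in Python for the classical k-modes setting, in which every object is a single categorical record. This is a minimal sketch of standard k-modes with simple matching dissimilarity, not the k-mw-modes algorithm proposed here; the function names, the caller-supplied initial modes and the toy data are assumptions.

```python
from collections import Counter

def matching_distance(x, mode):
    """Simple matching dissimilarity: number of attributes on which x and mode differ."""
    return sum(a != b for a, b in zip(x, mode))

def update_mode(members):
    """(3) New centroid: the most frequent value of each attribute among the members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def k_modes(objects, initial_modes, max_iter=100):
    modes = list(initial_modes)  # (1) each centroid is itself a categorical record
    labels = None
    for _ in range(max_iter):
        # (2) allocate every object to the cluster with the nearest mode
        new_labels = [
            min(range(len(modes)), key=lambda c: matching_distance(x, modes[c]))
            for x in objects
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for c in range(len(modes)):
            members = [x for x, lab in zip(objects, labels) if lab == c]
            if members:  # keep the previous mode if a cluster becomes empty
                modes[c] = update_mode(members)
    return labels, modes

objects = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "q"), ("b", "y", "r")]
labels, modes = k_modes(objects, initial_modes=[objects[0], objects[3]])
print(labels, modes)
```

Each of the three steps assumes one record per object, which is the restriction that a matrix-object version has to lift by redefining both the distance and the centroid.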
A heuristic method for updating cluster centers
The GAFMWM for finding cluster centers is not efficient if the number of domain values is very large. In this section, we give a heuristic method of updating cluster centers in the k-mw-modes clustering process. For Xi, Xj ∈ X, we have or on the attribute As. Even if , the frequency of the same attribute value may be different in Xi and Xj, because a value may appear more than once in a given matrix-object. The higher the frequency of a value in a given
Experiments on real data
In this section, we run experiments on five real data sets (Microsoft Web data, Market Basket data, Alibaba data, Musk data and Movielens data) to evaluate the effectiveness of the proposed algorithm. We first describe the preprocessing of the five data sets. Then we introduce five evaluation indexes. Finally, we compare the k-mw-modes algorithm with other algorithms and discuss the impact of the parameter ɛ on the clustering performance.
Related work
In real applications, categorical data are widespread. The k-modes algorithm [3] extends the k-means algorithm [4] by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering objective function. These extensions have removed the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to efficiently
Conclusions
In many database applications, the behavioral traits of a customer are carried in a detail table instead of a master table. To find the customer groups with different behavioral traits, a k-mw-modes algorithm was proposed for clustering categorical matrix-object data. In the proposed algorithm, the distance between two matrix-objects was defined and the representation and update ways of cluster centers were developed further. The convergence of the proposed algorithm was proved and the
Acknowledgements
This work was supported by the National Natural Science Foundation of China (under grants 61573229, 61473194, 61432011 and U1435212), the Natural Science Foundation of Shanxi Province (under grant 2015011048), the Shanxi Scholarship Council of China (under grant 2016-003) and the National Key Basic Research and Development Program of China (973) (under grant 2013CB329404).
References (22)
- et al. Trend analysis of categorical data streams with a concept change method. Inform. Sci. (2014)
- et al. A dissimilarity measure for the k-modes clustering algorithm. Knowl.-Based Syst. (2012)
- et al. A weighting k-modes algorithm for subspace clustering of categorical data. Neurocomputing (2013)
- et al. Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recogn. Lett. (2004)
- et al. Data Mining: Concepts and Techniques (2011)
- et al. Clustering, vol. 10 (2008)
- Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowl. Discov. (1998)
- Some methods for classification and analysis of multivariate observations
- et al. Introduction to Multidimensional Scaling: Theory, Methods, and Applications (1981)
- et al. UCI Machine Learning Repository (2014)
- The k-means type algorithms versus imbalanced data distributions. IEEE Trans. Fuzzy Syst.
Cited by (10)
- Weighted matrix-object data clustering guided by matrix-object distributions. 2022, Engineering Applications of Artificial Intelligence
- An outlier detection algorithm for categorical matrix-object data. 2021, Applied Soft Computing
  Citation excerpt: Using the concept of clustering to understand is objects in the same cluster are closer than objects in different clusters. Therefore, we can use the distance formula [17] to consider the coupling degree. The distance between two matrix-objects is defined as follows.
- k-Mnv-Rep: A k-type clustering algorithm for matrix-object data. 2021, Information Sciences
  Citation excerpt: In real world, it is common that numeric and categorical attributes are mixed in many data sets. In this section, with the k-prototypes algorithm as a reference, we extend the k-Mnv-Rep to domains with mixed numeric and categorical values by combining the k-mw-modes algorithm [5]. The new algorithm is also described from two processes: dissimilarity measure and updating cluster centers.
- Combining attribute content and label information for categorical data ensemble clustering. 2020, Applied Mathematics and Computation
- Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering. 2019, Pattern Recognition
  Citation excerpt: Since categorical data are ubiquitous in the real world, clustering data with categorical attributes has a broad range of practical applications. A number of clustering algorithms for categorical datasets have been studied and practiced in literature, such as the k-means-based algorithm [40], the k-modes algorithms [24], the ROCK algorithm [17], the CACTUS algorithm [15], the RST-based algorithms [8,38], the k-populations algorithm [31], the soft feature-selection scheme [11], the Clustering ensemble selection algorithm [51], the k-multi-weighted-modes algorithm [9], and the clustering methods based on k-nearest-neighbor graph [36,39]. Among them, the k-modes categorical clustering algorithm is the most well-known algorithm that can cluster large-sized categorical datasets into a given number of clusters represented by the most-frequent attribute values (i.e., the modes) in a fast manner.