Improvements to the relational fuzzy c-means clustering algorithm
Introduction
Consider a set of $n$ objects $O = \{o_1, \ldots, o_n\}$, where the goal is to group them into $c$ natural groups. Objects can be described by feature vectors $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$, such that $x_i$ is an attribute vector of dimension $p$ representing object $o_i$. Alternatively, objects can be represented using a pairwise relationship. The relationships are stored in a relational matrix $R = [r_{ij}]$, where $r_{ij}$ measures the relationship between $o_i$ and $o_j$. If $R$ is a dissimilarity relation, denoted by $D = [d_{ij}]$, then it must satisfy the following three conditions: $d_{ii} = 0$ for all $i$ (1a); $d_{ij} \ge 0$ for all $i, j$ (1b); and $d_{ij} = d_{ji}$ for all $i, j$ (1c), where condition (1a) is self-dissimilarity, (1b) is non-negativity and (1c) is symmetry. A well-known relational clustering algorithm that is suitable for clustering objects described by $D$ is the relational fuzzy c-means (RFCM) proposed in [1] (Algorithm 1). RFCM, the relational dual of the FCM algorithm, takes an input dissimilarity matrix $D$ and outputs a fuzzy partition matrix $U \in \mathbb{R}^{c \times n}$, where $u_{ik} \in [0, 1]$ is the membership of object $o_k$ in cluster $i$ and each column of $U$ sums to 1.

Algorithm 1. Relational fuzzy c-means (RFCM) [1].
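To make the iteration concrete, RFCM can be sketched as follows. This is a minimal NumPy sketch under our own naming and defaults, not the implementation from [1]; it uses the standard RFCM updates (prototype weight vectors computed from memberships, relational distances, then the membership update):

```python
import numpy as np

def rfcm(D, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Minimal RFCM sketch. D is an (n, n) matrix of squared
    dissimilarities; returns a fuzzy partition matrix U of shape (c, n)."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                              # columns of U sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = Um / Um.sum(axis=1, keepdims=True)      # prototype weight vectors v_i in R^n
        DV = V @ D                                  # row i holds (D v_i)^T (D is symmetric)
        # relational distances d_ik = (D v_i)_k - (1/2) v_i^T D v_i
        d = DV - 0.5 * np.sum(V * DV, axis=1, keepdims=True)
        # d can go negative when D is non-Euclidean -- the failure mode
        # discussed in the text; clamp here just to keep the sketch running
        d = np.maximum(d, np.finfo(float).eps)
        U_new = d ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)                  # fuzzy membership update
        if np.abs(U_new - U).max() < tol:
            return U_new
        U = U_new
    return U
```

On a Euclidean D this produces the same partition FCM would produce from the corresponding feature vectors, which is the duality discussed below.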
The duality relationship between RFCM and FCM is based on the squared Euclidean distance or 2-norm that defines the dissimilarity $d_{ij}$ between two feature vectors $x_i$ and $x_j$ describing $o_i$ and $o_j$, and the dissimilarity between the cluster center $v_i$ and $o_j$. In other words, RFCM assumes that $d_{ij} = \|x_i - x_j\|^2$.
The relation $D$ is Euclidean if there exist feature vectors $\{x_1, \ldots, x_n\} \subset \mathbb{R}^q$ with an embedding dimension $q \le n-1$, such that $d_{ij} = \|x_i - x_j\|^2$ for all $i, j$. When D is Euclidean, it has a realization in some Euclidean space. In this case, RFCM and FCM will produce the same partition from the relational and feature-vector representations of the data. If D is not Euclidean, RFCM will still find clusters in any D whose entries satisfy (1) as long as it can execute, but in this case it is possible for RFCM to experience an execution failure. This happens when the relational distances between prototypes and objects in Eq. (3) become negative for some $i$ and $k$ (Algorithm 1, line 6). Another important observation about RFCM is that it expects squared dissimilarities D. If the dissimilarities are not squared, meaning that we have $\hat{D}$ instead of D such that $d_{ij} = \hat{d}_{ij}^{\,2}$, then the dissimilarities must be squared before clustering with RFCM, so that D is the Hadamard product $D = \hat{D} \circ \hat{D}$. Throughout this paper D is assumed to contain squared dissimilarities.
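The squaring step can be sketched in a few lines of NumPy (the 2-D points are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D points, for illustration only
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Unsquared dissimilarities D_hat (plain Euclidean distances)
D_hat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# RFCM expects squared dissimilarities: the Hadamard product of D_hat with itself
D = D_hat * D_hat

# Conditions (1a)-(1c): self-dissimilarity, non-negativity, symmetry
assert np.allclose(np.diag(D), 0.0)
assert (D >= 0.0).all()
assert np.allclose(D, D.T)
```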
Non-Euclidean Relational Fuzzy c-Means (NERFCM) repairs RFCM "on the fly" with a self-healing property that automatically adjusts the relational distances and the dissimilarities in D in case of failure [2]. The self-healing property is based on the β-spread transformation, which works by adding a positive constant β to the off-diagonal elements of D. In fact, there exists a $\beta_0$ such that the β-spread transformed matrix $D_\beta$ is Euclidean for all $\beta \ge \beta_0$. The parameter β controls the amount of spreading and must be as small as possible to minimize unnecessary dilation that distorts the original D, which in turn may result in the loss of cluster information. The exact value of $\beta_0$ is the largest positive eigenvalue of the matrix $PDP$, where $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ and I is the n×n identity matrix. Eigenvalue computation is avoided by the self-healing module, which is invoked during execution only when needed. When activated, this module adjusts the current D by adding a minimal β-spread to all of its off-diagonal elements.
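For illustration, the β-spread transformation can be sketched directly via the eigenvalue characterization just stated (NERFCM itself deliberately avoids this eigenvalue computation; the function name and interface here are our own):

```python
import numpy as np

def beta_spread(D):
    """Sketch of the beta-spread transform: add beta0 to every off-diagonal
    entry of D, with beta0 the largest positive eigenvalue of P D P."""
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    beta0 = max(np.linalg.eigvalsh(P @ D @ P).max(), 0.0)
    D_beta = D + beta0 * (np.ones((n, n)) - np.eye(n))  # spread off-diagonals only
    return D_beta, beta0
```

If D is already Euclidean, PDP has no positive eigenvalue, so beta0 = 0 and D is returned unchanged.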
An alternative to using NERFCM is to transform the matrix D by a mapping that converts it to Euclidean form (we call this operation "Euclideanizing D"), and then to run RFCM on the Euclideanized matrix. This approach guarantees that RFCM will not fail, since the transformed matrix is already Euclidean. There are at least five ways to Euclideanize D, including the β-spread transformation. In addition to the β-spread transformation, this paper will study the other four Euclideanization approaches indicated under option 1 in Fig. 1. As a result of this study, we will append an "i" (short for "improved") to RFCM, but not to NERFCM, which is NOT altered by these results. We hope to write a companion paper to this one that discusses improvements to NERFCM, which would then become iNERFCM, but attempts to find an alternative to the current "self-healing" method of NERFCM described in [2] have so far met stiff resistance.
Section snippets
Euclidean distance matrices (EDM) and the iRFCM algorithm
Given a dissimilarity matrix D, it is known that D is Euclidean if and only if
$$W = -\tfrac{1}{2}\,PDP \succeq 0, \qquad (8)$$
where P is the centering matrix defined as
$$P = I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{T}, \qquad (9)$$
I is the identity matrix and $D^{0.5}$ is defined as
$$D^{0.5} = \begin{cases} \left[\sqrt{d_{ij}}\,\right] & \text{if the given dissimilarities are squared,}\\ \left[\,d_{ij}\,\right] & \text{otherwise.} \end{cases} \qquad (10)$$
In (10) and below, D is the Hadamard square of $D^{0.5}$, that is, $D = D^{0.5} \circ D^{0.5}$. The trick in using (8) is knowing if the dissimilarities are squared as in D or not squared as in $D^{0.5}$, which determines which case of (10) to use. This is a
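Assuming the classical EDM characterization (a matrix of squared dissimilarities is Euclidean if and only if $-\frac{1}{2}PDP$ is positive semidefinite), a minimal Euclideanity test can be sketched as:

```python
import numpy as np

def is_euclidean(D, tol=1e-9):
    """True iff the squared dissimilarities in D have a realization in
    some Euclidean space: W = -(1/2) P D P must be positive semidefinite."""
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    W = -0.5 * P @ D @ P
    return bool(np.linalg.eigvalsh(W).min() >= -tol)
```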
Experimental results
We have three types of ground truth partitions UGT for the examples in this section. UGT per Table 2 is assumed for Example 1, adopted from another study for Example 2, and supplied as physical labels for the three Iris subsets in Example 3. Since the ground truth itself is on somewhat shaky ground, we will use just one external cluster validity index, viz., the Hubert and Arabie Adjusted Rand Index (ARI), which is ably discussed in [13]. We denote this index as ARI. This ARI (there are
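For reference, the Hubert-Arabie ARI can be computed from the contingency table of the two partitions; the following is our own minimal sketch, not code from [13]:

```python
import numpy as np

def _comb2(x):
    # number of unordered pairs, C(x, 2)
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie Adjusted Rand Index from the contingency table."""
    _, ci = np.unique(labels_true, return_inverse=True)
    _, cj = np.unique(labels_pred, return_inverse=True)
    C = np.zeros((ci.max() + 1, cj.max() + 1))
    np.add.at(C, (ci, cj), 1)                 # contingency table
    sum_ij = _comb2(C).sum()                  # pairs together in both partitions
    a = _comb2(C.sum(axis=1)).sum()           # pairs together in the ground truth
    b = _comb2(C.sum(axis=0)).sum()           # pairs together in the candidate
    expected = a * b / _comb2(len(labels_true))
    max_index = 0.5 * (a + b)
    return (sum_ij - expected) / (max_index - expected)
```

The index is 1 for identical partitions (up to relabeling) and has expected value 0 for random labelings.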
Conclusion and discussion
RFCM is a popular algorithm for (fuzzily) clustering objects described by a dissimilarity data matrix D. But since RFCM is the relational dual of FCM, execution of the algorithm is guaranteed only when the dissimilarities in D have a Euclidean representation with an embedding dimension $q \le n-1$. If D is not Euclidean, then the duality relation will be violated and, most importantly, the relational distances can become negative. There are two options to circumvent this problem. Option 2 in Fig. 1 advocates
References (23)
- R.J. Hathaway, J.W. Davenport, J.C. Bezdek, Relational duals of the c-means clustering algorithms, Pattern Recognit. (1989)
- R.J. Hathaway, J.C. Bezdek, NERF c-means: non-Euclidean relational fuzzy clustering, Pattern Recognit. (1994)
- et al., Multidimensional Scaling (2000)
- et al., Multivariate Analysis, Probability and Mathematical Statistics (1979)
- et al., On a general transformation making a dissimilarity matrix Euclidean, J. Classif. (2007)
- et al., Additive similarity trees, Psychometrika (1977)
- Shortest connection networks and some generalizations, Bell Syst. Tech. J. (1957)
- The relation between hierarchical and Euclidean models for psychological distances, Psychometrika (1972)
- et al., On the history of the minimum spanning tree problem, IEEE Ann. Hist. Comput. (1985)
- et al., High performance solution of the complex symmetric eigenproblem, Numer. Algorithms (1998)
- Convex Optimization & Euclidean Distance Geometry
Mohammed A. Khalilia received a Ph.D. in computer science (2014) from the University of Missouri. His research interests include pattern recognition, computational intelligence and natural language processing.
James Bezdek has a Ph.D., Applied Mathematics, Cornell, 1973; past president - NAFIPS, IFSA and IEEE CIS; founding editor – Int’l. Jo. Approximate Reasoning, IEEE Transactions on Fuzzy Systems; Life fellow – IEEE and IFSA; recipient – IEEE 3rd Millennium, IEEE CIS Fuzzy Systems Pioneer, IEEE Frank Rosenblatt TFA, IPMU Kempe de Feret Medal.
Mihail Popescu is currently an Associate Professor with the Department of Health Management and Informatics, University of Missouri. His research interests include eldercare technologies, fuzzy logic, ontologies and pattern recognition.
James M. Keller holds the University of Missouri Curators Professorship in the Electrical and Computer Engineering and Computer Science Departments on the Columbia campus. He is also the R. L. Tatum Professor in the College of Engineering. His research interests include computational intelligence, computer vision, pattern recognition, and information fusion.