
Pattern Recognition

Volume 47, Issue 12, December 2014, Pages 3920-3930

Improvements to the relational fuzzy c-means clustering algorithm

https://doi.org/10.1016/j.patcog.2014.06.021

Highlights

  • An improved relational fuzzy c-means (iRFCM) algorithm for clustering relational data D is proposed.

  • The matrix D is transformed to a Euclidean matrix D˜ using different transformations.

  • Quality of D˜ is judged by the ability of RFCM to discover the apparent clusters.

  • The subdominant ultrametric transformation produces much better partitions of D˜.

  • β-spread minimizes the distortion between D and D˜, but produces the worst clusterings.

Abstract

Relational fuzzy c-means (RFCM) is an algorithm for clustering objects represented by pairwise dissimilarity values in a dissimilarity data matrix D. RFCM is dual to the fuzzy c-means (FCM) object-data algorithm when D is a Euclidean matrix. When D is not Euclidean, RFCM can fail to execute if it encounters negative relational distances. To overcome this problem, we can Euclideanize the relation D prior to clustering. There are different ways to Euclideanize D, such as the β-spread transformation. In this article we compare five methods for Euclideanizing D to D˜. The quality of D˜ for our purpose is judged by the ability of RFCM to discover the apparent cluster structure of the objects underlying the data matrix D. The subdominant ultrametric transformation is a clear winner, producing much better partitions of D˜ than the other four methods. This leads to a new algorithm which we call the improved RFCM (iRFCM).

Introduction

Consider a set of objects $O=\{o_1,\dots,o_n\}$, where the goal is to group them into $c$ natural groups. Objects can be described by feature vectors $X=\{x_1,\dots,x_n\}\subset\mathbb{R}^p$ such that $x_i$ is an attribute vector of dimension $p$ representing object $o_i$. Alternatively, objects can be represented using a pairwise relationship. The relationships are stored in a relational matrix $R=[r_{ij}]$, where $r_{ij}$ measures the relationship between $o_i$ and $o_j$. If $R$ is a dissimilarity relation, denoted by $D=[d_{ij}]$, then it must satisfy the following three conditions:

$$d_{ii}=0 \quad \text{for } i=1,\dots,n; \tag{1a}$$
$$d_{ij}\ge 0 \quad \text{for } i=1,\dots,n \text{ and } j=1,\dots,n; \tag{1b}$$
$$d_{ij}=d_{ji} \quad \text{for } i=1,\dots,n \text{ and } j=1,\dots,n, \tag{1c}$$

where condition (1a) is self-dissimilarity, (1b) is non-negativity, and (1c) is symmetry. A well-known relational clustering algorithm that is suitable for clustering objects described by $D$ is the relational fuzzy c-means (RFCM) proposed in [1] (Algorithm 1). RFCM, the relational dual of the FCM algorithm, takes an input dissimilarity matrix $D$ and outputs a fuzzy partition matrix $U\in M_{fcn}$, where

$$M_{fcn}=\left\{U\in\mathbb{R}^{c\times n}\ \middle|\ u_{ik}\in[0,1];\ \sum_{k=1}^{n}u_{ik}>0;\ \sum_{i=1}^{c}u_{ik}=1;\ 1\le i\le c \text{ and } 1\le k\le n\right\}$$
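As a quick illustration, conditions (1a)–(1c) can be checked numerically before clustering. The sketch below assumes NumPy; the function name is ours, not from the paper.

```python
import numpy as np

def is_dissimilarity_matrix(D, tol=1e-12):
    """Check conditions (1a)-(1c): zero diagonal, non-negativity, symmetry."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        return False
    self_dissimilarity = np.all(np.abs(np.diag(D)) <= tol)  # (1a) d_ii = 0
    non_negativity = np.all(D >= -tol)                      # (1b) d_ij >= 0
    symmetry = np.allclose(D, D.T, atol=tol)                # (1c) d_ij = d_ji
    return bool(self_dissimilarity and non_negativity and symmetry)

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])
print(is_dissimilarity_matrix(D))  # True
```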

Algorithm 1

Relational fuzzy c-means (RFCM) [1]

The duality relationship between RFCM and FCM is based on the squared Euclidean distance (2-norm) that defines the dissimilarity $d_{ij}$ between two feature vectors $x_i$ and $x_j$ describing $o_i$ and $o_j$, and the dissimilarity between the cluster center $v_i$ and $o_j$. In other words, RFCM assumes that

$$D=[d_{ij}]=\left[\,\|x_i-x_j\|_2^2\,\right]$$

The relation $D=[d_{ij}]$ is Euclidean if there exist feature vectors $X=\{x_1,\dots,x_n\}\subset\mathbb{R}^p$ with an embedding dimension $p<n$, such that $d_{ij}=\|x_i-x_j\|_2^2$ for all $i,j$. When $D$ is Euclidean, it has a realization in some Euclidean space. In this case, RFCM and FCM will produce the same partition from the relational and feature-vector representations of the data. If $D$ is not Euclidean, RFCM will still find clusters in any $D$ whose entries satisfy (1) as long as it can execute, but in this case it is possible for RFCM to experience an execution failure. This happens when the relational distances $d_{R,ik}$ between prototypes and objects in Eq. (3) become negative for some $i$ and $k$ (Algorithm 1, line 6). Another important observation about RFCM is that it expects squared dissimilarities in $D$. If the dissimilarities are not squared, meaning that we have $D'=D^{1/2}=[\sqrt{d_{ij}}]$ instead of $D$, then they must be squared before clustering with RFCM, so that $D=(D')^2$ is the Hadamard (elementwise) square of $D'$. Throughout this paper, $D$ is assumed to contain squared dissimilarities.
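The relationship between the squared matrix $D$ and its unsquared form $D'$ can be made concrete with a small NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def squared_euclidean_matrix(X):
    """Build D = [||x_i - x_j||_2^2] from feature vectors (rows of X)."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=2)

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = squared_euclidean_matrix(X)   # squared dissimilarities (what RFCM expects)
D_prime = np.sqrt(D)              # unsquared form D' = D^(1/2)
```

Squaring `D_prime` elementwise (the Hadamard square) recovers `D`, which is the form RFCM expects as input.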

Non-Euclidean relational fuzzy c-means (NERFCM) repairs RFCM “on the fly” with a self-healing property that automatically adjusts the values of $d_{R,ik}$ and the dissimilarities in $D$ in case of failure [2]. The self-healing property is based on the β-spread, which works by adding a positive constant β to the off-diagonal elements of $D$. In fact, there exists a $\beta_0$ such that the β-spread transformed matrix $D_\beta$ is Euclidean for all $\beta\ge\beta_0$. The parameter β controls the amount of spreading and must be as small as possible to minimize unnecessary dilation that distorts the original $D$, which in turn may result in the loss of cluster information. The exact value of $\beta_0$ is the largest positive eigenvalue of the matrix $PDP$, where $P=I-(1/n)\mathbf{1}\mathbf{1}^T$ and $I$ is the $n\times n$ identity matrix. The self-healing module avoids this eigenvalue computation; it is invoked during execution only when needed. When activated, this module adjusts the current $D$ by adding a minimal β-spread to all of its off-diagonal elements.
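The minimal β-spread can be sketched directly from its definition above (a NumPy sketch; `beta_spread` is our illustrative helper computing $\beta_0$ by eigendecomposition, not the on-the-fly self-healing code of [2]):

```python
import numpy as np

def beta_spread(D):
    """Minimal beta-spread: beta0 is the largest positive eigenvalue of P D P,
    and D_beta = D + beta * (11^T - I) is Euclidean for all beta >= beta0."""
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    beta0 = max(float(np.linalg.eigvalsh(P @ D @ P).max()), 0.0)
    D_beta = D + beta0 * (np.ones((n, n)) - np.eye(n))       # spread off-diagonals
    return beta0, D_beta

# A non-Euclidean D: underlying distances 1, 1, 4 violate the triangle inequality
D = np.array([[0.0,  1.0,  1.0],
              [1.0,  0.0, 16.0],
              [1.0, 16.0,  0.0]])
beta0, D_beta = beta_spread(D)
```

Because $P\mathbf{1}\mathbf{1}^TP=0$ and $PIP=P$, adding $\beta$ to the off-diagonals shifts the eigenvalues of $PD_\beta P$ down by $\beta$ on the subspace orthogonal to $\mathbf{1}$, so $\beta=\beta_0$ is exactly the smallest spread that makes $D_\beta$ Euclidean.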

An alternative to using NERFCM is to transform the matrix D by a mapping that converts it to Euclidean form (we call this operation “Euclideanizing D”), and then run RFCM on the Euclideanized matrix D˜. This approach guarantees that RFCM will not fail, since D˜ is already Euclidean. There are at least five ways to Euclideanize D, including the β-spread transformation. In addition to the β-spread transformation, this paper will study the other four Euclideanization approaches indicated under option 1 in Fig. 1. As a result of this study, we will append an “i” (short for “improved”) to RFCM, but not to NERFCM, which is NOT altered by these results. We hope to write a companion paper discussing improvements to NERFCM (which would then become iNERFCM), but attempts to find an alternative to the current “self-healing” method described in [2], which defines NERFCM, have so far met stiff resistance.

Section snippets

Euclidean distance matrices (EDM) and the iRFCM algorithm

Given a dissimilarity matrix $D$, it is known that

$$D \text{ is a Euclidean distance matrix (EDM)} \iff W(D_{0.5}) \text{ is positive semidefinite (p.s.d.)}, \tag{8}$$

where $W(D_{0.5})=PD_{0.5}P$,

$P$ is the centering matrix defined as

$$P=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T,$$

$I$ is the identity matrix and $D_{0.5}$ is defined as

$$D_{0.5}=\begin{cases}-\tfrac{1}{2}D & \text{for squared dissimilarities,}\\[2pt] -\tfrac{1}{2}(D')^2 & \text{otherwise.}\end{cases} \tag{10}$$

In (10) and below, $(D')^2$ is the Hadamard square of $D'$. The trick in using (8) is knowing whether the dissimilarities are squared, as in $D$, or not squared, as in $D'$, which determines which case of (10) to use. This is a
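The EDM test (8), with the two cases of (10), translates into a short numerical check (a sketch assuming NumPy; `is_edm` is our name for it):

```python
import numpy as np

def is_edm(D, squared=True, tol=1e-8):
    """Test (8): D is an EDM iff W = P @ D_0.5 @ P is p.s.d., where
    D_0.5 = -D/2 for squared dissimilarities and -(D')^2/2 otherwise (10)."""
    D = np.asarray(D, dtype=float)
    D05 = -0.5 * D if squared else -0.5 * D ** 2
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    W = P @ D05 @ P
    return bool(np.linalg.eigvalsh(W).min() >= -tol)
```

For example, the squared distances among three collinear points pass the test, while squared distances 1, 1, 16 (underlying distances 1, 1, 4, violating the triangle inequality) fail it.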

Experimental results

We have three types of ground truth partitions UGT for the examples in this section. UGT per Table 2 is assumed for Example 1, adopted from another study for Example 2, and supplied as physical labels for the three Iris subsets in Example 3. Since the ground truth itself is on somewhat shaky ground, we will use just one external cluster validity index, viz., the Hubert and Arabie Adjusted Rand Index (ARI), which is ably discussed in [13]. We denote this index as ARI(U|UGT). This ARI (there are
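For readers who want to reproduce the evaluation metric, the Hubert–Arabie ARI for crisp labelings can be computed from the contingency table (a self-contained NumPy sketch; see [13] for the definitive treatment):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie Adjusted Rand Index ARI(U | U_GT) for crisp labelings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # contingency table n_ij: counts of points in class i and cluster j
    n = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                   for k in np.unique(labels_pred)]
                  for c in np.unique(labels_true)], dtype=float)
    comb2 = lambda x: x * (x - 1) / 2.0          # "x choose 2", elementwise
    sum_ij = comb2(n).sum()
    sum_a = comb2(n.sum(axis=1)).sum()           # row (class) marginals
    sum_b = comb2(n.sum(axis=0)).sum()           # column (cluster) marginals
    expected = sum_a * sum_b / comb2(n.sum())
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)
```

ARI equals 1 for partitions identical up to label permutation and is near 0 for random agreement.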

Conclusion and discussion

RFCM is a popular algorithm for (fuzzily) clustering objects described by a dissimilarity data matrix $D$. But since RFCM is the relational dual of FCM, execution of the algorithm is guaranteed only when the dissimilarities in $D$ have a Euclidean representation with an embedding dimension $p<n$. If $D$ is not Euclidean, then the duality relation is violated and, most importantly, the distances $d_{R,ik}$ can become negative. There are two options to circumvent this problem. Option 2 in Fig. 1 advocates


References (23)

  • R.J. Hathaway et al., "Relational duals of the c-means clustering algorithms," Pattern Recognit. (1989)
  • R.J. Hathaway et al., "NERF c-means: non-Euclidean relational fuzzy clustering," Pattern Recognit. (1994)
  • T. Cox et al., Multidimensional Scaling (2000)
  • K.V. Mardia et al., Multivariate Analysis (Probability and Mathematical Statistics) (1979)
  • J. Benasseni et al., "On a general transformation making a dissimilarity matrix Euclidean," J. Classif. (2007)
  • S. Sattath et al., "Additive similarity trees," Psychometrika (1977)
  • R. Prim, "Shortest connection networks and some generalizations," Bell Syst. Tech. J. (1957)
  • E. Holman, "The relation between hierarchical and Euclidean models for psychological distances," Psychometrika (1972)
  • R.L. Graham et al., "On the history of the minimum spanning tree problem," IEEE Ann. Hist. Comput. (1985)
  • I. Bar-On et al., "High performance solution of the complex symmetric eigenproblem," Numer. Algorithms (1998)
  • J. Dattorro, Convex Optimization & Euclidean Distance Geometry (2005)

    Mohammed A. Khalilia received a Ph.D. in computer science (2014) from the University of Missouri. His research interests include pattern recognition, computational intelligence and natural language processing.

    James Bezdek has a Ph.D., Applied Mathematics, Cornell, 1973; past president - NAFIPS, IFSA and IEEE CIS; founding editor – Int’l. Jo. Approximate Reasoning, IEEE Transactions on Fuzzy Systems; Life fellow – IEEE and IFSA; recipient – IEEE 3rd Millennium, IEEE CIS Fuzzy Systems Pioneer, IEEE Frank Rosenblatt TFA, IPMU Kempe de Feret Medal.

    Mihail Popescu is currently an Associate Professor with the Department of Health Management and Informatics, University of Missouri. His research interests include eldercare technologies, fuzzy logic, ontologies and pattern recognition.

    James M. Keller holds the University of Missouri Curators Professorship in the Electrical and Computer Engineering and Computer Science Departments on the Columbia campus. He is also the R. L. Tatum Professor in the College of Engineering. His research interests include computational intelligence, computer vision, pattern recognition, and information fusion.
