A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data

https://doi.org/10.1016/j.eswa.2010.03.028

Abstract

Partially missing data sets are a prevailing problem in clustering analysis. In this paper, missing attributes are represented as intervals, and a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals is proposed. The algorithm estimates the nearest-neighbor interval representation of missing attributes by making full use of the attribute distribution information of the data set, which can enhance the robustness of missing attribute imputation compared with other numerical imputation methods. Also, the convex hyper-polyhedrons formed by interval prototypes can represent the uncertainty of missing attributes and simultaneously reflect the shape of the clusters to some degree, which helps enhance the robustness of clustering analysis. Comparisons and analysis of the experimental results for several UCI data sets demonstrate the capability of the proposed algorithm.

Introduction

The fuzzy c-means (FCM) algorithm (Bezdek, 1981) is a useful tool for clustering. It partitions a real s-dimensional dataset $X=\{x_1,x_2,\ldots,x_n\}\subset\mathbb{R}^s$ into several clusters to describe an underlying structure within the data, and it has been extensively used in pattern recognition and data mining. However, in pattern classification applications, many datasets suffer from incompleteness, i.e. a dataset X can contain vectors that are missing one or more attribute values as a result of failures in data collection, measurement errors, missing observations, random noise, etc. FCM is not directly applicable to such incomplete datasets.

The problem of pattern recognition with incomplete data can be traced back to the 1960s, when Sebestyen (1962) introduced an approach based on probabilistic assumptions. Subsequently, the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) was used to handle incomplete data and probabilistic clustering (McLachlan & Basford, 1988). In 1998, several methods were proposed for handling missing values in FCM (Miyamoto, Takata, & Umayahara, 1998). One basic strategy, imputation, replaces the missing values with the weighted averages of the corresponding attributes. Another approach, discarding/ignoring, ignores the missing values and calculates the distances from the remaining coordinates. In 2001, Hathaway and Bezdek proposed further strategies for FCM clustering of incomplete data (Hathaway & Bezdek, 2001). One simple strategy, the whole data strategy (WDS), removes all sample data that include missing values from the dataset, but it is undesirable because the elimination discards information. Another method, the partial distance strategy (PDS), calculates partial distances using all available attribute values and scales this quantity by the reciprocal of the proportion of components used. Two further methods proposed by Hathaway and Bezdek (2001) are imputation methods, which replace the missing values with estimates based on the available information. The optimal completion strategy (OCS) treats the estimation of missing values as an optimization problem and re-imputes them in each iteration to obtain better estimates. The nearest prototype strategy (NPS) replaces missing values with the corresponding attributes of the nearest prototype. In addition to the above methods, Timm, Doring, and Kruse (2004) took into account information about why data are missing and developed a fuzzy clustering algorithm that extends the Gath and Geva algorithm. Hathaway and Bezdek (2002) used triangle inequality-based approximation schemes to cluster incomplete relational data, and Honda and Ichihashi (2004) partitioned incomplete datasets into several linear fuzzy clusters by extracting local principal components.
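
The partial distance strategy described above can be illustrated with a minimal sketch. It assumes that missing attributes are encoded as NaN and that the squared Euclidean distance is used; the function name `partial_distance_sq` is hypothetical and not taken from the cited work.

```python
import math


def partial_distance_sq(x, v):
    """Squared partial distance between datum x (may contain NaN) and prototype v.

    Squared differences are summed over the observed attributes of x only,
    then scaled by s / (number of observed attributes), i.e. the reciprocal
    of the proportion of components used.
    """
    s = len(x)
    observed = [j for j in range(s) if not math.isnan(x[j])]
    if not observed:
        raise ValueError("datum has no observed attributes")
    total = sum((x[j] - v[j]) ** 2 for j in observed)
    return total * s / len(observed)


# Assumed toy example: the second attribute of x is missing.
x = [1.0, float("nan"), 3.0, 4.0]
v = [0.0, 5.0, 1.0, 4.0]
print(partial_distance_sq(x, v))
```

For a complete datum the scaling factor is 1, so the function reduces to the ordinary squared Euclidean distance.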

In this paper, by adopting the idea of the nearest-neighbor rule, a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals (FCM-NNI) is proposed. Firstly, because missing attributes are uncertain, they are represented by nearest-neighbor intervals (NNI) derived from nearest-neighbor information, which are more robust than the numerical values obtained by the imputation methods mentioned above. Secondly, the clustering problem can thus be viewed as clustering of interval-valued data, which results in interval cluster prototypes rather than point prototypes. The convex hyper-polyhedrons formed by interval prototypes in the attribute space, as a kind of cluster prototype with a more complicated geometrical structure, can therefore represent the uncertainty of missing attributes and at the same time reflect the shape of the clusters to some degree, which makes the clustering more robust and its results more accurate.

This paper is organized as follows. Section 2 presents a short description of the FCM algorithm and of the FCM clustering algorithm for interval-valued data (IFCM), both based on minimization of a clustering objective function. The nearest-neighbor interval representation of missing attributes and the novel FCM-NNI algorithm are introduced in Section 3. Section 4 presents clustering results for several UCI data sets and a comparative study of the proposed algorithm with various other methods for handling missing values in FCM. Finally, conclusions are drawn in Section 5.

Section snippets

Fuzzy c-means algorithm

The fuzzy c-means (FCM) algorithm partitions a set of complete data $X=\{x_1,x_2,\ldots,x_n\}\subset\mathbb{R}^s$ into $c$ (fuzzy) clusters by minimizing the clustering objective function

$$J(U,V)=\sum_{i=1}^{c}\sum_{k=1}^{n}u_{ik}^{m}\,\|x_k-v_i\|_2^2,$$

with the constraint

$$\sum_{i=1}^{c}u_{ik}=1,\quad \text{for } k=1,2,\ldots,n,$$

where $x_k=[x_{1k},x_{2k},\ldots,x_{sk}]^T$ is an object datum, and $x_{jk}$ is the $j$th attribute value of $x_k$; $v_i$ is the $i$th point cluster prototype, $v_i\in\mathbb{R}^s$, and let the matrix of cluster prototypes $V=[v_{ji}]=[v_1,v_2,\ldots,v_c]\in\mathbb{R}^{s\times c}$ for convenience; $u_{ik}$ is the membership that represents the
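
As an illustration of how this objective is typically minimized for complete data, the sketch below alternates the standard prototype and membership updates. The update formulas are the usual ones for FCM and are not shown in this excerpt; the function names, the random initialization, and the stopping rule are assumptions made for the example.

```python
import random


def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means for complete data; returns (U, V).

    U[i][k] is the membership of datum k in cluster i, V[i] is prototype i.
    """
    rng = random.Random(seed)
    n, s = len(X), len(X[0])
    # Random initial memberships, normalised so each datum's memberships sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    V = []
    for _ in range(n_iter):
        # Prototype update: mean of the data weighted by u_ik^m.
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            tot = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(n)) / tot for j in range(s)])
        # Membership update from squared distances to the prototypes.
        change = 0.0
        for k in range(n):
            d2 = [sum((X[k][j] - V[i][j]) ** 2 for j in range(s)) for i in range(c)]
            zeros = [i for i in range(c) if d2[i] == 0.0]
            for i in range(c):
                if zeros:
                    # Datum coincides with a prototype: share membership among those.
                    new = 1.0 / len(zeros) if i in zeros else 0.0
                else:
                    new = 1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c))
                change = max(change, abs(new - U[i][k]))
                U[i][k] = new
        if change < tol:
            break
    return U, V


def objective(X, U, V, m=2.0):
    """Value of J(U, V) as defined above."""
    return sum(
        U[i][k] ** m * sum((xj - vj) ** 2 for xj, vj in zip(X[k], V[i]))
        for i in range(len(V))
        for k in range(len(X))
    )


# Assumed toy data: two well-separated groups in the plane.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
U, V = fcm(X, c=2)
print(V, objective(X, U, V))
```

The membership update works on squared distances, so the exponent 1/(m-1) applied to their ratio corresponds to the exponent 2/(m-1) on the ratio of plain distances.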

Nearest-neighbor intervals determination

Recently, nearest-neighbor (NN) based techniques have been proposed for the imputation of missing values. A simple NN imputation method substitutes the missing attribute with the corresponding attribute of the nearest neighbor (Stade, 1996). In another popular approach, k-nearest-neighbor imputation (Acuna & Rodriguez, 2004), missing attributes are filled in with the mean value of the attribute over the k nearest neighbors. Subsequently, many similarity measures other than Euclidean
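
The k-nearest-neighbor imputation described above might be sketched as follows. The encoding of missing values as NaN, the distance computed over the attributes observed in both data, and all function names are assumptions for illustration. The `nn_interval` helper, which takes the minimum and maximum of the neighbors' values, is only one plausible way to turn neighbor information into an interval; the excerpt does not give the paper's exact definition.

```python
import math


def neighbor_values(X, k, j, q):
    """Values of attribute j among the q nearest neighbours of datum k.

    Only data with attribute j observed are candidates. Distance is the mean
    squared difference over attributes observed in both data (an assumption).
    """
    cands = []
    for idx, y in enumerate(X):
        if idx == k or math.isnan(y[j]):
            continue
        shared = [t for t in range(len(y))
                  if not math.isnan(X[k][t]) and not math.isnan(y[t])]
        if not shared:
            continue
        d = sum((X[k][t] - y[t]) ** 2 for t in shared) / len(shared)
        cands.append((d, y[j]))
    if not cands:
        raise ValueError("no neighbour with attribute j observed")
    cands.sort(key=lambda pair: pair[0])
    return [val for _, val in cands[:q]]


def knn_impute(X, k, j, q):
    """Mean of attribute j over the q nearest neighbours of datum k."""
    vals = neighbor_values(X, k, j, q)
    return sum(vals) / len(vals)


def nn_interval(X, k, j, q):
    """Hypothetical interval estimate: [min, max] of the neighbours' values."""
    vals = neighbor_values(X, k, j, q)
    return min(vals), max(vals)


# Assumed toy data: the last datum is missing its second attribute.
nan = float("nan")
X = [[1.0, 2.0], [1.1, 2.2], [0.9, 1.8], [5.0, 9.0], [1.0, nan]]
print(knn_impute(X, 4, 1, 3), nn_interval(X, 4, 1, 3))
```

With q = 1, `knn_impute` reduces to the simple NN imputation that copies the nearest neighbor's attribute value.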

Data sets

In the experiments presented below, we tested the performance of the proposed FCM-NNI on three well-known data sets: IRIS, Wine, and Bupa Liver Disorder. All of these databases are taken from the UCI machine learning repository (Hettich, Blake, & Merz, 1998) and are often used as standard databases to test the performance of clustering algorithms.

The IRIS data contains 150 four-dimensional attribute vectors, depicting four attributes of iris flowers, which include Petal Length, Petal Width, Sepal Length

Conclusion

This paper addresses the problem of clustering incomplete data and proposes a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals (FCM-NNI). The proposed algorithm has two main advantages. Firstly, interval estimations of missing attributes are obtained by making full use of the attribute distribution information of the data set, which is superior in expressing the uncertainty of missing attributes, and enhances the robustness of missing attributes

Acknowledgments

The authors would like to express their gratitude to the editor, associate editor, and all reviewers for their valuable suggestions for improving this manuscript.
