Knowledge-Based Systems

Volume 22, Issue 6, August 2009, Pages 477-481

Semi-supervised fuzzy clustering: A kernel-based approach

https://doi.org/10.1016/j.knosys.2009.06.009

Abstract

Semi-supervised clustering algorithms aim to improve clustering accuracy under the supervision of a limited amount of labeled data. Since kernel-based approaches, such as the kernel-based fuzzy c-means algorithm (KFCM), have been used successfully in classification and clustering problems, in this paper we propose a novel semi-supervised clustering approach built on KFCM and denote it the semi-supervised kernel fuzzy c-means algorithm (SSKFCM). The objective function of SSKFCM is defined by adding the classification errors of both the labeled and the unlabeled data, and its global optimum is obtained by repeatedly updating the fuzzy memberships and the optimized kernel parameter. Because the objective function may have more than one local optimum, we employ a function transformation technique to reformulate the objective function after a local minimum has been obtained, and select the best optimum found as the solution. Experimental results on an artificial data set and several real data sets show that SSKFCM performs better than its conventional counterparts and achieves the most accurate clustering results when the kernel parameter is optimized.

Introduction

Semi-supervised clustering addresses the problem of building better clusters by using labeled data together with a large amount of unlabeled data. Since unlabeled data can be obtained much more easily than labeled data, and labeling an instance is difficult and time-consuming, semi-supervised clustering algorithms have recently received a great deal of attention in machine learning and data mining. Clustering algorithms are commonly used to group totally unlabeled data when no prior information about the data structure or models is known. Sometimes, however, we need to handle labeled and unlabeled data together. The challenge in this situation is to determine how to incorporate the prior information about clusters into the algorithm and thereby improve the clustering performance. Many approaches have been proposed in the literature, such as EM with generative mixture models [17], self-training [15], co-training [1], [4], transductive support vector machines [7], [13], graph-based approaches [3], [22] and fuzzy c-means [2].

Recently, kernel-based methods have become popular tools in the data mining and machine learning communities for solving classification and regression problems. In kernel-based approaches, a mapping function transforms the data from the original input space to a high-dimensional feature space, and the data are classified in that feature space. Because the mapping function may not be expressible explicitly, a kernel function is used to represent the dot product of two data points in the feature space. Several kernel-based algorithms have been proposed. Support vector machines are the best-known algorithms using the idea of kernel substitution [6]. Others include kernel principal component analysis, kernel Fisher discriminant analysis and several recent kernel-based clustering algorithms, such as kernel k-means [10] and kernel fuzzy c-means [20]. The performance of clustering algorithms in the kernel-induced feature space has been evaluated on various data sets [14]. The evaluation results show that each kernel-based clustering algorithm works better than its original counterpart on almost all the data sets used in the experiments. This supports our decision to use kernel-based methods to solve semi-supervised clustering problems.

It has been noted in the literature that the values of the kernel parameters in kernel-based methods influence the clustering performance and should be optimized. Empirical and cross-validation approaches have been used to optimize the kernel parameter [19], and another approach learns the kernel parameter by minimizing an objective function [21]. These kernel parameter learning methods yield either non-optimized or sub-optimal parameter values, and none of them guarantees finding the globally optimal parameter values.

Kernel-based semi-supervised clustering has been proposed and applied in many fields. Kernel versions of semi-supervised dimensionality reduction frameworks were proposed [18], and a kernel-based distance was evaluated in the enhancement of fuzzy clustering through mechanisms of partial supervision [5]. A weakly supervised optimization of a feature vector set was introduced to improve the representation of digital document collections, and this feature-based approach to semi-supervised similarity learning was validated on synthetic and real data [11]. Kernel selection was implemented for semi-supervised kernel machines [9]. Classification with SVMs was considerably enhanced by using a kernel function learned from the training data prior to discrimination, and this kernel was shown to enhance retrieval based on data similarity [12]. Based on a data-dependent kernel function, a unified kernel optimization framework was proposed to simplify the optimization of objective functions defined as any discriminant criterion formulated in a pairwise manner [8]. To our knowledge, incorporating the kernel-based approach together with the learning of the kernel parameter has not been discussed in the literature.

In this paper, a kernel-based fuzzy algorithm is proposed to learn clusters from both labeled and unlabeled data. The kernel parameter, which has a great impact on the clustering performance, is also learned during the learning process. We define an objective function that takes into account the labeled data, the unlabeled data and the kernel parameter, and use a function deflection approach to learn the optimized variable values of the objective function.

Section snippets

Kernel-based semi-supervised clustering

For a given labeled data set L and an unlabeled data set U in the d-dimensional input space, a mapping function $\Phi$ transforms the data points to a high-dimensional feature space. For example, given a data point $x_i \in L \cup U$, its image in the feature space is $\Phi(x_i)$. A kernel function $K(x_i, x_j)$ is defined as the dot product of the images of $x_i$ and $x_j$ in the feature space, that is, $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. It is obvious that the kernel can be computed without knowing the concrete form of $\Phi$.
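As a concrete illustration, the sketch below evaluates a Gaussian (RBF) kernel, a common choice whose width parameter corresponds to the σ learned later in the paper; the kernel form and the exact scaling of σ in the exponent are assumptions here, since the snippet does not state them. The point is that K(xi, xj) is computed directly in the input space, without ever constructing Φ.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """Evaluate K(xi, xj) = exp(-||xi - xj||^2 / sigma^2).

    The value equals the dot product Phi(xi) . Phi(xj) in the induced
    feature space, but the map Phi itself is never formed (kernel trick).
    """
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma ** 2))

# Two points in a 2-dimensional input space.
print(gaussian_kernel([0.0, 1.0], [1.0, 0.5], sigma=1.0))
```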

Learning the fuzzy clustering

The fuzzy memberships and the kernel parameter $\sigma$ are updated, and the cluster center $v_i$ is recalculated in each learning iteration. Data contributions to the cluster centers depend on both the fuzzy memberships and the distance between the data and the center. Under these conditions, the $i$-th cluster centroid is calculated as
$$\Phi(v_i) = \frac{\sum_{j=1}^{l} \mu_{ij}\, K(x_j, v_i)\, \Phi(x_j) + \sum_{j=1}^{u} \mu_{ij}\, K(y_j, v_i)\, \Phi(y_j)}{\sum_{j=1}^{l} \mu_{ij}\, K(x_j, v_i) + \sum_{j=1}^{u} \mu_{ij}\, K(y_j, v_i)}$$
where the $x_j$ denote the $l$ labeled data and the $y_j$ the $u$ unlabeled data. At the initial stage, the centroid of the labeled data is calculated as
$$\Phi(v_{i,0}) = \sum_{j=1}^{l} \mu_{ij,0}\, K(x_j, v_i)\,\ldots$$
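To make the update concrete, here is a minimal sketch of one center-update pass. It assumes a Gaussian kernel, for which K(·, vi) can be evaluated with the center vi kept in the input space, so the feature-space centroid above reduces to a kernel-weighted mean of the data points. The variable names, this input-space shortcut, and the omission of the fuzzifier exponent m (matching the formula as shown in the snippet) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def update_centers(X_labeled, X_unlabeled, U_labeled, U_unlabeled, V, sigma):
    """One kernel-weighted center-update pass (illustrative sketch).

    X_labeled: (l, d) labeled points, X_unlabeled: (u, d) unlabeled points.
    U_labeled: (l, c) and U_unlabeled: (u, c) fuzzy memberships mu_ij,
    indexed here as [data point j, cluster i].
    V: (c, d) current cluster centers kept in the input space (assumes a
    Gaussian kernel so K(x, v) is computable without the explicit map Phi).
    """
    def k(a, b):
        d = a - b
        return np.exp(-np.dot(d, d) / sigma ** 2)

    X_all = np.vstack([X_labeled, X_unlabeled])   # all data, labeled first
    U_all = np.vstack([U_labeled, U_unlabeled])   # matching memberships
    V_new = np.empty_like(V)
    for i in range(V.shape[0]):
        # weight of point j for cluster i: mu_ij * K(x_j, v_i)
        w = np.array([U_all[j, i] * k(X_all[j], V[i]) for j in range(X_all.shape[0])])
        V_new[i] = (w[:, None] * X_all).sum(axis=0) / w.sum()
    return V_new

# Tiny usage example with random data (2 clusters, 2-d points).
rng = np.random.default_rng(0)
Xl, Xu = rng.normal(size=(4, 2)), rng.normal(size=(6, 2))
Ul, Uu = rng.dirichlet(np.ones(2), size=4), rng.dirichlet(np.ones(2), size=6)
V0 = rng.normal(size=(2, 2))
print(update_centers(Xl, Xu, Ul, Uu, V0, sigma=1.0))
```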

Semi-supervised fuzzy cluster algorithm

Based on the above discussion, we derive the following learning algorithm SSKFCM based on the deflection technique and the kernel function.

  • (1) Initialize the values of σ and μij for the labeled and unlabeled data.

    • (a) μij,0 is given by domain experts and is the initialized value for the labeled data.

    • (b) For each of the unlabeled data, initialize the fuzzy memberships as follows: generate c positive random numbers (r1j, …, rcj) lying in the interval [0, 1] for the j-th unlabeled datum. Initialize the fuzzy memberships … (a minimal sketch of one plausible completion of this step follows the list).
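A minimal sketch of the membership initialization for the unlabeled data, following step (1)(b): c positive random numbers are drawn per point and, as an assumption consistent with standard fuzzy c-means practice (the snippet is cut off before stating the exact rule), normalized so that each point's memberships over the c clusters sum to 1.

```python
import numpy as np

def init_memberships(n_unlabeled, c, rng=None):
    """Random fuzzy-membership initialization for the unlabeled data.

    For the j-th unlabeled point, draw c positive random numbers in (0, 1]
    and normalize them so the memberships over the c clusters sum to 1
    (the normalization is an assumed convention from standard FCM).
    """
    rng = np.random.default_rng() if rng is None else rng
    R = rng.uniform(low=1e-6, high=1.0, size=(n_unlabeled, c))  # positive draws
    return R / R.sum(axis=1, keepdims=True)                     # each row sums to 1

U0 = init_memberships(n_unlabeled=5, c=3)
print(U0.sum(axis=1))  # -> all ones
```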

Experimental results

In this section, we present experimental results comparing SSKFCM with SSKFCM using a non-optimized σ (SSKFCM-none) and with the semi-supervised fuzzy c-means algorithm (SSFCM) without data mapping. The results are obtained from experiments on the artificial data set Circles and the real-world data sets Iris, Bupa and Pendigits from the UCI machine learning repository.

The parameter m controlling the fuzziness is set to 2, and the kernel parameter σ is initially fixed

Conclusions

In this paper, a novel algorithm, SSKFCM, is proposed, which can optimize semi-supervised clustering using both labeled and unlabeled data. The analysis of this algorithm is based on a kernel-based approach, and a new objective function is defined which considers both the labeled data clustering error and the unlabeled data clustering error. Learning the data clustering is reduced to finding the optimal solutions of the objective function. The resulting algorithm has several advantages over other

Acknowledgement

This research is partially supported by the following programs: a grant from the Technology Research and Development Programs of Science and Technology Commission Foundation of Shandong Province, China (No. 2008GG10001015); the Research Project of Shandong Bureau of Education, China (No. J07YJ04); the High Technology Creative Research and Development Programs of Science and Technology Commission Foundation of Shandong Province, China (No. 2007ZZ17); the Natural Science Fund of Science and

References (22)

  • O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the 10th...