Confidence-weighted safe semi-supervised clustering

https://doi.org/10.1016/j.engappai.2019.02.007

Highlights

  • We develop a confidence-weighted safe semi-supervised clustering method.

  • The safe confidences of the labeled samples are estimated by unsupervised clustering.

  • A local graph is constructed to safely exploit the risky prior knowledge.

Abstract

In this paper, we propose confidence-weighted safe semi-supervised clustering, where prior knowledge is given in the form of class labels. In some applications, samples may be wrongly labeled by the users. Our basic idea is therefore that different samples should have different impacts, or confidences, on the clustering performance. In our algorithm, we first partition the dataset with unsupervised clustering and compute the normalized confusion matrix Nc. Nc is used to estimate the safe confidence of each labeled sample, based on the assumption that a correctly clustered sample should have a high confidence. Then we construct a local graph that models the relationship between each labeled sample and its nearest unlabeled samples through the clustering results. Finally, a confidence-weighted fidelity term and a graph-based regularization term are incorporated into the objective function of unsupervised clustering. On the one hand, the outputs of labeled samples with high confidences are restricted to the given prior labels. On the other hand, the outputs of labeled samples with low confidences are forced to approach those of their local homogeneous unlabeled neighbors modeled by the local graph. Hence, the labeled samples are expected to be safely exploited, which is the goal of safe semi-supervised clustering. To verify the effectiveness of our algorithm, we carry out experiments over several datasets in comparison with unsupervised and semi-supervised clustering methods and achieve promising results.

Introduction

Recently, safe semi-supervised learning (S3L) has attracted much attention in the machine learning field. In some scenarios, traditional semi-supervised learning (SSL) methods may perform worse than the corresponding supervised learning (SL) methods, which restricts the practical application of SSL. In other words, unlabeled samples may be harmful to the performance. Therefore, S3L develops safe mechanisms to ensure that the learning performance is never inferior to that of SL by safely exploiting the unlabeled samples. Owing to this merit, S3L extends the application scope of SSL. In fact, some previous studies (Gan et al., 2013a, Cohen et al., 2004, Singh et al., 2009, Yang and Priebe, 2011) have analyzed the negative impact of unlabeled samples on learning performance from both theoretical and empirical perspectives. For the safe exploitation of unlabeled samples, Li and Zhou proposed two S3L methods in 2011, named S3VM_us (Li and Zhou, 2011a) and safe semi-supervised SVMs (S4VMs) (Li and Zhou, 2011b). S3VM_us introduced a safe mechanism by selecting helpful unlabeled samples with a hierarchical clustering method. Different from S3VM_us, which finds only one optimal low-density separator, S4VMs constructs multiple S3VM candidates simultaneously to reduce the risk of the unlabeled samples. Both S3VM_us and S4VMs yielded promising results and reached the goal of S3L. To date, several S3L methods (Wang and Chen, 2013, Li et al., 2016b, Wang et al., 2016, Li et al., 2016a, Dong et al., 2016, Li et al., 2017) have been proposed to alleviate the harm of unlabeled samples in SSL. However, these S3L methods were mainly designed for semi-supervised classification. Furthermore, Li et al. (2017) proposed safe semi-supervised regression (SAFER) for semi-supervised regression. That is to say, past studies have focused mainly on classification and regression; there is no related work on safe semi-supervised clustering.

In fact, past decades have witnessed the success of semi-supervised clustering in various practical applications. The goal of semi-supervised clustering is to fully utilize prior knowledge, such as class labels and pair-wise constraints, to aid the clustering procedure. Many semi-supervised clustering methods (Gan et al., 2015, Zhang and Lu, 2009, Basu et al., 2002, Chen and Feng, 2012, Givoni and Frey, 2009, Bensaid et al., 1996, Pedrycz and Waletzky, 1997) are developed from traditional unsupervised clustering methods such as k-means (Hartigan and Wong, 1979), Gaussian mixture models (GMM) (Chen et al., 2011), and Fuzzy c-Means (FCM) (Bezdek, 1981). Traditional semi-supervised clustering generally hypothesizes that the prior knowledge benefits the clustering performance. However, flawed prior knowledge (e.g., wrongly labeled samples and noise) may degrade performance, as observed in semi-supervised classification and regression. Yin et al. (2010) discussed the negative impact of noisy pair-wise constraints and pointed out that wrong prior knowledge yields inferior clustering performance.

Based on the two aspects mentioned above, it is meaningful and worthwhile to design a safe semi-supervised clustering method that can outperform the corresponding unsupervised and semi-supervised clustering methods. Recently, Gan et al. (2018) developed Local Homogeneous Consistent Safe Semi-Supervised FCM (LHC-S3FCM), where class labels are given as the prior knowledge. A new graph-based regularization term was built for LHC-S3FCM, which requires the outputs of a labeled sample and its nearest homogeneous unlabeled ones to be similar. However, LHC-S3FCM implicitly assumes that all labeled samples hurt the clustering performance equally.

Hence, we propose confidence-weighted safe semi-supervised clustering in this paper. Different from LHC-S3FCM, our basic idea is that different samples should have different impacts, or confidences, on the performance degeneration. In our algorithm, we first partition the dataset with unsupervised clustering and compute the normalized confusion matrix from the clustering results. The probability distribution in the normalized confusion matrix is used to compute the safe confidence of each labeled sample, based on the assumption that a correctly clustered sample should have a high confidence. Then we construct a local graph similar to that in the literature (Gan et al., 2018). The graph models the relationship between each labeled sample and its nearest unlabeled samples through the clustering results. Finally, a confidence-weighted fidelity term and a graph-based regularization term are incorporated into the objective function of unsupervised clustering. On the one hand, the outputs of labeled samples with high confidences are restricted to the given prior labels. On the other hand, the outputs of labeled samples with low confidences are forced to approach those of their local homogeneous unlabeled neighbors modeled by the local graph. In this sense, the outputs of the labeled samples in our algorithm are a tradeoff between the given labels and the outputs of the local nearest neighbors. Hence, the labeled samples are expected to be safely exploited, which is the goal of safe semi-supervised clustering. The main contributions of the paper can be summarized as:
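The confidence-estimation step described above can be sketched as follows. This is a hypothetical illustration rather than the paper's exact formulation: k-means stands in for the generic unsupervised clustering step, a confusion matrix is built between cluster assignments and the given labels, and its row-normalized form Nc supplies each labeled sample's safe confidence. The function name `safe_confidences` and the use of scikit-learn's KMeans are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def safe_confidences(X, labels, n_clusters, seed=0):
    """Sketch: estimate a safe confidence for each labeled sample
    from an unsupervised clustering of the data.

    labels[i] is the user-given class of X[i]; its confidence is the
    fraction of samples in X[i]'s cluster sharing that label, read
    from the row-normalized confusion matrix Nc."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)
    n_classes = labels.max() + 1
    # Confusion matrix: rows = clusters, columns = given labels.
    C = np.zeros((n_clusters, n_classes))
    for c, y in zip(clusters, labels):
        C[c, y] += 1
    Nc = C / C.sum(axis=1, keepdims=True)  # row-normalize per cluster
    # A sample clustered together with many same-labeled samples
    # receives a high confidence; a mislabeled outlier receives a low one.
    return Nc[clusters, labels]
```

On two well-separated groups, a sample whose given label disagrees with its clustered neighbors receives a lower confidence than a consistently labeled one, matching the assumption that correctly clustered samples should have high confidence.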

  • 1.

    We develop a confidence-weighted safe semi-supervised clustering method which can safely exploit the labeled samples.

  • 2.

    The safe confidences of the labeled samples are estimated by unsupervised clustering, which is free from wrong or noisy labels.

  • 3.

    A local graph is constructed to model the relationship between each labeled sample and its nearest unlabeled samples, and the graph structure is used to safely exploit the risky prior knowledge.
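The local graph of contribution 3 can be sketched as below. This is a minimal illustration under our own assumptions (function name, Euclidean distance, and the exact neighbor rule are not taken from the paper): each labeled sample is linked to its k nearest unlabeled samples that fall in the same cluster, i.e., its local homogeneous neighbors.

```python
import numpy as np

def local_graph(X, labeled_idx, unlabeled_idx, clusters, k=3):
    """Sketch: link each labeled sample to its k nearest unlabeled
    samples within the same cluster ("local homogeneous neighbors").

    Returns a dict mapping each labeled index to an array of
    neighbor indices, ordered by increasing Euclidean distance."""
    graph = {}
    for i in labeled_idx:
        # Candidate neighbors: unlabeled samples in the same cluster.
        same = [j for j in unlabeled_idx if clusters[j] == clusters[i]]
        if not same:
            graph[i] = np.array([], dtype=int)
            continue
        d = np.linalg.norm(X[same] - X[i], axis=1)
        order = np.argsort(d)[:k]
        graph[i] = np.array(same)[order]
    return graph
```

In the full method, these edges would carry the graph-based regularization: a low-confidence labeled sample's output is pulled toward the outputs of the neighbors stored in `graph[i]`.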

The remainder of the paper is organized as follows: Section 2 reviews the related work. Section 3 presents the details of our algorithm. Section 4 reports the results on several datasets. Finally, Section 5 gives the conclusion and future work.

Related work

In S3L, since Li and Zhou proposed S3VM_us (Li and Zhou, 2011a) and S4VMs (Li and Zhou, 2011b) in 2011, several S3L methods have been developed to safely exploit the unlabeled samples and have achieved promising results. Wang and Chen (2013) developed safety-aware SSCCM (SA-SSCCM), which extends the semi-supervised classification method based on class membership (SSCCM). The performance of SA-SSCCM is never significantly inferior to that of SL and seldom significantly inferior to

Motivation

Traditional semi-supervised clustering generally assumes that the labeled samples always benefit the performance improvement. However, in some scenarios, samples may be wrongly labeled by the users. In other words, the given labels may differ from the true ones. Traditional semi-supervised clustering does not consider the risk of such wrongly labeled samples, and it is reasonable to assume that different samples should have different impacts or safe confidences on the

Experimental analysis

In this section, we carry out a series of experiments over several datasets to evaluate the performance of our algorithm. Performance is measured by clustering accuracy, and the effectiveness of our algorithm is verified by comparison with the following methods:

  • k-means (Jain, 2010)

  • GMM (Bouman, 1997)

  • FCM (Bezdek, 1981)

  • seeded-kmeans (Basu et al., 2002)

  • semiGMM (Martinez-Uso et al., 2010)

  • SSFCM (Pedrycz and Waletzky, 1997)

  • SFCM (Mai and Ngo, 2015)

  • SKFCM-F (Mai and Ngo, 2018)

  • LHC-S3FCM (Gan et al., 2018)
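The paper does not spell out its clustering-accuracy formula, but the standard definition matches predicted cluster ids to true labels via the best one-to-one assignment (Hungarian algorithm). A minimal sketch, assuming the usual definition and SciPy's `linear_sum_assignment`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Standard clustering accuracy: find the one-to-one mapping from
    predicted cluster ids to true labels that maximizes the number of
    agreements, then return the fraction of correctly mapped samples."""
    n = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((n, n), dtype=int)   # count[p, t]: cluster p, label t
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # Negate to turn the maximization into a min-cost assignment.
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(y_true)
```

Because cluster ids are arbitrary, a clustering that perfectly separates the classes scores 1.0 even when its labels are permuted relative to the ground truth.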

Conclusion

This paper presents CS3FCM for safe semi-supervised clustering. CS3FCM assumes that different samples should have different safe confidences on the performance, and the confidences are estimated through FCM, which is free from the wrong labels. The safe mechanism balances the tradeoff between the given labels and the outputs of the local homogeneous unlabeled samples. The experimental results show that our algorithm can alleviate the harm of wrongly labeled samples. In

Acknowledgments

The work was supported by Zhejiang Provincial Natural Science Foundation of China under grant No. LY19F020040, and National Natural Science Foundation of China under grant No. 61601162, 61771178 and 61671197, and Zhejiang Provincial Natural Science Foundation of China under grant No. LY17F030021 and LY18F030009.

Declarations of interest

None.

References (43)

  • Yin, X., et al.

    Semi-supervised fuzzy clustering with metric learning and entropy regularization

    Knowl.-Based Syst.

    (2012)
  • Zhang, H., et al.

    Semi-supervised fuzzy clustering: A kernel-based approach

    Knowl.-Based Syst.

    (2009)
  • de Amorim, R.C., et al.

    Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering

    Pattern Recognit.

    (2012)
  • Basu, S., et al.

    Semi-supervised clustering by seeding

  • Basu, S., et al.

    A probabilistic framework for semi-supervised clustering

  • Bezdek, J.C.

    Pattern Recognition with Fuzzy Objective Function Algorithms

    (1981)
  • Bilenko, M., et al.

    Integrating constraints and metric learning in semi-supervised clustering

  • Bouman, C.A., 1997. Cluster: An unsupervised algorithm for modeling Gaussian mixtures. Available from...
  • Cohen, I., et al.

    Semisupervised learning of classifiers: theory, algorithms, and their application to human–computer interaction

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2004)
  • Ding, S., et al.

    Research of semi-supervised spectral clustering algorithm based on pairwise constraints

    Neural Comput. Appl.

    (2014)
  • Frank, A., Asuncion, A., 2010. UCI machine learning...