Confidence-weighted safe semi-supervised clustering☆
Introduction
Recently, safe semi-supervised learning (S3L) has attracted much attention in the machine learning field. In some scenarios, traditional semi-supervised learning (SSL) methods may perform worse than the corresponding supervised learning (SL) methods, which restricts the practical applications of SSL. In other words, unlabeled samples may harm the performance. S3L therefore develops safe mechanisms that exploit the unlabeled samples in such a way that the learning performance is never inferior to that of SL. Owing to this merit, S3L extends the application scope of SSL. In fact, previous studies (Gan et al., 2013a, Cohen et al., 2004, Singh et al., 2009, Yang and Priebe, 2011) have analyzed the negative impact of unlabeled samples on learning performance from both theoretical and empirical perspectives. For the safe exploitation of unlabeled samples, Li and Zhou proposed two S3L methods in 2011, named S3VM_us (Li and Zhou, 2011a) and safe semi-supervised SVMs (S4VMs) (Li and Zhou, 2011b). S3VM_us introduced a safe mechanism by selecting the helpful unlabeled samples based on a hierarchical clustering method. Different from S3VM_us, which finds only one optimal low-density separator, S4VMs constructs multiple S3VM candidates simultaneously to reduce the risk of the unlabeled samples. Both S3VM_us and S4VMs yielded promising results and reached the goal of S3L. Since then, several S3L methods (Wang and Chen, 2013, Li et al., 2016b, Wang et al., 2016, Li et al., 2016a, Dong et al., 2016, Li et al., 2017) have been proposed to alleviate the harm of unlabeled samples in SSL. However, these S3L methods were designed mainly for semi-supervised classification. Li et al. (2017) further proposed safe semi-supervised regression (SAFER) for the regression setting. That is to say, past studies have focused mainly on classification and regression; no related work exists for semi-supervised clustering.
In fact, the past decades have witnessed the success of semi-supervised clustering in various practical applications. The goal of semi-supervised clustering is to fully utilize prior knowledge, such as class labels and pairwise constraints, to aid the clustering procedure. Many semi-supervised clustering methods (Gan et al., 2015, Zhang and Lu, 2009, Basu et al., 2002, Chen and Feng, 2012, Givoni and Frey, 2009, Bensaid et al., 1996, Pedrycz and Waletzky, 1997) have been developed from traditional unsupervised clustering methods, such as k-means (Hartigan and Wong, 1979), Gaussian mixture models (GMM) (Chen et al., 2011), and Fuzzy c-Means (FCM) (Bezdek, 1981). Traditional semi-supervised clustering generally assumes that the prior knowledge benefits the clustering performance. However, the collected prior knowledge (e.g., wrongly labeled samples and noise) may degrade performance, just as in semi-supervised classification and regression. Yin et al. (2010) discussed the negative impact of noisy pairwise constraints and pointed out that wrong prior knowledge yields inferior clustering performance.
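Since several of the semi-supervised methods above, and the algorithm developed in this paper, build on FCM, a minimal FCM iteration is sketched below for reference. It uses the standard alternating updates of memberships and centers; the random-sample initialization, tolerance, and fuzzifier value are illustrative choices, not taken from any cited work.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, init=None, seed=0):
    """Minimal Fuzzy c-Means (Bezdek, 1981): alternating updates of the
    membership matrix U (n x c, rows sum to 1) and the cluster centers."""
    rng = np.random.default_rng(seed)
    # initialize centers at random samples unless explicit centers are given
    centers = X[rng.choice(len(X), size=c, replace=False)] if init is None else init
    for _ in range(n_iter):
        # squared distances to each center, floored to avoid division by zero
        d2 = np.maximum(((X[:, None, :] - centers[None]) ** 2).sum(-1), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)            # membership update
        Um = U ** m
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # center update
        if np.abs(new_centers - centers).max() < tol:
            centers = new_centers
            break
        centers = new_centers
    return U, centers
```

Hardening the memberships with `U.argmax(axis=1)` yields the crisp partition that semi-supervised extensions typically constrain with prior knowledge.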
Based on the two aspects mentioned above, it is worthwhile to design a safe semi-supervised clustering method that can outperform the corresponding unsupervised and semi-supervised clustering methods. Recently, Gan et al. (2018) developed Local Homogeneous Consistent Safe Semi-Supervised FCM (LHC-SFCM), where class labels are given as the prior knowledge. LHC-SFCM introduced a new graph-based regularization term requiring that the outputs of a labeled sample and its nearest homogeneous unlabeled samples be similar. However, LHC-SFCM implicitly assumes that all labeled samples hurt the clustering performance equally.
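To make such a regularizer concrete, the following sketch builds the kind of local graph it relies on: each labeled sample is linked to its nearest unlabeled samples falling in the same cluster ("homogeneous" neighbors). Euclidean kNN, unweighted edges, and cluster-id agreement as the homogeneity test are illustrative assumptions, not the exact construction of Gan et al. (2018).

```python
import numpy as np

def local_graph(X, labeled_idx, clusters, k=3):
    """Adjacency matrix linking each labeled sample to its k nearest
    unlabeled samples in the same cluster ('homogeneous' neighbors)."""
    n = X.shape[0]
    labeled = set(labeled_idx)
    W = np.zeros((n, n))
    for i in labeled_idx:
        # candidate unlabeled samples sharing sample i's cluster
        cand = [j for j in range(n)
                if j not in labeled and clusters[j] == clusters[i]]
        if not cand:
            continue
        d = np.linalg.norm(X[cand] - X[i], axis=1)
        for j in np.asarray(cand)[np.argsort(d)[:k]]:
            W[i, j] = 1.0   # unweighted edge; a kernel weight is also common
    return W
```

A graph regularizer then penalizes differences between the output of a labeled sample and those of its neighbors with nonzero `W[i, j]`.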
Hence, we propose confidence-weighted safe semi-supervised clustering in this paper. Different from LHC-SFCM, our basic idea is that different samples should have different impacts, or confidences, on the performance degradation. In our algorithm, we first partition the dataset with unsupervised clustering and compute the normalized confusion matrix from the clustering results. The probability distribution in the normalized confusion matrix is then used to compute the safe confidence of each labeled sample, based on the assumption that a correctly clustered sample should have a high confidence. Next, we construct a local graph similar to that in the literature (Gan et al., 2018); the graph models the relationship between each labeled sample and its nearest unlabeled samples through the clustering results. Finally, a confidence-weighted fidelity term and a graph-based regularization term are incorporated into the objective function of unsupervised clustering. On the one hand, the outputs of labeled samples with high confidences are restricted to the given prior labels. On the other hand, the outputs of labeled samples with low confidences are forced to approach those of the local homogeneous unlabeled neighbors modeled by the local graph. In this sense, the outputs of the labeled samples in our algorithm are a tradeoff between the given labels and the outputs of local nearest neighbors, so the labeled samples are expected to be safely exploited, which is the goal of safe semi-supervised clustering. The main contributions of the paper can be summarized as follows:
- 1.
We develop a confidence-weighted safe semi-supervised clustering method which can safely exploit the labeled samples.
- 2.
The safe confidences of the labeled samples are estimated by unsupervised clustering, which is unaffected by wrong or noisy labels.
- 3.
A local graph is constructed to model the relationship between each labeled sample and its nearest unlabeled samples, and the graph structure is used to safely exploit the risky prior knowledge.
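The confidence-estimation step described above can be sketched as follows, given any hard unsupervised partition (e.g., the argmax of FCM memberships). Normalizing each cluster row of the confusion matrix into a class distribution is one plausible reading of the paper's "normalized confusion matrix"; the exact scheme may differ.

```python
import numpy as np

def safe_confidences(clusters, labeled_idx, y_labeled):
    """Confidence in [0, 1] for each labeled sample, derived from an
    unsupervised partition. `clusters` holds one cluster id per sample."""
    clusters = np.asarray(clusters)
    y_labeled = np.asarray(y_labeled)
    classes = np.unique(y_labeled)
    k = int(clusters.max()) + 1
    # confusion matrix over the labeled subset: rows = clusters, cols = classes
    conf = np.zeros((k, len(classes)))
    for i, y in zip(labeled_idx, y_labeled):
        conf[clusters[i], np.searchsorted(classes, y)] += 1
    row = conf.sum(axis=1, keepdims=True)
    P = np.divide(conf, row, out=np.zeros_like(conf), where=row > 0)
    # a labeled sample is confident when its cluster's class distribution
    # concentrates on the sample's own given label
    return np.array([P[clusters[i], np.searchsorted(classes, y)]
                     for i, y in zip(labeled_idx, y_labeled)])
```

A sample whose given label disagrees with most labels in its cluster receives a low confidence, so the fidelity term pulls it toward its graph neighbors rather than toward its (possibly wrong) label.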
The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 presents the details of our algorithm. Section 4 reports the results on several datasets. Finally, Section 5 gives the conclusion and future work.
Related work
In S3L, since Li and Zhou proposed S3VM_us (Li and Zhou, 2011a) and S4VMs (Li and Zhou, 2011b) in 2011, several S3L methods have been developed to safely exploit the unlabeled samples and have achieved promising results. Wang and Chen (2013) developed safety-aware SSCCM (SA-SSCCM), extended from the semi-supervised classification method based on class membership (SSCCM). The performance of SA-SSCCM is never significantly inferior to that of SL and seldom significantly inferior to
Motivation
Traditional semi-supervised clustering generally assumes that the labeled samples always benefit the performance improvement. However, in some scenarios, samples may be wrongly labeled by the users; that is, the sample labels may differ from the true ones. Traditional semi-supervised clustering does not consider the risk of wrongly labeled samples. It is more reasonable to assume that different samples should have different impacts or safe confidences on the
Experimental analysis
In this section, we carry out a series of experiments on several datasets to evaluate the performance of our algorithm. The performance is measured by clustering accuracy and is used to verify the effectiveness of our algorithm in comparison with the following methods:
- •
k-means (Jain, 2010)
- •
GMM (Bouman, 1997)
- •
FCM (Bezdek, 1981)
- •
seeded-kmeans (Basu et al., 2002)
- •
semiGMM (Martinez-Uso et al., 2010)
- •
SSFCM (Pedrycz and Waletzky, 1997)
- •
SFCM (Mai and Ngo, 2015)
- •
SKFCM-F (Mai and Ngo, 2018)
- •
LHC-SFCM (Gan et al., 2018)
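Clustering accuracy, the measure used for the comparison above, aligns cluster ids with class ids before counting agreements. A minimal version is sketched below; it brute-forces the label permutations, which is adequate for a handful of clusters (the Hungarian algorithm is the usual choice when there are many), and whether the paper uses exactly this definition is an assumption.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to class ids."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=int)   # count[p, t] = co-occurrences
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # try every assignment of cluster id p -> class id perm[p]
    best = max(sum(count[p, perm[p]] for p in range(k))
               for perm in permutations(range(k)))
    return best / len(y_true)
```

For example, a prediction that swaps the two cluster ids of a perfect partition still scores 1.0, since accuracy must be invariant to the arbitrary numbering of clusters.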
Conclusion
This paper presents CS3FCM for safe semi-supervised clustering. CS3FCM assumes that different samples should have different safe confidences on the performance. The confidences are estimated through FCM, which is unaffected by wrong labels. The safe mechanism is implemented by balancing the tradeoff between the given labels and the outputs of the local homogeneous unlabeled samples. The experimental results show that our algorithm can alleviate the harm of wrongly labeled samples. In
Acknowledgments
The work was supported by the Zhejiang Provincial Natural Science Foundation of China under Grants No. LY19F020040, LY17F030021 and LY18F030009, and the National Natural Science Foundation of China under Grants No. 61601162, 61771178 and 61671197.
Declarations of interest
None.
References (43)
- et al., Partially supervised clustering for image segmentation, Pattern Recognit. (1996)
- et al., Spectral clustering: A semi-supervised approach, Neurocomputing (2012)
- et al., Discriminative structure selection method of Gaussian Mixture Models with its application to handwritten digit recognition, Neurocomputing (2011)
- et al., Semi-supervised classification method through oversampling and common hidden space, Inform. Sci. (2016)
- et al., Local homogeneous consistent safe semi-supervised clustering, Expert Syst. Appl. (2018)
- et al., Towards designing risk-based safe Laplacian regularized least squares, Expert Syst. Appl. (2016)
- et al., Using clustering analysis to improve semi-supervised classification, Neurocomputing (2013)
- Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. (2010)
- et al., Multiple kernel approach to semi-supervised fuzzy clustering algorithm for land-cover classification, Eng. Appl. Artif. Intell. (2018)
- et al., Semi-supervised clustering with metric learning: An adaptive kernel method, Pattern Recognit. (2010)
- Semi-supervised fuzzy clustering with metric learning and entropy regularization, Knowl.-Based Syst.
- Semi-supervised fuzzy clustering: A kernel-based approach, Knowl.-Based Syst.
- Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering, Pattern Recognit.
- Semi-supervised clustering by seeding
- A probabilistic framework for semi-supervised clustering
- Pattern Recognition with Fuzzy Objective Function Algorithms
- Integrating constraints and metric learning in semi-supervised clustering
- Semisupervised learning of classifiers: Theory, algorithms, and their application to human–computer interaction, IEEE Trans. Pattern Anal. Mach. Intell.
- Research of semi-supervised spectral clustering algorithm based on pairwise constraints, Neural Comput. Appl.
☆ One or more of the authors of this paper have disclosed potential or pertinent conflicts of interest, which may include receipt of payment, either direct or indirect, institutional support, or association with an entity in the biomedical field which may be perceived to have potential conflict of interest with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.02.007.