Abstract
Semi-supervised classification has recently become an active research topic, and a number of algorithms, such as self-training, have been proposed to improve the performance of supervised classification using unlabeled data. Considering the influence of the spatial distribution of the data set and of mislabeled samples on classification performance, this paper proposes an improved self-training algorithm based on density peaks and the cut edge weight statistic. First, representative unlabeled samples are selected for label prediction according to the spatial structure of the data, which is discovered by a clustering method based on density peaks. Second, the cut edge weight is used as a statistic in a hypothesis test that identifies whether samples have been labeled correctly. Third, the labeled data set is gradually enlarged with the correctly labeled samples. These steps are iterated until all unlabeled samples are labeled. The improved self-training framework not only makes full use of spatial structure information but also mitigates the problem that some samples may be classified incorrectly, thereby substantially improving classification accuracy. Extensive experiments on benchmark data sets clearly demonstrate the effectiveness of the proposed algorithm.
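The three steps described in the abstract can be sketched in a few dozen lines. The sketch below is an illustrative simplification, not the authors' implementation: local density and delta follow the density-peaks definition of Rodriguez and Laio (2014); the cut-edge-weight hypothesis test is stood in for by a weighted disagreement ratio over the k nearest labeled neighbours with a fixed threshold `theta` (both names are choices made here, not taken from the paper); label prediction uses a 1-NN classifier, and samples failing the check simply stay unlabeled rather than being re-examined in later iterations.

```python
import numpy as np

def density_peaks(X, dc):
    """Local density rho (Gaussian kernel of width dc) and delta
    (distance to the nearest point of higher density), as defined by
    Rodriguez & Laio (2014). Returns rho, delta and the distance matrix."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # subtract self-term
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta, d

def cut_edge_ratio(i, label, y, d, k):
    """Weighted fraction of i's k nearest labeled neighbours that
    disagree with `label` -- a simplified stand-in for the cut-edge-weight
    statistic (Muhlenbach et al. 2004); weights decay with distance."""
    labeled = np.where(y >= 0)[0]
    labeled = labeled[labeled != i]
    if labeled.size == 0:
        return 0.0
    nn = labeled[np.argsort(d[i, labeled])[:k]]
    w = 1.0 / (d[i, nn] + 1e-12)
    return w[y[nn] != label].sum() / w.sum()

def self_training(X, y, dc=1.0, k=3, theta=0.5):
    """y holds class indices for labeled samples and -1 for unlabeled.
    Unlabeled samples are visited in decreasing density (denser points
    are treated as more representative), labeled by 1-NN against the
    current labeled set, and accepted only if the cut-edge check passes."""
    y = y.copy()
    rho, delta, d = density_peaks(X, dc)
    for i in np.argsort(-rho):
        if y[i] >= 0:
            continue
        labeled = np.where(y >= 0)[0]
        label = y[labeled[np.argmin(d[i, labeled])]]  # 1-NN prediction
        if cut_edge_ratio(i, label, y, d, k) <= theta:
            y[i] = label  # enlarge the labeled set
    return y
```

On two well-separated Gaussian blobs with one labeled seed each, this propagates the correct label to every point; the paper's full method additionally exploits delta when choosing representatives and iterates until no sample remains unlabeled.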
Acknowledgements
This research was supported by National Natural Science Foundation of China (No. 61573266).
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by V. Loia.
Cite this article
Wei, D., Yang, Y. & Qiu, H. Improving self-training with density peaks of data and cut edge weight statistic. Soft Comput 24, 15595–15610 (2020). https://doi.org/10.1007/s00500-020-04887-8