Elsevier

Neurocomputing

Volume 173, Part 3, 15 January 2016, Pages 1288-1298

Semi-supervised learning combining transductive support vector machine with active learning

https://doi.org/10.1016/j.neucom.2015.08.087

Abstract

In typical data mining applications, manually labeling large amounts of data is difficult, expensive, and time-consuming. To avoid manual labeling, semi-supervised learning uses unlabeled data along with labeled data in the training process. The transductive support vector machine (TSVM) is one such semi-supervised method, which has been found effective in enhancing classification performance. However, TSVM has some deficiencies, such as the need to preset the number of positive class samples, frequent exchange of class labels, and its requirement for a large amount of unlabeled data. To tackle these deficiencies, in this paper we propose a new semi-supervised learning algorithm that combines TSVM with active learning. The algorithm applies active learning to select the most informative instances, based on the version space minimum–maximum division principle, for human annotation, thereby improving classification performance. Simultaneously, to make full use of the distribution characteristics of the unlabeled data, we add a manifold regularization term to the objective function. Experiments performed on several UCI datasets and a real-world book review case study demonstrate that our proposed method achieves significant improvement over other benchmark methods while consuming less human effort, which is very important when data are labeled manually.

Introduction

The support vector machine (SVM) is a supervised machine learning approach for solving binary pattern recognition problems. It finds the maximum-margin decision surface that separates the positive and negative labeled training examples of a class [1]. For a given data point, the absolute value of the SVM decision function is 0 if the point lies on the hyper-plane and 1 if the point is a support vector. Although SVM has been successfully used in various fields [2], [3], [4], [5], [6], in many real-world applications there is not enough labeled data to train a good classification model. Compared to the standard SVM, which uses only labeled training data, many semi-supervised SVMs employ unlabeled data along with some labeled data to train classifiers with improved generalization and performance. Semi-supervised SVM has received much attention for two reasons. First, labeling a large number of examples is time-consuming and labor-intensive; the task must also be carried out by qualified experts and is therefore expensive. Second, some studies show that using unlabeled data for learning can improve the accuracy of classifiers [7], [8]. The transductive support vector machine (TSVM) [9] is an efficient method for improving the generalization accuracy of SVM: it finds a labeling for the unlabeled data such that a linear boundary has the maximum margin on both the originally labeled data and the newly labeled unlabeled data [10].
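
The decision-value convention described above can be illustrated with a minimal sketch. The weight vector, bias, and data points below are illustrative values, not a trained model:

```python
# Sketch of the SVM decision value f(x) = <w, x> + b and the functional
# margin |f(x)|: 0 for a point on the hyper-plane, 1 for a support vector.
# The hyper-plane (w, b) and the data points are hypothetical.

def decision_value(w, b, x):
    """Decision function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, 0.0], -1.0          # hypothetical separating hyper-plane
on_plane = [1.0, 5.0]            # lies on the hyper-plane: f(x) = 0
support = [2.0, 3.0]             # functional margin 1: a support vector

print(decision_value(w, b, on_plane))  # 0.0
print(decision_value(w, b, support))   # 1.0
```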

The notable characteristic of TSVM, being transductive, is that it targets learning problems in which only a particular dataset of testing or working (training) data is of interest [9], [11], whereas traditional inductive learning estimates a classifier from training data that should generalize to any input example. The main idea of transductive learning is to build models for the best prediction performance on a particular testing dataset instead of developing generalized models to be applied to any testing dataset [12]. In other words, by explicitly including the working dataset of unlabeled examples in the problem formulation, better generalization can be achieved on problems with insufficient labeled data points [13]. One of the most common problems is that the machine may incorrectly label the training dataset, which leads to classification error. The solution to this problem lies in active learning.

Active learning (AL) is a technique for selecting a small subset of the unlabeled data such that labeling the subset maximizes learning accuracy. The selected subset is manually labeled by experts. In this way, AL can complement TSVM by reducing labeling errors [14]. In this paper, we explore combining TSVM with AL to improve the performance of the classification task. The major contributions of our work are:

  • 1.

    In the learning process, TSVM exploits a large amount of unlabeled data whose geometrical distribution carries useful information. To capture the geometrical structure of the data, we define the graph Laplacian L over a neighborhood graph of the samples. In this way, the algorithm can explore the data manifold structure by adding a regularization term that penalizes any “abrupt changes” of the function values evaluated on neighboring samples in the Laplacian graph.

  • 2.

    Active learning uses a query framework in which an active learner queries instances for labeling. Our algorithm adopts the version space minimum–maximum division principle as the selection criterion to achieve the best labeling results. The selected most informative instance is considered the most likely support vector, which reduces the learning cost by discarding non-support vectors. At the same time, the method supports a batch-sampling mode, which improves training efficiency. Overall, this method achieves a more significant improvement for the same amount of human effort and produces desirable results with considerably fewer labeled data.
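
The graph-Laplacian smoothness penalty behind the first contribution can be sketched as follows. The 3-node adjacency matrix W is hypothetical; the sketch only demonstrates the identity fᵀLf = ½ Σᵢⱼ Wᵢⱼ (fᵢ − fⱼ)², which is why the term penalizes abrupt changes of f across neighboring samples:

```python
# Minimal sketch of the graph-Laplacian smoothness penalty used for
# manifold regularization. The adjacency matrix W below is hypothetical.

def laplacian(W):
    """L = D - W, where D is the diagonal degree matrix of W."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def smoothness(f, W):
    """f^T L f, equal to 0.5 * sum_ij W[i][j] * (f[i] - f[j])**2."""
    L = laplacian(W)
    n = len(W)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

W = [[0.0, 1.0, 0.0],   # nodes 0 and 1 are neighbors
     [1.0, 0.0, 1.0],   # node 1 also neighbors node 2
     [0.0, 1.0, 0.0]]

smooth_f = [1.0, 1.0, 1.0]   # constant on the graph -> zero penalty
rough_f = [1.0, -1.0, 1.0]   # flips sign across every edge -> large penalty
print(smoothness(smooth_f, W))  # 0.0
print(smoothness(rough_f, W))   # 8.0
```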

The rest of this paper is organized as follows. Section 2 describes the concepts of TSVM, active learning, and graph-based methods. Section 3 reviews some of the issues in TSVM and active learning. Section 4 introduces the proposed algorithm. Section 5 reports experimental results on several UCI datasets and the book reviews dataset, and further analyzes the underlying reasons for the algorithm's behavior. Finally, Section 6 concludes and highlights future work.

Section snippets

Transductive SVM

Transductive SVM (TSVM) is a semi-supervised large-margin classification method based on the low-density separation assumption. Similar to the traditional SVM, TSVM searches for a hyper-plane with the largest margin to separate the classes, while simultaneously taking both labeled and unlabeled examples into account. Detailed descriptions and proofs of the concepts can be found in [9].

Given a set of independent and identically distributed labeled examples $\{(x_1, y_1), \ldots, (x_l, y_l)\} \subset \mathbb{R}^n \times \mathbb{R}$, with $y_i \in \{-1, +1\}$ for $i = 1, \ldots, l$, and $u$
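
The TSVM objective in the Joachims formulation combines the usual margin and labeled hinge-loss terms with a hinge penalty on |f(x)| for unlabeled points, pushing them away from the boundary (the low-density separation assumption). A minimal sketch, with illustrative data points and hyper-parameters rather than the paper's settings:

```python
# Hedged sketch of a Joachims-style TSVM objective:
#   0.5*||w||^2 + C * sum_i hinge(y_i * f(x_i))   over labeled data
#               + C_u * sum_j hinge(|f(x_j)|)     over unlabeled data
# All data points and hyper-parameters below are illustrative.

def f(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge(m):
    return max(0.0, 1.0 - m)

def tsvm_objective(w, b, labeled, unlabeled, C=1.0, C_u=0.1):
    margin_term = 0.5 * sum(wi * wi for wi in w)
    labeled_loss = C * sum(hinge(y * f(w, b, x)) for x, y in labeled)
    # Unlabeled points are penalized for lying near the boundary,
    # whichever side of it they fall on.
    unlabeled_loss = C_u * sum(hinge(abs(f(w, b, x))) for x in unlabeled)
    return margin_term + labeled_loss + unlabeled_loss

labeled = [([2.0, 0.0], +1), ([-2.0, 0.0], -1)]
unlabeled = [[0.1, 0.0], [3.0, 1.0]]
# 0.5 (margin) + 0 (labeled points well classified) + 0.1*0.9 (near point)
print(tsvm_objective([1.0, 0.0], 0.0, labeled, unlabeled))  # 0.59
```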

Characteristics of active learning

Active learning (AL) is well motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. It is an interactive learning technique designed to reduce the labor cost of labeling, in which the learning algorithm chooses which unlabeled examples are added to the training set. The main idea is to select the most informative examples and ask an expert for their labels in successive learning rounds. The strategy of AL is to select a most
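
A common query rule for SVM-based active learning is margin-based uncertainty sampling: ask the expert to label the unlabeled point closest to the current decision boundary. Note that the paper's own criterion is the version space minimum–maximum division principle; the rule below is only a simpler, standard baseline shown for illustration, with a hypothetical model and pool:

```python
# Margin-based uncertainty sampling: query the unlabeled sample with the
# smallest |f(x)|, i.e. the one nearest the current hyper-plane. This is
# a standard baseline, not the paper's version-space selection principle.

def f(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def query_most_uncertain(w, b, unlabeled):
    """Return the index of the unlabeled point nearest the hyper-plane."""
    return min(range(len(unlabeled)),
               key=lambda j: abs(f(w, b, unlabeled[j])))

w, b = [1.0, 0.0], 0.0                    # hypothetical current model
pool = [[3.0, 1.0], [0.2, -1.0], [-2.5, 0.4]]
print(query_most_uncertain(w, b, pool))   # 1 (|f| = 0.2 is the smallest)
```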

Combining transductive SVM with active learning

In this section we present a new semi-supervised learning algorithm based on active learning (AL) that combines TSVM with AL, called the ALTSVM algorithm. First, to explore the data manifold structure, we add a regularization term that penalizes any “abrupt changes” of the function values evaluated on neighboring samples in the Laplacian graph. Second, we propose a new unlabeled sample selection principle for AL, called the version space minimum–maximum division principle. Third, we describe the
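
The overall train–query–annotate–retrain cycle of such a combination can be sketched at a high level. In this sketch, `train` and `query` are hypothetical stand-ins for the paper's manifold-regularized TSVM and its version-space minimum–maximum selection, and `oracle` simulates the human annotator; the toy instantiation is illustrative only:

```python
# High-level skeleton of a combined TSVM + active-learning loop.
# `train`, `query`, and `oracle` are hypothetical stand-ins, not the
# paper's actual components.

def altsvm_loop(labeled, unlabeled, oracle, train, query, rounds=3):
    model = train(labeled, unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        j = query(model, unlabeled)        # pick most informative sample
        x = unlabeled.pop(j)
        labeled.append((x, oracle(x)))     # human provides the label
        model = train(labeled, unlabeled)  # retrain with the new label
    return model, labeled

# Toy instantiation: a 1-D threshold "model" and a sign oracle.
train = lambda lab, unlab: sum(x[0] for x, _ in lab) / len(lab)
query = lambda m, unlab: min(range(len(unlab)),
                             key=lambda j: abs(unlab[j][0] - m))
oracle = lambda x: 1 if x[0] >= 0 else -1

model, labeled = altsvm_loop([([2.0], 1), ([-2.0], -1)],
                             [[0.5], [4.0], [-3.0]], oracle, train, query)
print(len(labeled))  # 5: all three pool points were queried and labeled
```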

Experimental results and analysis

To evaluate the performance of the proposed algorithm, we conduct a set of experiments comparing it with several state-of-the-art active learning methods on benchmark UCI datasets [32] and on a book reviews dataset as a real-world application.

Conclusions

In this paper, we proposed to solve the problems of the transductive support vector machine (TSVM) that stem from the preset number of positive class samples N. Presetting N correctly before training the TSVM is very difficult and therefore leads to considerable estimation error, especially when the number of labeled examples is very small. To avoid using more unlabeled examples in a naive way, we suggested active learning. Studies have found no correlation between using more unlabeled examples

Acknowledgements

This work was supported by the National Key Basic Research Program (973) of China under Grant no. 2013CB328903, the National Science Foundations of China under Grant nos. 61379158 and 71301177, and the Basic and Advanced Research Program of Chongqing under Grant nos. cstc2013jcyjA1658 and cstc2014jcyjA40054.

Xibin Wang is a PhD student in College of Computer Science at Chongqing University, China. He received his MS degree in computer science from Guizhou University in 2012. His research focuses on computational intelligence, data mining and business intelligence, and machine learning.

References (40)

  • C. Constantinopoulos et al.

    Semi-supervised and active learning with the probabilistic RBF classifier

    Neurocomputing

    (2008)
  • C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Min. Knowl. Discov.

    (1998)
  • K. Bennett et al.

    Semi-supervised support vector machines

    Adv. Neural Inf. Process. Syst.

    (1999)
  • G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of the Seventeenth...
  • T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of...
  • Y. Tian et al.

    Recent advances on support vector machines research

    Technol. Econ. Dev. Econ.

    (2012)
  • E. Kondratovich et al.

    Transductive support vector machines: promising approach to model small and unbalanced datasets

    Mol. Inf.

    (2013)
  • A. Gammerman, V. Vovk, V. Vapnik, Learning by transduction, in: Proceedings of the Fourteenth Conference on Uncertainty...
  • B. Settles, Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of...
  • X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised learning using gaussian fields and...

    Junhao Wen received his BS, MS and PhD degrees in computer science from Chongqing University, China, in 1991, 1999 and 2008, respectively. Currently, he is a professor and PhD supervisor at Chongqing University. His research focuses on recommended system, data mining and business intelligence, and machine learning.

    Shafiq Alam received his PhD degree from University of Auckland, New Zealand. He is currently a postdoctoral research fellow at the Department of Computer Science, University of Auckland. He has published in International Journals of high repute, A* ranked conferences, and edited a book in his research area. He has been also a general chair of a workshop, and served on the program committee of various conferences. His research interests include computational intelligence, shilling and fraud detection in recommender systems, data mining, and decision support systems.

    Zhuo Jiang received his BS degree from the Mathematics department of Xinjiang University, China, in 2008 and his MS degree in computer science from Chongqing University, China, in 2010. Currently, he is a PhD candidate at Chongqing University, China. His research interest includes AI planning, Web service and data mining.

    Yingbo Wu received his BS, MS and PhD degrees in Computer Science from Chongqing University, China, in 2000, 2003, and 2012, respectively. Currently, he is a professor and MS supervisor at Chongqing University. His research focuses on distributed and intelligent computing, service systems engineering, field engineering and industry information.
