Knowledge-Based Systems

Volume 152, 15 July 2018, Pages 200-214

Revisiting transductive support vector machines with margin distribution embedding

https://doi.org/10.1016/j.knosys.2018.04.017

Abstract

Transductive Support Vector Machine (TSVM) is one of the most successful classification methods for semi-supervised learning (SSL). One challenge of TSVMs is the performance degeneration caused by unlabeled examples that are obscure or misleading for the discovery of the underlying distribution. To address this problem, we disclose the underlying data distribution and describe the margin distribution of TSVMs by the first-order (margin mean) and second-order (margin variance) statistics of the examples. Since the optimization problems of TSVMs are not convex, we utilize the concave-convex procedure (CCCP) and a variant of the stochastic variance reduced gradient (SVRG) method to solve them. In particular, we propose two specific algorithms that optimize the margin distribution of TSVM by maximizing the margin mean and minimizing the margin variance simultaneously, which improves the generalization ability and makes the classifier robust to outliers and noise. In addition, we derive a bound on the expectation of error based on the leave-one-out cross-validation estimate, which is an unbiased estimate of the probability of test error. Finally, to validate the effectiveness of the proposed method, extensive experiments are conducted on diverse datasets. The experimental results demonstrate that the proposed algorithms outperform existing TSVMs and other semi-supervised learning methods.

Introduction

It is quite easy to obtain large amounts of unlabeled data in many practical settings (e.g., document classification, image classification, speech recognition), but labeled data are fairly expensive because labeling requires human effort. For example, in the era of big data, people create and share data on social media using digital devices (e.g., cameras, smartphones, video recorders). However, most of these data are unlabeled. Clearly, if we only use a small amount of “expensive” labeled data, the learning system modeled from the training data cannot achieve strong generalization performance. Meanwhile, ignoring the large amount of “cheap” unlabeled data is a huge waste of data resources [1], [2], [3].

Semi-Supervised Learning (SSL) aims to make use of unlabeled data for training, typically a small set of labeled data together with a large collection of unlabeled data. It has been extensively studied in the literature [4], [5]. Transductive Support Vector Machine (TSVM) is one of the most successful classification methods for SSL. The purpose of learning is to achieve the best generalization performance by determining the margin classification boundary over all the labeled and unlabeled examples. Early implementations of TSVM, such as [6] and [7], involve high computational complexity and cannot handle large-scale datasets. The concave-convex procedure (CCCP) for TSVM [8] performs better than previous works but is still not effective for very large-scale datasets. There exist several implementations of TSVMs for efficient computation [5], [9] and regularization methods for improving the generalization ability of TSVMs [10], [11], [12]. Recent work [13] found that the performance degeneration of TSVMs is caused by unlabeled examples that are obscure or misleading for the discovery of the underlying distribution. In addition, existing SSL methods are mostly based on cluster or manifold assumptions [14], [15], under which nearby examples (or examples on the same structure) are likely to share the same label; recent years have witnessed many works built on these assumptions [16], [17], [18]. According to [19], the performance degeneration of SSL is caused by incorrect model assumptions, because fitting unlabeled data under an incorrect model assumption misleads the learning process, and it is very difficult to make a correct model assumption without sufficient domain knowledge [13]. For graph-based SSL methods, the basic idea is label propagation, which propagates labels over a weighted graph according to the distribution of both labeled and unlabeled data; however, constructing a good graph in general situations remains an open problem. Co-training and its extensions [20], [21], [22] usually generate pseudo-labels for unlabeled data during the training process, but incorrect pseudo-labels may disrupt learning. Other methods use pairwise constraints to connect labeled and unlabeled examples; however, when the number of labeled examples is too limited, the advantage of pairwise constraints over class labels disappears. Moreover, large-scale datasets contain noise and outliers, which may lead to error-prone classification and severe accuracy degeneration.

Margin theory was proposed by Vapnik [23]. It provides a good theoretical explanation for why SVMs achieve strong generalization performance and also helps explain AdaBoost. In addition, early works [24], [25] contributed to understanding why AdaBoost can frequently resist overfitting. For example, Reyzin et al. [26] speculated that the margin distribution plays a key role in the generalization performance of AdaBoost and suggested using the average or median margin as a measure for comparing margin distributions [27]. Subsequently, Aiolli et al. [28] proposed a kernel method for optimizing the margin distribution. Gao and Zhou [29] proved the important role of the margin mean and margin variance as a description of the margin distribution, and Zhang and Zhou [30] introduced the margin distribution into SVM and proposed margin distribution optimization algorithms whose superior generalization ability was validated. In the SSL setting, one of the major utilities of unlabeled data is to disclose the underlying data distribution [5]; our method aims to embed this data distribution into the margin of TSVM. The key challenges are how to embed the data distribution into the margin of TSVM and how to study the statistical properties of the resulting method. In addition, the optimization techniques still need to be improved to reduce the computational cost. Moreover, many works on semi-supervised clustering and classification have been proposed [15], [31], [32], [33]. The connection between these methods and ours is that all of them aim to estimate or disclose the reliability of unlabeled examples; the difference is that our method discloses the underlying data distribution and embeds it into the margin of TSVM, whereas other methods estimate the similarity between labeled and unlabeled examples.

Motivations. It is often time-consuming to obtain a large number of labeled examples because, in many practical applications, the labeling process requires human effort and expertise. SSL methods try to learn from limited labeled examples; however, most existing SSL methods leverage unlabeled examples by making model assumptions, using label propagation, or generating pseudo-labels during the learning process. In these methods, incorrect assumptions or pseudo-labels may disrupt the learning process. Moreover, one reason for the performance degeneration of SSL methods is unlabeled examples that are obscure or misleading for the discovery of the underlying distribution [13]. Another consideration is that large-scale datasets contain noise and outliers, which may lead to error-prone classification and severe accuracy degeneration. Unlike TSVM, which always maximizes a single-point (minimum) margin on the dataset, our method discloses the underlying data distribution and embeds it into the margin of TSVM, which has been proved important for improving generalization [29], [30] and for robustness to outliers and noise. Here, we give an illustration of the role of the margin distribution based on our experimental results. In Fig. 1, positive/negative examples are denoted by +/- and test examples are denoted by dots. The maximum-margin hyperplanes obtained by SVM and TSVM are shown as the dashed line and the solid line, respectively. The line of our method (TSVMMD) shows a hyperplane chosen by embedding the margin distribution of the training data, and the examples around the margin are classified accurately. Clearly, the hyperplane of TSVMMD is more robust than that of TSVM.

Our work is inspired by the theoretical results of [29], [30], which proved that optimizing the margin distribution is more important than maximizing the minimum margin for achieving better generalization performance. It takes advantage of unlabeled examples to determine the data distribution and optimizes the margin distribution of TSVM by maximizing the margin mean and minimizing the margin variance simultaneously; the margin statistics involved are recalled below. Since the objective functions of TSVM are not convex, we use CCCP and a variant of the stochastic variance reduced gradient (SVRG) method [34] to solve the optimization problems; these techniques have been successfully applied to sparse SVMs, large-scale TSVMs [8], sparse PCA, and neural networks. Moreover, we analyze the theoretical results by studying the statistical properties of our method. The main contributions of this paper are summarized as follows.
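For reference, the margin statistics mentioned above follow [29], [30] (stated here in our own notation; the normalization used later in the paper may differ slightly). Given m labeled examples (x_i, y_i) with y_i in {-1, +1}, a feature mapping φ and a weight vector w, the margin of an example, the margin mean and the margin variance are

\[
\gamma_i = y_i\, w^{\top}\phi(x_i), \qquad
\bar{\gamma} = \frac{1}{m}\sum_{i=1}^{m}\gamma_i, \qquad
\hat{\gamma} = \frac{1}{m}\sum_{i=1}^{m}\left(\gamma_i - \bar{\gamma}\right)^{2}.
\]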

  • We propose a novel semi-supervised learning method that discloses the underlying data distribution. The first-order (margin mean) and second-order (margin variance) statistics of the examples are embedded into TSVM, and we optimize the margin distribution of TSVM by maximizing the margin mean and minimizing the margin variance simultaneously, which improves the generalization ability and makes the classifier robust to outliers and noise.

  • We derive a bound on the expectation of error based on the leave-one-out cross-validation estimate, which is an unbiased estimate of the probability of test error (the classical relation behind this estimate is recalled after this list). CCCP and a variant of SVRG are utilized to solve the optimization problem of TSVMs with margin distribution embedding, which can handle large-scale datasets effectively and exhibit better convergence during training.

  • By taking advantage of specific statistical properties of the examples, our method is competitive with existing TSVMs and other semi-supervised approaches. The experimental studies on both small and large-scale datasets demonstrate that the proposed method is effective and efficient.
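As background for the leave-one-out bound in the second contribution, the classical Luntz-Brailovsky relation (see also [23]) states that the expected leave-one-out error over l training examples is an unbiased estimate of the expected test error of a machine trained on l - 1 examples; the specific bound derived in this paper builds on this fact but is not reproduced here:

\[
E\!\left[p_{\mathrm{error}}^{\,l-1}\right] \;=\; \frac{E\!\left[\mathcal{L}(z_1,\ldots,z_l)\right]}{l},
\]

where \mathcal{L}(z_1,\ldots,z_l) is the number of leave-one-out errors committed on the training sample z_1,\ldots,z_l.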

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents TSVM with margin distribution embedding. Section 4 gives the optimization methods (CCCP and a variant of SVRG) for our approach and analyzes its convergence and statistical properties. Extensive experimental results are provided and analyzed in Section 5. Finally, Section 6 concludes this work and discusses future directions.

Section snippets

Related work

In this section, we briefly review related research on transductive support vector machines and semi-supervised learning.

Transductive support vector machine with margin distribution

In this section, we revisit TSVMs with margin distribution embedding, denoted by TSVMMD. First, we define the margin mean and margin variance as the margin distribution; then we optimize the margin distribution of TSVM by maximizing the margin mean and minimizing the margin variance simultaneously. Since the optimization problem of TSVM is not convex, CCCP and a variant of SVRG are utilized to solve the optimization problem for large-scale TSVMs. Before giving the formulation of …
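Since this snippet is truncated before the formulation, the following is only a plausible sketch of how the margin-distribution terms can be combined with the standard TSVM losses; the trade-off parameters λ1, λ2, C_l, C_u and the exact form are our own illustrative assumptions, not necessarily the objective used in the paper:

\[
\min_{w}\;\; \frac{1}{2}\lVert w\rVert^{2} \;-\; \lambda_{1}\,\bar{\gamma} \;+\; \lambda_{2}\,\hat{\gamma}
\;+\; C_{l}\sum_{i=1}^{l}\max\!\left(0,\,1 - y_i\, w^{\top}\phi(x_i)\right)
\;+\; C_{u}\sum_{j=l+1}^{l+u}\max\!\left(0,\,1 - \lvert w^{\top}\phi(x_j)\rvert\right),
\]

where \bar{\gamma} and \hat{\gamma} are the margin mean and margin variance defined earlier, and the last (symmetric hinge) term over the unlabeled examples is non-convex, which is what motivates the CCCP/SVRG treatment in the next section.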

Optimization for TSVMMD

In this section, we give the optimization methods for our approach. First, we present CCCP and a variant of SVRG for TSVMMD. Then, we analyze the convergence and complexity of the proposed algorithms. Finally, we discuss the advantages and disadvantages of our method.
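As a concrete illustration of how the two procedures interact, the sketch below implements a minimal CCCP-plus-SVRG loop for a simplified linear model: the outer CCCP iterations fix the signs of the unlabeled outputs (linearizing the concave part of the symmetric hinge), and the inner loop performs variance-reduced (sub)gradient updates on the resulting convex subproblem. All names, default parameters, and the use of plain subgradients are our own simplifications, not the authors' exact algorithm.

```python
import numpy as np

def dhinge(z):
    """Subgradient of the hinge loss max(0, 1 - z) with respect to z."""
    return -1.0 if z < 1.0 else 0.0

def cccp_tsvm_md(Xl, yl, Xu, lam1=0.1, lam2=0.1, Cl=1.0, Cu=0.1,
                 reg=1e-3, outer=5, epochs=3, step=1e-2, seed=0):
    """CCCP outer loop with an SVRG-style inner loop for a linear model (illustrative)."""
    rng = np.random.default_rng(seed)
    l, u, d = len(yl), len(Xu), Xl.shape[1]
    n = l + u
    mu = (yl[:, None] * Xl).mean(axis=0)      # mean of y_i * x_i over labeled data
    w = np.zeros(d)

    def grad_i(w, i, s):
        """(Sub)gradient of the i-th term of the CCCP-convexified objective."""
        g = reg * w                           # L2 regularization
        if i < l:                             # labeled example
            x, y = Xl[i], yl[i]
            m = y * (x @ w)                   # its margin
            g += Cl * dhinge(m) * y * x       # hinge loss
            g -= lam1 * y * x                 # maximize the margin mean
            v = y * x - mu                    # minimize the margin variance:
            g += 2.0 * lam2 * (v @ w) * v     #   sum_i ((y_i x_i - mu)^T w)^2
        else:                                 # unlabeled example
            x, sj = Xu[i - l], s[i - l]
            t = x @ w
            # Convex part H(t) + H(-t) of the symmetric hinge, minus the
            # CCCP linearization sj * t of its concave part -(1 + |t|).
            g += Cu * (dhinge(t) - dhinge(-t) - sj) * x
        return g

    for _ in range(outer):                    # CCCP iterations
        s = np.sign(Xu @ w)                   # fix the signs of unlabeled outputs
        s[s == 0] = 1.0
        for _ in range(epochs):               # SVRG epochs on the convex subproblem
            w_snap = w.copy()
            full = np.mean([grad_i(w_snap, i, s) for i in range(n)], axis=0)
            for i in rng.permutation(n):      # variance-reduced updates
                w = w - step * (grad_i(w, i, s) - grad_i(w_snap, i, s) + full)
    return w
```

A typical call would be w = cccp_tsvm_md(Xl, yl, Xu), with Xl and Xu as NumPy feature matrices and yl a NumPy array of labels in {-1, +1}; unseen points are then classified by the sign of X_test @ w.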

Experimental study

Extensive experiments are conducted to validate the various algorithms on both small and large-scale datasets. To validate the effectiveness of our algorithms, we study the convergence of our method. Since our method involves several hyper-parameters, we also study how these parameters affect its performance. In the last subsection, we compare the training time of TSVMMD with that of other semi-supervised learning methods.

Conclusion

This paper revisits TSVMs with margin distribution embedding. By taking advantage of specific statistical properties of the examples, our method is robust to outliers and noise. It optimizes the margin distribution of TSVM by maximizing the margin mean and minimizing the margin variance simultaneously to improve the generalization ability. In addition, the CCCP and SVRG-variant optimization methods can solve the non-convex problem efficiently rather than at high computational complexity …

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported in part by the National Natural Science Foundation of China under Grant 61170035 and 61272420, Six talent peaks project in Jiangsu Province (Grant No. 2014 WLW-004), the Fundamental Research Funds for the Central Universities (Grant No. 30920130112006), Jiangsu Province special funds for transformation of science and technology achievement (Grant No. BA2013047), the

References (73)

  • X. Zhu et al.

    Introduction to semi-supervised learning

    Synth. Lect. Artif. Intell. Mach. Learn.

    (2009)
  • O. Chapelle et al.

    Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book Reviews]

    IEEE Trans. Neural Networks

    (2009)
  • T. Joachims

    Transductive inference for text classification using support vector machines

    ICML

    (1999)
  • T. Joachims

    Making large scale SVM learning practical

    Technical Report

    (1999)
  • R. Collobert et al.

    Large scale transductive SVMs

    J. Mach. Learn. Res.

    (2006)
  • O. Chapelle et al.

    Semi-supervised classification by low density separation.

    AISTATS

    (2005)
  • Z. Xu et al.

    Efficient convex relaxation for transductive support vector machine

    Advances in neural information processing systems

    (2008)
  • Z. Xu et al.

    Adaptive regularization for transductive support vector machine

    Advances in Neural Information Processing Systems

    (2009)
  • P.P. Talukdar et al.

    New regularized algorithms for transductive learning

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases

    (2009)
  • Y.-F. Li et al.

    Towards making unlabeled data never hurt

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • Z. Zhang et al.

    Graph based constrained semi-supervised learning framework via label propagation over adaptive neighborhood

    IEEE Trans. Knowl. Data Eng.

    (2015)
  • R.G. Soares et al.

    Semisupervised classification with cluster regularization

    IEEE Trans. Neural Netw. Learn. Syst.

    (2012)
  • M. Belkin et al.

    Manifold regularization: a geometric framework for learning from labeled and unlabeled examples

    J. Mach. Learn. Res.

    (2006)
  • K. Zhang et al.

    Scaling up graph-based semisupervised learning via prototype vector machines

    IEEE Trans. Neural Netw. Learn. Syst.

    (2015)
  • L. Bing, W.W. Cohen, B. Dhingra, Using graphs of classifiers to impose declarative constraints on semi-supervised...
  • F.G. Cozman et al.

    Semi-supervised learning of mixture models

    Proceedings of the 20th International Conference on Machine Learning (ICML-03)

    (2003)
  • A. Blum et al.

    Combining labeled and unlabeled data with co-training

    Proceedings of the Eleventh Annual Conference on Computational Learning Theory

    (1998)
  • Z.-H. Zhou et al.

    Tri-training: exploiting unlabeled data using three classifiers

    IEEE Trans. Knowl. Data Eng.

    (2005)
  • X. Xu et al.

    Co-labeling for multi-view weakly labeled learning

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • V. Vapnik

    The Nature of Statistical Learning Theory

    (2013)
  • J.R. Quinlan

    Induction of decision trees

    Mach. Learn.

    (1986)
  • L. Breiman

    Prediction games and arcing algorithms

    Neural Comput.

    (1999)
  • L. Reyzin et al.

    How boosting the margin can also boost classifier complexity

    Proceedings of the 23rd International Conference on Machine Learning

    (2006)
  • A. Garg et al.

    Margin distribution and learning algorithms

    Proceedings of the Twentieth International Conference on Machine Learning (ICML)

    (2003)
  • F. Aiolli et al.

    A kernel method for the optimization of the margin distribution

    Artif. Neural Netw.-ICANN 2008

    (2008)
  • W. Gao et al.

    On the doubt about margin explanation of boosting

    Artif. Intell.

    (2013)