A fast quasi-Newton method for semi-supervised SVM
Introduction
In the last few years, remarkable research work has been done in supervised learning. Most of these learning models apply the inductive inference concept, where a prediction function, derived only from the labeled input data, is used to predict the label of any unlabeled object. A well-known two-class classification technique is based on the support vector machine (SVM) Burges [1], where an SVM solution corresponds to the maximum margin between the two classes of labeled objects under consideration. But in many applications labeled data are scarce, and manual labeling for the purpose of training SVMs is often a slow, expensive, and error-prone process. On the other hand, in many applications of machine learning and data mining, abundant amounts of unlabeled data can be cheaply and automatically collected. Some examples are text processing, web categorization, medical diagnosis, and bioinformatics. In spite of its natural and pervasive need, solutions to the problem of utilizing unlabeled data together with labeled examples have only recently emerged in the machine learning literature. Using both labeled and unlabeled data for the purpose of learning is called semi-supervised learning. An interested reader can refer to Zhu [19] for a nice review of semi-supervised learning.
A major body of work in semi-supervised SVMs (S3VM) is based on the following idea Chapelle et al. [2]: solve the standard SVM problem while treating the unknown labels as additional optimization variables. By maximizing the margin in the presence of unlabeled data, one learns a decision boundary that passes through low data-density regions while respecting the labels in the input space. In other words, this approach implements the cluster assumption for semi-supervised learning, that is, points in the same data cluster are likely to have the same class label.
S3VM might seem to be the perfect semi-supervised algorithm, since it combines the powerful regularization of SVMs with a direct implementation of the cluster assumption. However, its main drawback is that the objective function is nonconvex and thus is difficult to optimize. For this reason, a wide spectrum of techniques has been applied to solve the nonconvex optimization problem associated with S3VMs, for example, local combinatorial search Joachims [3]; gradient descent Chapelle et al. [2]; continuation techniques Chapelle et al. [15]; convex–concave procedures Fung et al. [9]; branch-and-bound algorithms Bennett et al. [8].
In this work, we propose to use the S3VM to solve semi-supervised classification problems. In particular, we adopt the model described in Joachims [3], focusing on the specific features of the optimization problem to be solved, which can be formulated as a nonsmooth nonconvex minimization problem. To tackle this problem, we use a quasi-Newton method described in Yu et al. [6]. The main contributions made by us and reported in this work are:
- 1.
We outline an implementation of a variant of S3VM Joachims [3] designed for linear semi-supervised classification on large datasets. As compared to state-of-the-art large scale semi-supervised learning techniques described in Sindhwani et al. [5], [10], our method effectively exploits data sparsity and linearity of the problem to provide superior scalability. The improved generalization performance and training speed turn the proposed scheme into a feasible tool for large scale applications.
- 2.
We outline an implementation of a variant of S3VM Joachims [3]; Collobert et al. [10] designed for non-linear semi-supervised classification on datasets where it is difficult to find a linear decision boundary in the input space.
- 3.
We conducted an experimental study on many binary classification tasks with several thousands of examples and features. This study clearly shows the usefulness of our algorithm for large scale semi-supervised classification.
- 4.
In summary, this work presents an efficient scheme for semi-supervised learning based on both linear and non-linear SVMs.
Throughout the paper, we adopt the following notation: we denote by ||·|| the Euclidean norm in R^n and by a^T b or <a, b> the inner product of the vectors a and b. Moreover, the subdifferential of a convex function f at any point a is denoted by ∂f(a). We recall that the subdifferential of a convex function f at a point a is the set of the subgradients of f at a, that is, the set of vectors s satisfying the subgradient inequality f(b) >= f(a) + s^T (b - a) for all b.
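The subgradient inequality above can be checked numerically on a simple nonsmooth convex function. The sketch below (illustrative, not from the paper) uses the hinge loss, whose kink at z = 1 is exactly the situation where a subgradient replaces the gradient:

```python
import numpy as np

def hinge(z):
    """Convex, nonsmooth: f(z) = max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def hinge_subgradient(z):
    """One valid subgradient of the hinge loss at z.
    At the kink z = 1 any value in [-1, 0] is a subgradient; we pick 0."""
    return -1.0 if z < 1.0 else 0.0

# Verify the subgradient inequality f(b) >= f(a) + s * (b - a)
# on a grid of points, including the kink at z = 1.
for a in np.linspace(-2.0, 2.0, 9):
    s = hinge_subgradient(a)
    for b in np.linspace(-2.0, 2.0, 9):
        assert hinge(b) >= hinge(a) + s * (b - a) - 1e-12
```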
Semi-supervised SVMs
We consider the problem of binary classification. The training set consists of l labeled examples {(x_i, y_i)}_{i=1}^{l}, with y_i in {-1, +1}, and u unlabeled examples {x_i}_{i=l+1}^{n}, with n = l + u; typically, u >> l. Our goal is to construct a classifier that utilizes unlabeled data and gives better generalization performance.
S3VM appends an additional term in the SVM objective function whose role is to find a hyperplane far away from both the labeled and the unlabeled points. Variants of this idea have appeared
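The additional term can be made concrete with a small sketch. The code below is an illustration of the standard S3VM objective (regularizer + hinge loss on labeled points + a symmetric hinge on unlabeled points that penalizes unlabeled data falling inside the margin); the function name and the penalty weights `C`, `C_star` are illustrative choices, not the paper's notation:

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=1.0):
    """Nonconvex, nonsmooth S3VM objective:
    0.5||w||^2 + C * hinge on labeled points
    + C_star * max(0, 1 - |w.x + b|) on unlabeled points.
    The last term pushes the decision boundary away from unlabeled data."""
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    reg = 0.5 * (w @ w)
    labeled_loss = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()
    unlabeled_loss = np.maximum(0.0, 1.0 - np.abs(f_unl)).sum()
    return reg + C * labeled_loss + C_star * unlabeled_loss

# Toy check: two well-separated labeled points and one unlabeled
# point inside the margin, which incurs a penalty of 0.5.
w = np.array([1.0, 0.0])
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.5, 0.0]])
value = s3vm_objective(w, 0.0, X_lab, y_lab, X_unl)  # 0.5 (reg) + 0 + 0.5
```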
Quasi-Newton approach for nonsmooth convex function
The (L)BFGS quasi-Newton method uses an approximation to the inverse Hessian in place of the true inverse that is required in Newton's method. The approximate Hessian is built up on the basis of information gathered during the descent process. (L)BFGS forms a local quadratic model of a smooth objective function f at the current iterate w_t: M_t(q) = f(w_t) + ∇f(w_t)^T q + (1/2) q^T H_t^{-1} q, where H_t is a symmetric positive definite matrix (the current approximation to the inverse Hessian). Minimizing M_t(q) gives the quasi-Newton direction q_t = -H_t ∇f(w_t).
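In L-BFGS the product H_t ∇f(w_t) is never formed explicitly; it is computed from a short history of iterate and gradient differences via the standard two-loop recursion. The sketch below shows that recursion under the usual conventions (variable names are illustrative); with an empty history it reduces to steepest descent:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: implicitly applies the inverse-Hessian
    approximation H_t to the gradient, returning the direction -H_t * grad.
    s_list, y_list hold recent pairs (w_{k+1} - w_k, g_{k+1} - g_k)."""
    q = grad.astype(float).copy()
    history = []
    # First loop: newest pair to oldest.
    for s, y in reversed(list(zip(s_list, y_list))):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        history.append((alpha, rho, s, y))
    # Scale by the standard initial guess gamma = s^T y / y^T y.
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    # Second loop: oldest pair to newest.
    for alpha, rho, s, y in reversed(history):
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q
```

With no curvature pairs stored, `lbfgs_direction(g, [], [])` returns `-g`; with pairs satisfying the curvature condition y^T s > 0, the returned vector is a descent direction (grad @ direction < 0).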
Solving S3VM using subLBFGS
In this section we show how subLBFGS can be applied to problem (1). A similar approach can be used for problem (4). Differentiating Eq. (1) (using the concept of subgradient) after plugging in l1, l2 yields an expression in which E, M, and W denote the sets of points which are in error, on the margin, and
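The partition into error (E), on-margin (M), and well-classified (W) points can be illustrated on the plain L1-hinge SVM objective. The sketch below is not the paper's derivation (which concerns the S3VM objective), only a minimal example of how one valid subgradient is assembled from such a partition; the choice of zero contribution for margin points is one of the admissible subgradients:

```python
import numpy as np

def svm_subgradient(w, X, y, C=1.0, tol=1e-9):
    """One subgradient of J(w) = 0.5||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).
    Points are split by their margin value y_i w.x_i into:
      E (in error,      margin < 1): contribute -C * y_i * x_i,
      M (on the margin, margin = 1): any value in the subdifferential is
                                     valid; here we pick the 0 contribution,
      W (well-classified, margin > 1): contribute 0."""
    margins = y * (X @ w)
    E = margins < 1.0 - tol
    return w - C * (y[E][:, None] * X[E]).sum(axis=0)

# Toy check: at w = 0 every point is in E.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
g = svm_subgradient(np.zeros(2), X, y)  # = -(y_1 x_1 + y_2 x_2) = [-1, 1]
```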
Related work
Finding the exact S3VM solution is NP-hard. For this reason, major effort has focused on efficient approximation algorithms. Early algorithms Chapelle et al. [2], Bennett et al. [8], Fung et al. [9], and Chapelle et al. [15] could not handle more than a few hundred unlabeled examples or did not do so in experiments. Some of the S3VM algorithms that are applied to large scale problems, and the algorithms that we compare with our work, are briefly discussed below. For the linear case we
Linear case
Semi-supervised learning experiments were conducted to test the algorithm on five binary classification problems. The dataset characteristics are listed in Table 1. The CCAT dataset poses the problem of separating corporate-related articles; it is the top-level category in the RCV1 training dataset Lewis et al. [16]. The AUT-AVN binary classification dataset comes from a collection of UseNet articles from two discussion groups,
Discussion and conclusion
In this paper, we have applied a modified quasi-Newton method to semi-supervised classification problems, primarily in the context of S3VM. Specifically, our approach has the following major advantages:
- 1.
The S3VM classifier can be trained on a large number of labeled and unlabeled examples.
- 2.
The proposed algorithm outperforms other methods on a number of different datasets in terms of generalization performance and training time. On the AUT-AVN and ISOLET datasets, the algorithm yields accuracy
Sathish Reddy received his M.E. in Computer Science and Engineering from Indian Institute of Science, Bangalore, India, in 2009. Since September 2009, he is working as a researcher in Centre for Global Logistics and Manufacturing Strategies, Indian School of Business, Hyderabad, India. His research interests include Data Mining, Artificial Intelligence, Robotics, and Machine Learning.
References
- G. Kimeldorf, G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications (1971).
- C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998).
- O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Tenth International Workshop on ...
- T. Joachims, Transductive inference for text classification using support vector machines, in: International Conference ...
- S.S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research (2005).
- V. Sindhwani, S.S. Keerthi, Large scale semi-supervised linear SVMs, in: SIGIR, ...
- J. Yu, S.V.N. Vishwanathan, S. Günter, N.N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization, in: ...
- J. Nocedal, S.J. Wright, Numerical Optimization, Springer Series in Operations Research (1999).
- K. Bennett, A. Demiriz, Semi-supervised support vector machines, in: Advances in Neural Information Processing Systems, ...
- G. Fung, O.L. Mangasarian, Semi-supervised support vector machines for unlabeled data classification, Optimization Methods and Software (2001).
Shirish Shevade (Ph.D. in Computer Science, Indian Institute of Science (IISc), Bangalore, India) is a Principal Research Scientist at IISc. His research interests span many areas of Machine Learning such as support vector machines, Gaussian processes and semi-supervised learning. He has been on the programme committee of several international conferences such as IEEE ICDM and PAKDD. He is a Senior Member of IEEE.
Narasimha Murty received his Ph.D. from the Indian Institute of Science, Bangalore, India, in 1982. He is currently a Professor (Chairman) in the Department of Computer Science and Automation (CSA) at the Indian Institute of Science. He has guided 18 Ph.D. students in the areas of Pattern Recognition and Data Mining. He has also published around 125 papers in various journals and conference proceedings in these areas. His co-authored work, Data Clustering: A Review (ACM Computing Surveys, 1999) is among the most popular magazine and computing survey articles cited and downloaded. He is also an elected Fellow of the Indian National Academy of Engineering (FNAE). In the past, he has worked on several Indo-US projects and visited Michigan State University, East Lansing, USA and University of Dauphine, Paris. His research interests include Data Mining and Pattern Clustering.