Pattern Recognition

Volume 44, Issues 10–11, October–November 2011, Pages 2305–2313

A fast quasi-Newton method for semi-supervised SVM

https://doi.org/10.1016/j.patcog.2010.09.002

Abstract

Due to its wide applicability, semi-supervised learning is an attractive method for using unlabeled data in classification. In this work, we present a semi-supervised support vector classifier that is designed using a quasi-Newton method for nonsmooth convex functions. The proposed algorithm is well suited to problems with a very large number of examples and features. Numerical experiments on various benchmark datasets show that the proposed algorithm is fast and gives improved generalization performance over existing methods. Further, a non-linear semi-supervised SVM is proposed based on a multiple label switching scheme. This non-linear semi-supervised SVM converges faster and improves generalization performance on several benchmark datasets.

Introduction

In the last few years, remarkable research has been done in supervised learning. Most of these learning models apply the inductive inference concept, where a prediction function, derived only from the labeled input data, is used to predict the label of any unlabeled object. A well-known two-class classification technique is the support vector machine (SVM) Burges [1], where the SVM solution corresponds to the maximum margin between the two classes of labeled objects under consideration. But in many applications labeled data are scarce: manual labeling for the purpose of training SVMs is often a slow, expensive, and error-prone process. On the other hand, in many applications of machine learning and data mining, abundant amounts of unlabeled data can be collected cheaply and automatically. Some examples are text processing, web categorization, medical diagnosis, and bioinformatics. In spite of this natural and pervasive need, solutions to the problem of utilizing unlabeled data together with labeled examples have only recently emerged in the machine learning literature. Using both labeled and unlabeled data for learning is called semi-supervised learning. The interested reader can refer to Zhu [19] for a review of semi-supervised learning.

A major body of work in semi-supervised SVMs (S3VM) is based on the following idea Chapelle et al. [2]: solve the standard SVM problem while treating the unknown labels as additional optimization variables. By maximizing the margin in the presence of unlabeled data, one learns a decision boundary that traverses low data-density regions while respecting the labels in the input space. In other words, this approach implements the cluster assumption for semi-supervised learning, namely that points in the same data cluster are likely to have the same class label.
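To make this idea concrete, the resulting optimization problem can be sketched as follows. Writing $l$ for the number of labeled examples, $u$ for the number of unlabeled ones, and $\lambda, \lambda'$ for regularization weights (our notation for this sketch, in the spirit of Chapelle et al. [2] and Joachims [3]; the exact losses used in this paper are introduced later as problem (1)), a generic S3VM takes the form

$$\min_{w,\,b,\;\{y_j\}\in\{-1,+1\}^{u}}\ \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{l}\sum_{i=1}^{l}\ell\bigl(y_i(w^T x_i+b)\bigr) \;+\; \frac{\lambda'}{u}\sum_{j=l+1}^{l+u}\ell\bigl(y_j(w^T x_j+b)\bigr),$$

where $\ell$ is a margin loss such as the hinge $\ell(t)=\max(0,1-t)$. The unknown labels $y_j$ of the unlabeled points are optimized jointly with the hyperplane, which is precisely what pushes the decision boundary into low-density regions.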

S3VM might seem to be the perfect semi-supervised algorithm, since it combines the powerful regularization of SVMs with a direct implementation of the cluster assumption. However, its main drawback is that the objective function is nonconvex and thus difficult to optimize. For this reason, a wide spectrum of techniques has been applied to the nonconvex optimization problem associated with S3VMs, for example, local combinatorial search Joachims [3], gradient descent Chapelle et al. [2], continuation techniques Chapelle et al. [15], convex–concave procedures Fung et al. [9], and branch-and-bound algorithms Bennett et al. [8].

In this work, we propose to use the S3VM to solve semi-supervised classification problems. In particular, we adopt the model described in Joachims [3], focusing on the specific features of the optimization problem to be solved, which can be formulated as a nonsmooth nonconvex minimization problem. To tackle this problem, we use the quasi-Newton method described in Yu et al. [6]. The main contributions of this work are:

  • 1.

    We outline an implementation of a variant of S3VM Joachims [3] designed for linear semi-supervised classification on large datasets. Compared to the state-of-the-art large scale semi-supervised learning techniques described in Sindhwani et al. [5], [10], our method effectively exploits data sparsity and the linearity of the problem to provide superior scalability. The improved generalization performance and training speed make the proposed scheme a feasible tool for large scale applications.

  • 2.

    We outline an implementation of a variant of S3VM Joachims [3]; Collobert et al. [10] designed for non-linear semi-supervised classification on datasets where it is difficult to find a linear decision boundary in the input space.

  • 3.

    We conducted an experimental study on many binary classification tasks with several thousand examples and features. This study clearly shows the usefulness of our algorithm for large scale semi-supervised classification.

  • 4.

    In summary, this work provides an efficient scheme for semi-supervised learning based on both linear and non-linear SVMs.

This paper is organized as follows: Section 2 analyzes the S3VM objective function and studies its characteristics. In Section 3 we describe the quasi-Newton method for nonsmooth convex functions. In Section 4 we present the S3VM implementation using the quasi-Newton method. Section 5 compares our work with other recent efforts in this area. Experimental results are reported in Section 6. Section 7 contains some concluding comments.

Throughout the paper, we adopt the following notation: we denote by $\|\cdot\|$ the Euclidean norm in $\mathbb{R}^d$ and by $a^T b$ or $a\cdot b$ the inner product of the vectors $a$ and $b$. Moreover, the subdifferential of a convex function $f$ at any point $a$ is denoted by $\partial f(a)$. We recall that the subdifferential of a convex function $f$ at a point $a$ is the set of subgradients of $f$ at $a$, that is, the set of vectors $g\in\mathbb{R}^d$ satisfying the subgradient inequality $f(b)\ge f(a)+g^T(b-a)\ \forall b\in\mathbb{R}^d$.
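As a concrete illustration (ours, not taken from the paper): for the hinge loss $f(a)=\max(0,1-a)$ on $\mathbb{R}$, the subdifferential is

$$\partial f(a)=\begin{cases}\{-1\} & \text{if } a<1,\\ [-1,0] & \text{if } a=1,\\ \{0\} & \text{if } a>1,\end{cases}$$

so at the kink $a=1$ every $g\in[-1,0]$ satisfies the subgradient inequality above. Handling such sets of subgradients, rather than a single gradient, is exactly what the nonsmooth quasi-Newton method of Yu et al. [6] is designed for.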

Section snippets

Semi-supervised SVMs

We consider the problem of binary classification. The training set consists of $l$ labeled examples $\{x_i,y_i\}_{i=1}^{l}$, $y_i\in\{-1,+1\}$, and $u$ unlabeled examples $\{x_i\}_{i=l+1}^{n}$, with $n=l+u$; typically $l\ll u$, and $x_i\in\mathbb{R}^d$. Our goal is to construct a classifier that utilizes unlabeled data and gives better generalization performance.

S3VM appends an additional term in the SVM objective function whose role is to find a hyperplane far away from both the labeled and the unlabeled points. Variants of this idea have appeared
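A standard way to obtain such a term (a sketch of the usual construction, see e.g. Chapelle et al. [2], rather than a formula quoted from this paper): minimize the hinge loss over the unknown label $y\in\{-1,+1\}$ of each unlabeled point in closed form, which yields the symmetric "hat" loss

$$\min_{y\in\{-1,+1\}}\max\bigl(0,\,1-y\,w^T x\bigr)=\max\bigl(0,\,1-|w^T x|\bigr).$$

This loss is large exactly when an unlabeled point falls inside the margin, so minimizing it drives the hyperplane away from unlabeled points as described above.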

Quasi-Newton approach for nonsmooth convex function

The (L)BFGS quasi-Newton method uses an approximation to the inverse Hessian in place of the true inverse that is required in Newton's method. The approximate Hessian is built up on the basis of information gathered during the descent process. (L)BFGS forms a local quadratic model of a smooth objective function $L:\mathbb{R}^d\to\mathbb{R}$ at the current iterate $q_t\in\mathbb{R}^d$:

$$M_t(q)=L(q_t)+\nabla L(q_t)^T(q-q_t)+\tfrac{1}{2}(q-q_t)^T B_t^{-1}(q-q_t),$$

where $B_t\approx H_t^{-1}$ is a symmetric positive definite matrix. Minimizing $M_t(q)$ gives the quasi-Newton direction $p_t=-B_t\nabla L(q_t)$.
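For reference, in LBFGS the direction $p_t=-B_t\nabla L(q_t)$ is never computed by storing $B_t$ explicitly; it is obtained from the last $m$ correction pairs $s_k=q_{k+1}-q_k$, $y_k=\nabla L(q_{k+1})-\nabla L(q_k)$ via the standard two-loop recursion (see Nocedal et al. in the references). Below is a minimal NumPy sketch of that recursion; the function and variable names are ours, and the subLBFGS variant of Yu et al. [6] additionally replaces the gradient by a suitably chosen subgradient and modifies the line search for nonsmooth objectives.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns p = -B @ grad, where B approximates the
    inverse Hessian from the stored correction pairs (oldest first)."""
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: walk the pairs from newest to oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * s.dot(q)
        alphas.append(alpha)
        q -= alpha * y
    # Scale by a diagonal initial inverse Hessian (a common heuristic).
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q *= s.dot(y) / y.dot(y)
    # Second loop: walk the pairs from oldest to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * y.dot(q)
        q += (alpha - beta) * s
    return -q  # quasi-Newton descent direction p_t
```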

Solving S3VM using subLBFGS

In this section we show how subLBFGS can be applied to problem (1). A similar approach can be used for problem (4). Differentiating Eq. (1) (using the concept of a subgradient) after plugging in $l_1, l_2$ yields

$$\partial J(w)=\lambda w-\frac{1}{l}\sum_{i=1}^{l}\beta_i y_i x_i-\frac{\lambda'}{u}\sum_{i=l+1}^{n}\beta_i y_i x_i=\bar{w}-\frac{1}{l}\sum_{i\in M}\beta_i y_i x_i-\frac{\lambda'}{u}\sum_{i\in M}\beta_i y_i x_i,$$

where $\bar{w}\triangleq\lambda w-\frac{1}{l}\sum_{i\in E}y_i x_i-\frac{\lambda'}{u}\sum_{i\in E}y_i x_i$, and

$$\beta_i=\begin{cases}1 & \text{if } i\in E,\quad E\triangleq\{i: 1-y_i w^T x_i>0\},\\ \psi,\ \psi\in(0,1), & \text{if } i\in M,\quad M\triangleq\{i: 1-y_i w^T x_i=0\},\\ 0 & \text{if } i\in W,\quad W\triangleq\{i: 1-y_i w^T x_i<0\},\end{cases}$$

where $E$, $M$, and $W$ denote the sets of points which are in error, on the margin, and well classified, respectively.
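To make the case analysis explicit, here is a minimal NumPy sketch of one such subgradient evaluation under our reading of the formula above. The function name, the choice $\psi=0.5$, the tolerance, and the use of the currently imputed labels $\mathrm{sign}(w^T x_i)$ for the unlabeled points are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def s3vm_subgradient(w, X_lab, y_lab, X_unlab, lam, lam_u, psi=0.5, tol=1e-12):
    """One subgradient of J(w), splitting points into the error set E
    (margin violation > 0), margin set M (== 0), and well-classified set W."""
    l, u = X_lab.shape[0], X_unlab.shape[0]
    # Impute labels for unlabeled points from the current hyperplane.
    y_unlab = np.sign(X_unlab @ w)
    y_unlab[y_unlab == 0] = 1.0

    g = lam * w.copy()
    for X, y, coef in ((X_lab, y_lab, 1.0 / l), (X_unlab, y_unlab, lam_u / u)):
        margin = 1.0 - y * (X @ w)
        beta = np.where(margin > tol, 1.0,            # set E: in error
               np.where(np.abs(margin) <= tol, psi,   # set M: on the margin
                        0.0))                         # set W: well classified
        g -= coef * (X.T @ (beta * y))
    return g
```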

Related work

Finding the exact S3VM solution is NP-hard. For this reason, major effort has focused on efficient approximation algorithms. Early algorithms Chapelle et al. [2], Bennett et al. [8], Fung et al. [9], and Chapelle et al. [15] could not handle more than a few hundred unlabeled examples, or at least were not demonstrated on more in experiments. Some of the S3VM algorithms that have been applied to large scale problems, including those we compare against in this work, are briefly discussed below. For the linear case we

Linear case

Semi-supervised learning experiments were conducted to test the algorithm on five binary classification problems. The dataset characteristics are listed in Table 1. The CCAT dataset poses the problem of separating corporate-related articles; it is the top-level category in the RCV1 training dataset Lewis et al. [16]. The AUT-AVN binary classification dataset comes from a collection of UseNet articles from two discussion groups,

Discussion and conclusion

In this paper, we have applied the modified quasi-Newton method to semi-supervised classification problems, primarily in the context of S3VM. Specifically, our approach has the following major advantages:

  • 1.

    The S3VM classifier can be trained on a large number of labeled and unlabeled examples.

  • 2.

    The proposed algorithm outperforms other methods on a number of different datasets in terms of generalization performance and training time. On the AUT-AVN and ISOLET datasets, the algorithm yields accuracy

References (20)

  • G. Kimeldorf et al., Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis (1971)
  • C.J.C. Burges, A tutorial on support vector machines, Data Mining and Knowledge Discovery (1998)
  • O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Tenth International Workshop on...
  • T. Joachims, Transductive inference for text classification using support vector machines, in: International Conference...
  • S.S. Keerthi et al., A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research (2005)
  • V. Sindhwani, S.S. Keerthi, Large scale semi-supervised linear SVMs, in: SIGIR,...
  • J. Yu, S.V.N. Vishwanathan, S. Gunter, N.N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization, in:...
  • J. Nocedal et al., Numerical Optimization, Springer Series in Operations Research (1999)
  • K. Bennett, A. Demiriz, Semi-supervised support vector machines, in: Advances in Neural Information Processing Systems,...
  • G. Fung et al., Semi-supervised support vector machines for unlabeled data classification, Optimization Methods and Software (2001)

There are more references available in the full text version of this article.

Cited by (33)

  • A multi-scheme semi-supervised regression approach

    2019, Pattern Recognition Letters
    Citation excerpt:

    The optimization problems that appear here are non-convex, letting various strategies to be followed for reaching efficient solutions, both concerning the accuracy and the time of convergence. Mixed integer programs, concave minimization methods, quasi-Newton for non-smooth convex functions are some of the recorded strategies [24]. Although the proposed approach constitutes a regression wrapper, that theoretically could adopt any from the existing SSL schemes, it is inspired by a very well-known semi-supervised classification technique, the heuristic approach of Self-training.

  • Towards designing risk-based safe Laplacian Regularized Least Squares

    2016, Expert Systems with Applications
    Citation excerpt:

    Generally speaking, SSL utilizes the following assumptions on the data space: (1) smoothness; (2) cluster; (3) manifold; and (4) disagreement. Many algorithms (Adankon and Cheriet, 2011; Belkin, Niyogi, and Sindhwani, 2006; Blum and Mitchell, 1998; Reddy, Shevade, and Murty, 2011; Zhou and Li, 2005) have been proposed and achieved the encouraging performance using one or more assumptions in many tasks. Among these assumptions, manifold regularization (Belkin et al., 2006; Gan, Sang, & Chen, 2013) based methods have received much attention which exploit the intrinsic manifold structure of both labeled and unlabeled data.

  • The responsibility weighted Mahalanobis kernel for semi-supervised training of support vector machines for classification

    2015, Information Sciences
    Citation excerpt:

    This conclusion is underlying the category of low-density separation methods that try to place decision boundaries into lower density regions. One of the most frequently used algorithms in this class are transductive SVM [16] and their various implementations, e.g., TSVM [17] and S3VM [18–21]. The second assumption, called manifold assumption [22], claims that the marginal distribution underlying the data can be described by means of a manifold of much lower dimension than the input space, so that the distances and densities defined on this manifold can be used for learning [22].


Sathish Reddy received his M.E. in Computer Science and Engineering from the Indian Institute of Science, Bangalore, India, in 2009. Since September 2009, he has been working as a researcher at the Centre for Global Logistics and Manufacturing Strategies, Indian School of Business, Hyderabad, India. His research interests include Data Mining, Artificial Intelligence, Robotics, and Machine Learning.

Shirish Shevade (Ph.D. in Computer Science, Indian Institute of Science (IISc), Bangalore, India) is a Principal Research Scientist at IISc. His research interests span many areas of Machine Learning such as support vector machines, Gaussian processes and semi-supervised learning. He has been on the programme committee of several international conferences such as IEEE ICDM and PAKDD. He is a Senior Member of IEEE.

Narasimha Murty received his Ph.D. from the Indian Institute of Science, Bangalore, India, in 1982. He is currently a Professor (Chairman) in the Department of Computer Science and Automation (CSA) at the Indian Institute of Science. He has guided 18 Ph.D. students in the areas of Pattern Recognition and Data Mining and has published around 125 papers in various journals and conference proceedings in these areas. His co-authored work, Data Clustering: A Review (ACM Computing Surveys, 1999), is among the most cited and downloaded computing survey articles. He is an elected Fellow of the Indian National Academy of Engineering (FNAE). In the past, he has worked on several Indo-US projects and visited Michigan State University, East Lansing, USA, and the University of Dauphine, Paris. His research interests include Data Mining and Pattern Clustering.
