Neurocomputing

Volume 153, 4 April 2015, Pages 62-76

Sparse semi-supervised support vector machines by DC programming and DCA

https://doi.org/10.1016/j.neucom.2014.11.051

Highlights

  • Two approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) are investigated for feature selection in S3VM using the zero norm.

  • The first is a DC approximation approach that approximates the zero norm by a DC function. Five approximations are considered.

  • The second is an exact reformulation approach based on exact penalty techniques in DC programming.

  • All the resulting problems are DC programs for which six versions of DCA are developed.

  • Careful empirical experiments on several benchmark datasets are performed.

Abstract

This paper studies the problem of feature selection in the context of the Semi-Supervised Support Vector Machine (S3VM). The zero norm, a natural concept for dealing with sparsity, is used for the feature selection purpose. Due to two nonconvex terms (the loss function of the unlabeled data and the ℓ0 term), we are faced with an NP-hard optimization problem. Two continuous approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) are developed. The first is a DC approximation approach that approximates the ℓ0-norm by a DC function. The second is an exact reformulation approach based on exact penalty techniques in DC programming. All the resulting optimization problems are DC programs for which DCA schemes are investigated. Several usual sparsity-inducing functions are considered, and six versions of DCA are developed. Numerical experiments on several benchmark datasets show the efficiency of the proposed algorithms, in both feature selection and classification.

Introduction

In machine learning, supervised learning is the task of inferring a predictor function (classifier) from a labeled training dataset. Each example in the training set consists of an input object and a label. The objective is to build a predictor function which can be used to identify the label of new examples with the highest possible accuracy. Nevertheless, in most real-world applications, a large portion of the training data is unlabeled, and supervised learning cannot be used in these contexts. To deal with this difficulty, semi-supervised learning methods have recently attracted increasing attention in the machine learning community. In contrast to supervised methods, semi-supervised learning methods take into account both labeled and unlabeled data to construct prediction models.

We are interested in semi-supervised classification, more precisely, in the so-called Semi-Supervised Support Vector Machine (S3VM). Among semi-supervised classification methods, the large margin approach S3VM, which extends the Support Vector Machine (SVM) to the semi-supervised setting, is certainly the most popular [5], [6], [7], [8], [9], [10], [11], [12], [13], [37], [39], [41], [47], [54], [59]. An extensive overview of semi-supervised classification methods can be found in [61]. S3VM was originally proposed by Vapnik and Sterin in 1977 [57] under the name of transductive support vector machine. Later, in 1999, Bennett and Demiriz [2] proposed the first optimization formulation of S3VM, which is described as follows.

Given a training set which consists of m labeled points $\{(x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\},\ i = 1, \ldots, m\}$ and p unlabeled points $\{x_i \in \mathbb{R}^n,\ i = m+1, \ldots, m+p\}$, we are to find a separating hyperplane $P = \{x \mid x \in \mathbb{R}^n,\ x^T w = b\}$, far away from both the labeled and unlabeled points. Hence, the optimization problem of S3VM takes the form
$$\min_{w,b}\ \|w\|_2^2 + \alpha \sum_{i=1}^{m} L\big(y_i(\langle w, x_i\rangle + b)\big) + \beta \sum_{i=m+1}^{m+p} L\big(|\langle w, x_i\rangle + b|\big). \qquad (1)$$
Here, the first two terms define a standard SVM while the third one incorporates the loss function of the unlabeled data points. The losses of labeled and unlabeled data points are weighted by penalty parameters $\alpha > 0$ and $\beta > 0$, respectively. Usually, in the classical SVM one uses the hinge loss function $L(u) = \max\{0, 1-u\}$, which is convex. On the contrary, problem (1) is nonconvex, due to the nonconvexity of its third term.
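To make formulation (1) concrete, the following minimal NumPy sketch evaluates the S3VM objective with the hinge loss for both the labeled and unlabeled terms; the function and variable names (s3vm_objective, X_lab, X_unl) are ours, introduced for illustration only.

    import numpy as np

    def hinge(u):
        # Hinge loss L(u) = max(0, 1 - u), applied componentwise.
        return np.maximum(0.0, 1.0 - u)

    def s3vm_objective(w, b, X_lab, y_lab, X_unl, alpha, beta):
        # ||w||_2^2 + alpha * hinge loss on labeled points
        #           + beta * symmetric hinge loss on unlabeled points, as in (1).
        margins_lab = y_lab * (X_lab @ w + b)   # y_i (<w, x_i> + b)
        margins_unl = np.abs(X_unl @ w + b)     # |<w, x_i> + b|
        return (np.dot(w, w)
                + alpha * hinge(margins_lab).sum()
                + beta * hinge(margins_unl).sum())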

There are two broad strategies for solving the optimization problem (1) of S3VM: combinatorial methods (Mixed Integer Programming [2], Branch and Bound algorithm [6]) and continuous optimization methods such as the self-labeling heuristic S3VMlight [15], gradient descent [5], deterministic annealing [53], semi-definite programming [4], and DC programming [9]. Combinatorial methods are not suitable for the massive datasets arising in real applications (high dimension and large sample size). Thus, major efforts have focused on efficient local algorithms. For more complete reviews of S3VM methods, the reader is referred to [7], [61] and references therein.

On the other hand, feature selection is one of the fundamental problems in machine learning. In many areas of application, such as text classification, web mining, gene expression and micro-array analysis, combinatorial chemistry, and image analysis, datasets contain a large number of features, many of which are irrelevant or redundant. Feature selection is often applied to high-dimensional data prior to classification learning. The main goal is to select a subset of features of a given dataset while preserving or improving the discriminative ability of the classifier. Several feature-selection methods for SVMs have been proposed in the literature (see e.g. [3], [11], [17], [19], [29], [30], [33], [40], [46], [49], [62], [63]).

This paper deals with feature selection in the context of S3VM. We are to find a separating hyperplane, far away from both the labeled and unlabeled points, that uses the least number of features. As in standard approaches for feature selection in SVM, here we use the zero norm, a natural concept for dealing with sparsity. The zero norm of a vector, denoted ℓ0 or $\|\cdot\|_0$, is defined as the number of its nonzero components. Similarly to the sparse SVM, we replace the term $\|w\|_2^2$ in (1) by the ℓ0-norm and then formulate the feature selection S3VM problem as follows:
$$\min_{w,b}\ \|w\|_0 + \alpha \sum_{i=1}^{m} L\big(y_i(\langle w, x_i\rangle + b)\big) + \beta \sum_{i=m+1}^{m+p} L\big(|\langle w, x_i\rangle + b|\big). \qquad (2)$$

While S3VM has been widely studied, few methods exist in the literature for feature selection in S3VM. Due to the discontinuity of the ℓ0 term and the nonconvexity of the third term, we face a "double" difficulty in (2) (it is well known that the problem of minimizing the zero norm is NP-hard [1]).

During the last two decades, research on optimization models and methods involving the zero norm has been very active. Works can be divided into three categories according to the way the zero norm is treated: convex approximation (the ℓ0-norm is replaced by a convex function, for instance the ℓ1-norm [55] or the conjugate function [43]), nonconvex approximation (a continuous nonconvex function is used in place of the ℓ0-norm; the usual sparsity-inducing functions are introduced in [3], [11], [32], [45], [63]), and nonconvex exact reformulation (with binary variables $u_i = 0$ if $w_i = 0$ and $u_i = 1$ otherwise, the original problem is formulated as a combinatorial optimization problem, which is equivalently reformulated as a continuous nonconvex program via exact penalty techniques; see [33], [56]). An extensive overview of these approaches can be found in [33]. When the objective function (besides the ℓ0 term) is convex, convex approximation techniques result in a convex optimization problem, which is relatively "easy" to solve. Unfortunately, for S3VM, the problem (2) remains nonconvex under any approximation, convex or nonconvex, of the ℓ0-norm.
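As a point of comparison between these categories, the sketch below evaluates a convex surrogate (the ℓ1-norm) and two common nonconvex surrogates of the zero norm on a weight vector; the parameter a and the exact scalings are illustrative assumptions, as the definitions used in the paper may be parametrized differently.

    import numpy as np

    def l1_norm(w):
        # Convex surrogate: ||w||_1.
        return np.abs(w).sum()

    def piecewise_exponential(w, a=5.0):
        # Piecewise exponential surrogate: sum_i (1 - exp(-a|w_i|));
        # larger a brings the surrogate closer to the 0/1 step of ||.||_0.
        return (1.0 - np.exp(-a * np.abs(w))).sum()

    def capped_l1(w, a=5.0):
        # Piecewise linear (capped-l1) surrogate: sum_i min(1, a|w_i|).
        return np.minimum(1.0, a * np.abs(w)).sum()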

In this work, we tackle the problem (2) by nonconvex approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms), powerful tools in the nonconvex programming framework. Our motivating arguments for using the ℓ0-norm are multiple. Firstly, even though the ℓ1-norm is the simplest way to deal with sparsity, it encourages sparsity only in some cases, under restrictive assumptions (see [14]). In particular, for the feature selection purpose, the ℓ1 penalty has been shown to be, in certain cases, inconsistent and biased [60]. Secondly, the ℓ0-norm is the most natural and suitable concept for modeling sparsity, and nonconvex approximations of the ℓ0-norm are, in general, deeper than the ℓ1-norm and can therefore produce better sparsity. In particular, for feature selection in SVM, solutions of the ℓ0-norm penalty problem have been shown to be much sparser than those of the ℓ1-norm approach in several previous works [3], [29], [33], [40]. Thirdly, although we are faced with a "double" difficulty in the problem (2) because of the ℓ0-norm and the third term, the power of DCA can be exploited to efficiently solve this hard problem, knowing that DCA has been successfully developed in a variety of works in Machine Learning (see e.g. [9], [20], [27], [28], [38], [48], [50] and the list of references in [36]), in particular for feature selection in SVM [29], [30], [33], [34], [35], [40], [46].

Paper's contributions: We develop a unified approach based on DC programming and DCA for solving the nonconvex optimization problem (2). The first work in this research direction was published in the conference paper [22], where we considered the piecewise exponential approximation [3] of the ℓ0-norm and presented a DC formulation, its corresponding DCA for solving the resulting problem, as well as some preliminary numerical experiments. In this paper, we carefully explore and exploit DCA based approaches to deal with the zero norm and sparse S3VM. Firstly, several DC approximations of the ℓ0-norm will be applied to (2): the logarithm function [63], Smoothly Clipped Absolute Deviation (SCAD) [11], piecewise exponential [3], DC polyhedral [45] and piecewise linear [32] approximations. Secondly, inspired by the technique in [33], [56], we equivalently formulate (2) as a combinatorial optimization problem. Then, thanks to the new result on exact penalty techniques recently developed in [31], we reformulate the resulting problem as a continuous optimization problem and investigate DCA to solve it. Finally, we provide an empirical evaluation of all proposed algorithms on several benchmark datasets, to study their efficiency in both feature selection and classification.

The remainder of the paper is organized as follows. DC programming and DCA are briefly presented in Section 2 while Section 3 is devoted to the development of DCA for solving the feature selection S3VM problem (2). Computational experiments are reported in Section 4 and finally Section 5 concludes the paper.

Section snippets

Outline of DC programming and DCA

DC programming and DCA constitute the backbone of smooth/nonsmooth nonconvex programming and global optimization. They address the problem of minimizing a function f which is the difference of two convex functions on the whole space $\mathbb{R}^d$ or on a convex set $C \subset \mathbb{R}^d$. Generally speaking, a DC program is an optimization problem of the form
$$\alpha = \inf\{f(x) := g(x) - h(x) : x \in \mathbb{R}^d\} \qquad (P_{dc})$$
where g, h are lower semi-continuous proper convex functions on $\mathbb{R}^d$. Such a function f is called a DC function, and $g - h$ a DC decomposition of f.
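In scheme form, each DCA iteration linearizes the concave part $-h$ at the current point and minimizes the resulting convex majorant. A minimal sketch, assuming two problem-specific callables are supplied: grad_h, returning some $y^k \in \partial h(x^k)$, and solve_subproblem, returning a minimizer of the convex program $\min_x \{g(x) - \langle x, y^k\rangle\}$.

    import numpy as np

    def dca(grad_h, solve_subproblem, x0, max_iter=100, tol=1e-6):
        # Generic DCA for min f(x) = g(x) - h(x).
        x = x0
        for _ in range(max_iter):
            y = grad_h(x)               # y^k in the subdifferential of h at x^k
            x_new = solve_subproblem(y) # x^{k+1} in argmin_x g(x) - <x, y^k>
            # Stop when successive iterates are (relatively) close.
            if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
                return x_new
            x = x_new
        return x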

Feature selection in S3VM by DC programming and DCA

Assume that the m labeled points and the p unlabeled points are represented by the matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times n}$, respectively. Let D be the $m \times m$ diagonal matrix with $D_{ii} = y_i$, $i = 1, \ldots, m$. Denote by e the vector of ones in the appropriate vector space. For each labeled point $x_i$ ($i = 1, \ldots, m$), we introduce a new variable $\xi_i$ which represents the misclassification error. Similarly, for each unlabeled point $x_i$ ($i = m+1, \ldots, m+p$), we define $r_i$ and $s_i$ for the two possible misclassification errors. Let r and s be vectors in $\mathbb{R}^p$
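With this notation, the slack-variable form of (2) can be sketched as follows, following the standard Bennett-Demiriz construction (the paper's exact constraints and sign conventions may differ slightly):

\begin{align*}
\min_{w,b,\xi,r,s}\quad & \|w\|_0 + \alpha\, e^{T}\xi + \beta\, e^{T}\min\{r,s\}\\
\text{s.t.}\quad & D(Aw + be) + \xi \ge e, \quad \xi \ge 0,\\
& (Bw + be) + r \ge e, \quad r \ge 0,\\
& -(Bw + be) + s \ge e, \quad s \ge 0,
\end{align*}

where the minimum $\min\{r,s\}$ is taken componentwise.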

Datasets

We perform numerical experiments on two groups of datasets. The first one contains several real-world datasets taken from the UCI Machine Learning Repository and the NIPS 2003 Feature Selection Challenge. The second is the CBCL Face Database, taken from the MIT Center for Biological and Computational Learning (http://cbcl.mit.edu/projects/cbcl/software-datasets/).

The CBCL Face Database is considered as an instance of the face detection problem, which arises in numerous application areas including

Conclusion

We have intensively investigated a unified DC programming approach for feature selection in the context of the Semi-Supervised Support Vector Machine. Using five different approximations and the continuous exact reformulation via an exact penalty technique leads us to six DC programs, and we developed six DCA based algorithms for solving the resulting problems. Numerical results on several real datasets showed the robustness and effectiveness of the DCA based schemes. We are convinced that DCA is a

Acknowledgments

The authors acknowledge the support of the "Fonds Européen de Développement Régional" (FEDER) of Lorraine under the project INNOMAD.


References (63)

  • O. Chapelle et al., Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. (2008)
  • R. Collobert et al., Large scale transductive SVMs, J. Mach. Learn. Res. (2006)
  • W. Emara, M.K.M. Karnstedt, K. Sattler, D. Habich, W. Lehner, An approach for incremental semi-supervised SVM, in: ...
  • J. Fan et al., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. (2001)
  • G. Fung et al., Semi-supervised support vector machines for unlabeled data classification, Optim. Methods Softw. (2001)
  • F. Gieseke, A. Airola, T. Pahikkala, O. Kramer, Sparse quasi-Newton optimization for semi-supervised support vector ...
  • R. Gribonval et al., Sparse representations in unions of bases, IEEE Trans. Inf. Theory (2003)
  • T. Joachims, Transductive inference for text classification using support vector machines, in: 16th International ...
  • B. Heisele, P. Ho, T. Poggio, Face recognition with support vector machines: global versus component-based approach, ...
  • L. Hermes, J.M. Buhmann, Feature selection for support vector machines, in: Proceedings of the 15th International Conference ...
  • H. Zou, The adaptive lasso and its oracle properties, J. Am. Stat. Assoc. (2006)
  • N. Krause, Y. Singer, Leveraging the margin more carefully, in: Proceedings of ICML '04, NY, USA, 2004, pp. ...
  • I. Kukenys, Human face detection with support vector machines (Thesis), University of Otago, ...
  • H.M. Le, H.A. Le Thi, M.C. Nguyen, DCA based algorithms for feature selection in semi-supervised support vector ...
  • H.A. Le Thi, Analyse numérique des algorithmes de l'Optimisation d.c. Approches locales et globales, Codes et ...
  • H.A. Le Thi, Contribution à l'optimisation non convexe et l'optimisation globale: Théorie, Algorithmes et Applications, ...
  • H.A. Le Thi et al., Solving a class of linearly constrained indefinite quadratic problems by DC algorithms, J. Glob. Optim. (1997)
  • H.A. Le Thi et al., The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Ann. Oper. Res. (2005)
  • H.A. Le Thi et al., A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim. (2006)
  • H.A. Le Thi et al., Optimization based DC programming and DCA for hierarchical clustering, Eur. J. Oper. Res. (2007)
  • H.A. Le Thi et al., A DC programming approach for feature selection in support vector machines learning, Adv. Data Anal. Classif. (2008)

Hoai Minh Le earned his Ph.D. in Computer Science from the University Paul Verlaine-Metz, France in 2007. From 2008 to 2009, he was a post-doctoral researcher at the University of Le Havre, France. Since 2010, he has been a researcher at the Laboratory of Theoretical & Applied Computer Science, University of Lorraine, France. He recently joined the Modeling and Optimization of Complex Systems Research Group, Ton Duc Thang University, Ho Chi Minh City, Vietnam. His research interests include non-convex optimization in Machine Learning (supervised/unsupervised classification, feature selection...), Scheduling and Cryptography. He is the author/co-author of more than 25 journal articles, international conference papers and book chapters, and is a reviewer for several international journals/conferences.

Hoai An Le Thi obtained her PhD with Distinction in Optimization in 1994 and her Habilitation in 1997, both from the University of Rouen, France. She is currently the Director of the Laboratory of Theoretical & Applied Computer Science, University of Lorraine, and serves as a full Professor of Exceptional Class. She is the author/co-author of more than 180 journal articles, international conference papers and book chapters, and the co-editor of 5 books and 12 special issues of international journals. She has served as president of the Scientific Committee and president of the Organizing Committee, as well as a member of the Scientific Committee, of various international conferences, and has led several regional/national/international projects. Her research interests include Machine Learning, Optimization and Operations Research and their applications in Information Systems and various complex Industrial Systems. She is the co-founder (with Pham Dinh Tao) of DC programming and DCA, an innovative approach in non-convex programming.

Manh Cuong Nguyen has been an assistant professor at Hanoi University of Industry since 2004. He earned his Ph.D. in May 2014 from the University of Lorraine, France. He is currently a postdoctoral researcher in an R&D project at the Laboratory of Theoretical & Applied Computer Science, University of Lorraine. His research interests include issues related to nonconvex optimization applied to data mining and machine learning, such as community detection in large networks, Kohonen maps, feature selection in support vector machines, and recommender systems.
